Chu-Ren Huang, Institute of Linguistics, Academia Sinica

Taiwan's NDAP Language Archives Project: From bronze inscription texts to Austronesian field recording

Cui-Xia Weng, Ru-Yng Chang, Elizabeth Zeitoun, Chao-Jun Chen,
Derming Juang, Chu-Ren Huang, and Chin-chuan Cheng

The Language Archives Project is part of Taiwan's National Digital
Archives Program (NDAP). The project digitizes and archives a
wide range of linguistic data, from heritage texts to endangered
Formosan languages. Its goal is twofold: to preserve unique
cultural heritage and to provide a comprehensive linguistic
infrastructure that supports the interpretation of archive content.
Given these two goals, the project faces three main challenges: to
provide a versatile yet uniform presentation of different text types,
to account for language change, and to account for language
variation.

We take two archives with contrasting characteristics to illustrate
how these challenges are met. The Bronze Inscription archives deal
with an archaic language preserved in a written form that differs
significantly from Modern Chinese writing. The Formosan (i.e. Taiwan
Austronesian) archives deal with indigenous languages that are
endangered and have no established writing conventions. We show how
the OLAC Metadata Set (OLACMS) lays the common ground for content
documentation of these contrasting archives.
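
As a rough illustration of what such common ground could look like, the sketch below describes one item from each archive with the same small set of elements. The element names follow Dublin Core, on which OLAC metadata builds; the ArchiveRecord class, the shared_elements helper, the placeholder titles, and the choice of language codes are assumptions made for this example and are not taken from the project.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ArchiveRecord:
    """One archive item described with a few Dublin Core style elements.

    Hypothetical structure for illustration; not the project's schema.
    """
    title: str
    language: str            # ISO 639-3 code (assumed choice of code set)
    type: str                # DCMI Type term such as "Text" or "Sound"
    description: str = ""
    subject: list = field(default_factory=list)


def shared_elements(records):
    """Return the element names that are filled in for every record."""
    filled = [
        {name for name, value in asdict(record).items() if value}
        for record in records
    ]
    return sorted(set.intersection(*filled)) if filled else []


# Placeholder items, one per archive.
inscription = ArchiveRecord(
    title="Bronze inscription (placeholder item)",
    language="och",          # Old Chinese, assumed for this example
    type="Text",
    subject=["epigraphy"],
)
recording = ArchiveRecord(
    title="Field recording (placeholder item)",
    language="pwn",          # Paiwan, assumed for this example
    type="Sound",
)

common = shared_elements([inscription, recording])
```

Here common lists the elements that both records fill in, which is the part of the description that a single search interface could rely on across the two archives.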

First, for the Bronze Inscription Archives, the fundamental issue is
how to represent the archaic inscribed written form while also
establishing direct correspondences with modern writing systems. We
adopt the Intelligent Character Encoding Scheme to deal with this
issue. Although glyph forms vary greatly, the composition of Chinese
characters from basic glyphs remains regular. Hence an encoding
scheme based on the composition of basic glyphs will not only help
with diachronic Chinese archives but can also deal with cross-lingual
variations (e.g. Korean Hanja, Japanese Kanji, new characters from
Hong Kong, etc.).
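
The idea of encoding by composition can be sketched as follows. This is not the Intelligent Character Encoding Scheme itself, whose format is not described here; it is a minimal stand-in that uses Unicode Ideographic Description Characters as layout operators. The DECOMPOSITION table and the encode and chars_containing helpers are hypothetical names introduced for the example.

```python
# Unicode Ideographic Description Characters used as layout operators.
LEFT_RIGHT = "\u2FF0"   # component 1 to the left of component 2
TOP_BOTTOM = "\u2FF1"   # component 1 above component 2

# Hypothetical decomposition table: character -> (operator, part, part).
# Characters absent from the table are treated as basic glyphs.
DECOMPOSITION = {
    "明": (LEFT_RIGHT, "日", "月"),
    "林": (LEFT_RIGHT, "木", "木"),
    "森": (TOP_BOTTOM, "木", "林"),
}


def encode(char):
    """Expand a character recursively into operators and basic glyphs."""
    if char not in DECOMPOSITION:
        return char
    operator, first, second = DECOMPOSITION[char]
    return operator + encode(first) + encode(second)


def chars_containing(glyph):
    """List table characters whose expansion contains the basic glyph."""
    return [char for char in DECOMPOSITION if glyph in encode(char)]


forest_code = encode("森")
wood_family = chars_containing("木")
```

Because the key is built from components rather than from a fixed code point, an archaic form and a modern form with the same composition would receive the same key, and a character with no code point of its own can still be described.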

Second, the Formosan languages are indigenous languages of
Taiwan that are also thought to be close to the common ancestor of
the Austronesian languages. The first issue we face is establishing
an orthography, which is solved by the common use of IPA among field
linguists. The second issue is establishing segmentation and tagging
standards. The third issue is the audio representation of field
recordings. The last issue is mapping the lexicon to a GIS
(geographic information system) to represent language variation and
contrasts.
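
One way to picture the lexicon-to-GIS mapping is to attach each lexical entry to a point and export the result in a format that GIS tools read. The sketch below uses GeoJSON for that purpose, which is an assumption of this example and not a statement about the project's system. The entries, language labels, forms, and coordinates are placeholders, not field data.

```python
import json
from collections import defaultdict

# Placeholder lexicon entries; forms and coordinates are invented.
ENTRIES = [
    {"concept": "eye", "language": "Language A", "ipa": "mata",
     "lon": 121.0, "lat": 23.5},
    {"concept": "eye", "language": "Language B", "ipa": "maca",
     "lon": 120.8, "lat": 22.4},
    {"concept": "eye", "language": "Language C", "ipa": "mata",
     "lon": 121.4, "lat": 24.6},
]


def to_feature(entry):
    """Convert one lexicon entry into a GeoJSON Point feature."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON orders coordinates as longitude, then latitude.
            "coordinates": [entry["lon"], entry["lat"]],
        },
        "properties": {
            "concept": entry["concept"],
            "language": entry["language"],
            "ipa": entry["ipa"],
        },
    }


def to_feature_collection(entries):
    """Wrap all entries in a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(entry) for entry in entries],
    }


def forms_by_concept(entries):
    """Group IPA forms by concept so variants can be compared."""
    grouped = defaultdict(dict)
    for entry in entries:
        grouped[entry["concept"]][entry["language"]] = entry["ipa"]
    return dict(grouped)


collection = to_feature_collection(ENTRIES)
variants = forms_by_concept(ENTRIES)
# ensure_ascii=False keeps IPA symbols readable in the output file.
geojson_text = json.dumps(collection, ensure_ascii=False, indent=2)
```

Plotting the features by their ipa property would show where each variant of a concept is used, while variants gives the same contrast in tabular form.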