Lexicons and Word Lists
Lexicons and word lists constitute an important type of language documentation. For many extinct languages, word lists are all that remain. It is therefore crucial that this lexical data be put into a long-lasting digital form before all record of these languages disappears completely.
Many factors make language documentation vulnerable to loss or corruption. Digital materials may become inaccessible if they are saved in closed proprietary formats that can be read by only a few pieces of software, which are themselves rapidly becoming obsolete. Physical materials may be lost or accidentally destroyed, or may be unintelligible because of a quirky annotation style or even bad handwriting. Material produced by field linguists is especially in jeopardy because, unless it has been published commercially, there are likely to be very few copies.
Documentation produced according to best practices avoids or minimizes these risks. We recommend that lexical data be digitized, converted to Unicode, and exported as a plain text file marked up as XML. This should be accompanied by a second file, a schema, which explains the structure of the XML file. Finally, your digital lexicon should be submitted to an archive. Plain text files are readable by many different pieces of software on many different operating systems and are the best defense against obsolescence.
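As a minimal sketch of the recommendation above, the following Python code normalizes headwords to a single Unicode form and exports a small lexicon as UTF-8 XML. The element names (lexicon, entry, headword, pos, gloss) and the function names are assumptions made for illustration only; they are not the E-MELD or FIELD schema.

```python
import unicodedata
import xml.etree.ElementTree as ET


def build_lexicon(entries):
    """Build an XML tree from (headword, part_of_speech, gloss) tuples.

    The element names used here are hypothetical, not a published schema.
    """
    root = ET.Element("lexicon")
    for headword, pos, gloss in entries:
        entry = ET.SubElement(root, "entry")
        # NFC normalization stores each character sequence in one canonical form,
        # so "e" + combining acute accent becomes the single code point "é".
        ET.SubElement(entry, "headword").text = unicodedata.normalize("NFC", headword)
        ET.SubElement(entry, "pos").text = pos
        ET.SubElement(entry, "gloss").text = gloss
    return root


def to_xml_bytes(root):
    """Serialize the tree as UTF-8 encoded plain text with an XML declaration."""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Illustrative data only.
    lexicon = build_lexicon([("cafe\u0301", "noun", "coffee")])
    with open("lexicon.xml", "wb") as handle:
        handle.write(to_xml_bytes(lexicon))
```

The resulting file is plain text, so it can be opened in any text editor as well as by XML-aware software. A schema describing these elements would still need to be written separately.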
Because of the difference in lexicon content and use, it's important to allow linguists to develop their own schemas. In the E-MELD project, interoperability among online lexicons will be achieved through the development of metaschemas that relate the lexicon terminology to concepts in the GOLD ontology.
Dictionaries can take a variety of formats, so a number of schemas have been developed to structure online lexicons. E-MELD has adopted a particular schema for XML output in FIELD.
'Interoperability' refers to different lexicons being compatible with each other, in terms of their structure and the terminology used. There are two enormous benefits in making lexicons interoperable. The first is considerably more powerful data searching. For example, you could search several lexicons for third person singular verb forms, without worrying about whether the lexicon creator used '3SG', '3 sing', 'thrd. sglr.' or some other label. This is sometimes referred to as a 'semantic' search, because the system is not just trying to match a keyword; it actually understands what you are looking for. The second benefit is that each lexicon is added to a knowledge base against which the consistency of the new document can be checked. Interoperability 'has the potential to greatly increase the precision of the original analyses, ... and ultimately to result in enriched resources' (Simons et al., 2004).
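The search benefit can be illustrated with a short Python sketch. It assumes a hand-written table that maps each lexicon creator's label to one shared concept name; the table, the concept name ThirdPersonSingular, the function name and the sample data are all hypothetical and stand in for the real mapping to GOLD.

```python
# Hypothetical mapping from lexicon-specific labels to a shared concept name.
LABEL_TO_CONCEPT = {
    "3SG": "ThirdPersonSingular",
    "3 sing": "ThirdPersonSingular",
    "thrd. sglr.": "ThirdPersonSingular",
}


def search(lexicons, concept):
    """Return (lexicon_name, form) pairs whose label maps to the given concept.

    `lexicons` maps a lexicon name to a list of (form, label) pairs.
    Labels missing from the mapping are simply skipped.
    """
    hits = []
    for name, entries in lexicons.items():
        for form, label in entries:
            if LABEL_TO_CONCEPT.get(label) == concept:
                hits.append((name, form))
    return hits


if __name__ == "__main__":
    # Illustrative data: three lexicons, each using a different label.
    lexicons = {
        "lexicon_a": [("walks", "3SG"), ("walk", "1SG")],
        "lexicon_b": [("runs", "3 sing")],
        "lexicon_c": [("sings", "thrd. sglr.")],
    }
    for name, form in search(lexicons, "ThirdPersonSingular"):
        print(name, form)
```

A plain keyword search for '3SG' would find only the form in the first lexicon; searching by concept finds all three.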
This interoperability can be achieved by creating a mapping from each individual lexicon schema to a common schema. The mappings are known as 'metaschemas', and there is one for every lexicon schema. The process is described more fully in Simons (2003) and Simons et al. (2004). Essentially, a metaschema is written in XML according to a particular DTD (i.e. in a particular language, the Semantic Interpretation Language). The metaschema maps the linguist's original terminology to the GOLD ontology and standardizes the structure. An XSLT operation interprets the metaschema and produces an RDF document from which semantic searching of the lexicon is possible. As all the metaschemas are written according to the same DTD, the same XSLT interpreter can be used for each metaschema.
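The idea of one mapping per lexicon feeding one shared interpreter can be sketched in Python. This is only an illustration of the principle: it uses Python dictionaries in place of metaschemas written in the Semantic Interpretation Language, a Python function in place of the XSLT interpreter, and simple triples in place of a full RDF document. The namespace http://example.org/gold#, the element and attribute names, and the function name are all placeholders, not the real GOLD identifiers.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace; NOT the real GOLD ontology URI.
GOLD = "http://example.org/gold#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# One hypothetical "metaschema" per lexicon: original label -> shared concept.
METASCHEMA_A = {"3SG": GOLD + "ThirdPersonSingular"}
METASCHEMA_B = {"thrd. sglr.": GOLD + "ThirdPersonSingular"}


def to_triples(xml_text, metaschema, base):
    """Interpret one lexicon with its metaschema; return (subject, predicate, object) triples.

    The same function serves every lexicon, because every mapping has the same shape.
    Forms whose label is not covered by the metaschema produce no triple.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for index, form in enumerate(root.iter("form")):
        concept = metaschema.get(form.get("tag"))
        if concept is not None:
            triples.append((f"{base}form{index}", RDF_TYPE, concept))
    return triples


if __name__ == "__main__":
    # Illustrative lexicons using different labels for the same category.
    lexicon_a = '<lexicon><form tag="3SG">walks</form></lexicon>'
    lexicon_b = '<lexicon><form tag="thrd. sglr.">runs</form></lexicon>'
    for triple in to_triples(lexicon_a, METASCHEMA_A, "http://example.org/lexA/"):
        print(triple)
    for triple in to_triples(lexicon_b, METASCHEMA_B, "http://example.org/lexB/"):
        print(triple)
```

Both lexicons end up described with the same concept identifier, even though the original files keep their own labels.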
With regard to the terminology mapping, one of the principles of E-MELD is not to interfere with a linguist's choice of terminology. Especially within individual language families, we recognize that there are certain conventions in terminology use that might not match the output of GOLD. Therefore the transformation is done in the background, and the transformed file is used only for searching and archiving.
Tools to create metaschemas automatically are in development, and the GOLD ontology is not yet finalized. While transforming your lexicon to a common RDF document follows the recommendations of best practices, we recognize that doing so currently requires a high level of skill and experience in web technologies. However, in time, we hope linguists will be encouraged to do this and thus help create a valuable and innovative linguistics resource.