Gary F. Simons, SIL International
29 April 2003
EMELD Workshop on Digitizing and Annotating Texts and Field Recordings
11-13 July 2003, E. Lansing, MI
The EMELD proposal (section 3.1) lists the following among the objectives of the project:
One of the goals of this workshop is to begin the process of developing best practice recommendations for the markup and metadata description of annotated texts and field recordings. The purpose of this document is to lay out a roadmap for what needs to be developed in order to meet these objectives of the project. This is done by proposing requirements for the eventual solution and then enumerating consequent features of its implementation. But first I begin with some background definitions.
To create electronically encoded resources we will use markup languages. One conclusion of the first EMELD workshop in 2001 was that EMELD would recommend markup based on XML, the Extensible Markup Language, an information interchange standard of the World Wide Web Consortium (W3C 2000). Those unfamiliar with XML are referred to the Text Encoding Initiative's "Gentle Introduction to XML" (Sperberg-McQueen and Burnard 2001).
A markup language, like a natural language, has a lexicon, syntax, and semantics. The following terms are used throughout this paper to refer to the descriptive artifacts that document these three aspects of markup:
- markup vocabulary
Enumerates the lexical inventory of markup: i.e., the set of elements and attributes that are used in marking up a resource. (In practice, the vocabulary is enumerated within the markup schema rather than in a separate document.)
- markup schema
Specifies the syntax of markup: i.e., a formal grammar defining constraints on where elements and attributes must or may occur with respect to embedding and relative order. (This is typically realized in an XML DTD or an XML Schema, though other mechanisms are emerging.)
- markup metaschema
Specifies the semantics of markup: i.e., a formal mapping from elements and attributes to the linguistic concepts they represent. (This area of markup is not as well developed as the syntactic area, but is beginning to be developed under the impetus of the so-called Semantic Web; see W3C 2002.)
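The first two of these artifacts can be illustrated with a minimal sketch. It assumes a hypothetical vocabulary (the elements text, phrase, and word and the attribute pos are invented for illustration and are not a recommended scheme) and stands in for a real DTD or XML Schema with a toy table of allowed children and attributes; unlike a real schema, it checks neither relative order nor required occurrence.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated text; element and attribute names are illustrative only.
SAMPLE = """<text>
  <phrase>
    <word pos="n">dog</word>
    <word pos="v">barks</word>
  </phrase>
</text>"""

# A toy "schema": for each element, the children and attributes it may carry.
SCHEMA = {
    "text": {"children": {"phrase"}, "attributes": set()},
    "phrase": {"children": {"word"}, "attributes": set()},
    "word": {"children": set(), "attributes": {"pos"}},
}


def vocabulary(root):
    """Return the sets of element and attribute names used in the document."""
    elements, attributes = set(), set()
    for node in root.iter():
        elements.add(node.tag)
        attributes.update(node.attrib)
    return elements, attributes


def violations(root, schema):
    """List the places where the document departs from the toy schema."""
    problems = []
    for node in root.iter():
        rule = schema.get(node.tag)
        if rule is None:
            problems.append(f"unknown element <{node.tag}>")
            continue
        for child in node:
            if child.tag not in rule["children"]:
                problems.append(f"<{child.tag}> not allowed inside <{node.tag}>")
        for name in node.attrib:
            if name not in rule["attributes"]:
                problems.append(f"attribute '{name}' not allowed on <{node.tag}>")
    return problems


root = ET.fromstring(SAMPLE)
elements, attributes = vocabulary(root)
print(sorted(elements), sorted(attributes))
print(violations(root, SCHEMA))
```

In practice the vocabulary would be read off the schema rather than off an instance document, and validation would be delegated to a standard XML validator.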
In this presentation of requirements the individual requirements are set apart as numbered statements in order to facilitate discussion. Similarly, the consequent features are set out as subordinate statements that bear an identification letter, as in:
The first requirement deals with the need for longevity of access far into the future. This aspect of language documentation and description is covered in detail in Bird and Simons (2002); only a few key points are noted here:
Microsoft Word documents provide an example of a proprietary, binary format that is not acceptable for long-term preservation of information. Plain text documents formatted with line breaks and spaces are an example of a format that meets requirements a through c; so are tab- or comma-delimited representations of spreadsheets or data tables. But most language resources have a more complex structure involving hierarchy and cross-reference, thus a more sophisticated representation is needed. Markup based on the XML standard meets all the above requirements and is now supported by such a wide variety of tools (both open and proprietary) that it has become the clear choice for archival formats. But what should the nature of the markup vocabulary be?
HTML markup, when applied to language resources, is an example of presentational markup. Though it does have the features of longevity needed for an archival format, it does not offer linguists the ability to do automated processing of a linguistic nature, such as to answer the query "What are the part-of-speech categories used in tagging this text?" For this purpose a markup vocabulary that specifically identifies the linguistic significance of each piece of information is needed. But simply having a markup vocabulary is not enough; each marked-up resource also has a grammar that defines how the individual markup elements may combine to form a valid resource.
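The part-of-speech query mentioned above can be sketched as follows. The two sample fragments and their element names are assumptions made for illustration: one uses a hypothetical descriptive vocabulary (word with a pos attribute), the other uses italics in the manner of HTML to display the same tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: the tag is identified as a part of speech.
DESCRIPTIVE = """<text>
  <word pos="noun">dog</word>
  <word pos="verb">barks</word>
  <word pos="noun">cat</word>
</text>"""

# Presentational markup: italics are used for tags and for anything else.
PRESENTATIONAL = (
    "<p>dog <i>noun</i> barks <i>verb</i> cat <i>noun</i> <i>sic</i></p>"
)


def pos_categories(xml_text):
    """Answer the query directly from the descriptive markup."""
    root = ET.fromstring(xml_text)
    return sorted({w.get("pos") for w in root.iter("word") if w.get("pos")})


def italic_strings(html_text):
    """The best a program can do with presentational markup: list italic text."""
    root = ET.fromstring(html_text)
    return sorted({i.text for i in root.iter("i")})


print(pos_categories(DESCRIPTIVE))
print(italic_strings(PRESENTATIONAL))
```

The second function cannot tell a part-of-speech tag from an editorial "sic", since both are simply italic; the first needs no guessing because the markup itself states what each value is.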
These consequences of requirement 3 thus mean that there will be multiple markup schemas, even in the context of best practice. In order to achieve interoperability of resources when there are multiple markup schemas we will need to introduce a meta-level in our approach to markup:
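A minimal sketch of how such a meta-level might work is given below. Everything in it is an assumption for illustration: the two resources, their markup, the concept names "Word" and "PartOfSpeech" (standing in for entries in a common ontology), and the locator notation, in which a leading "@" means an attribute and anything else means a child element.

```python
import xml.etree.ElementTree as ET

# Two hypothetical resources that follow different markup schemas.
RESOURCE_A = '<text><word pos="noun">dog</word><word pos="verb">barks</word></text>'
RESOURCE_B = "<corpus><w><form>gato</form><cat>noun</cat></w></corpus>"

# Hypothetical metaschemas: each maps shared concepts to schema-specific markup.
METASCHEMA_A = {"Word": "word", "PartOfSpeech": "@pos"}
METASCHEMA_B = {"Word": "w", "PartOfSpeech": "cat"}


def pos_values(xml_text, metaschema):
    """Collect part-of-speech values, locating them through the metaschema."""
    root = ET.fromstring(xml_text)
    locator = metaschema["PartOfSpeech"]
    values = set()
    for unit in root.iter(metaschema["Word"]):
        if locator.startswith("@"):
            value = unit.get(locator[1:])      # concept stored in an attribute
        else:
            value = unit.findtext(locator)     # concept stored in a child element
        if value:
            values.add(value)
    return values


combined = pos_values(RESOURCE_A, METASCHEMA_A) | pos_values(RESOURCE_B, METASCHEMA_B)
print(sorted(combined))
```

The query code never mentions word, w, pos, or cat; it is written once against the shared concepts, and each additional schema is accommodated by supplying one more metaschema.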
Finally, it is not enough that electronically encoded resources are created. They must also be found and used by others long into the future. This implies a final set of consequences having to do with archiving.
The Open Language Archives Community is already in place with an infrastructure that meets these needs, and EMELD will build on this infrastructure.
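As a rough illustration of metadata-based discovery, the sketch below builds a description from plain Dublin Core elements and filters on the type element. It is not an OLAC record: the record container, the field values, and the matching function are hypothetical, and the OLAC-specific refinements and controlled vocabularies are omitted.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def describe(fields):
    """Build a metadata description from (element name, value) pairs."""
    record = ET.Element("record")  # hypothetical container, not the OLAC one
    for name, value in fields:
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


def matches(record, name, value):
    """True if the record carries the given Dublin Core element and value."""
    return any(e.text == value for e in record.iter(f"{{{DC}}}{name}"))


# Illustrative values only.
record = describe([
    ("title", "Sample annotated text"),
    ("type", "annotated text"),
    ("format", "text/xml"),
])

print(ET.tostring(record, encoding="unicode"))
print(matches(record, "type", "annotated text"))
```

A discovery service would apply a test like matches across the records harvested from many archives, which is why a controlled vocabulary for the type values matters.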
Taken together, the above requirements and the consequent features of implementation suggest the following shape for best practice with respect to the markup of texts and lexicons:
| | Best Practice for Resource Creation | What the Community Must Do to Support Best Practice |
| Language documentation and description | Archive resource as an XML document that is valid with respect to a descriptive markup schema that is supplied with the resource. | 1. Document characteristics of best practice descriptive |
| | | 2. Recommend one or more markup schemas that meet these characteristics. |
| | | 3. Develop stylesheets that do presentational rendering of resources that conform to these schemas. |
| Metadescription for resource discovery | Provide OLAC metadata for the resource and deposit it with an OLAC data provider. | 4. Define the OLAC metadata standard. |
| | | 5. Define the controlled vocabulary for identifying language resource types in a refinement of the Dublin Core <type> element. |
| | | 6. Develop a community service for resource discovery. |
| Metadescription for resource interoperation | Provide a metaschema for the resource. | 7. Define a common ontology of the concepts of language |
| | | 8. Define the markup schema for a metaschema. |
| | | 9. Develop metaschemas for the schemas recommended in point 2 above. |
| | | 10. Develop a community service that uses metaschemas to provide interoperation across multiple language resources. |
When the 10 community action steps listed in the last column have been completed, the "formulation" part of the EMELD objectives listed at the outset of this paper will have been met. The "promulgation" part will require additional work in areas like documentation, dissemination, and training.
Bird, Steven and Gary Simons, 2002. Seven Dimensions of Portability for Language Documentation and Description, Proceedings of the Workshop on Portability Issues in Human Language Technologies, Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands. Available at: http://arxiv.org/abs/cs/0204020. Revised version: http://www.ldc.upenn.edu/sb/home/papers/0204020/0204020-revised.pdf
Langendoen, D. Terence and others, 2002. Publications of the EMELD Arizona group. Available at: http://emeld.douglass.arizona.edu:8080/group.html.
Simons, Gary F., 1998. Using architectural processing to derive small, problem-specific XML applications from large, widely-used SGML applications, SIL Electronic Working Papers 1998-006. Available at: http://www.sil.org/silewp/1998/006/.
Sperberg-McQueen, C. M. and Lou Burnard, 2001. A Gentle Introduction to XML. Chapter 2 of TEI P4: Guidelines for Electronic Text Encoding and Interchange, XML-compatible edition. TEI Consortium. Available at: http://www.tei-c.org/P4X/SG.html
W3C, 2000. Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6 October 2000. Available at: http://www.w3.org/TR/REC-xml.
W3C, 2002. The Semantic Web, an activity of the World Wide Web Consortium. Home page: http://www.w3.org/2001/sw/.