Unicode for Language Documentation
- Unicode and the International Phonetic Alphabet
- Developing a Unicode Compliant Orthography
- Adding Characters to Unicode
- Precomposed forms
Linguists who use the IPA or work with languages that currently have no orthography can draw on characters already encoded in Unicode. Using already encoded characters ensures interoperability with current applications, especially if glyphs for the characters are available in fonts that are already widely used (e.g., Arial Unicode MS).
Though most IPA symbols are contained in the IPA Extensions block, some come from other blocks (Latin and Greek, for example). These characters are listed at the beginning of the names list for the IPA Extensions block. A Phonetic Extensions block, created primarily for the Uralic Phonetic Alphabet, was added in Unicode 4.0. Note that characters for linguistic transcription may also be built from a base character plus characters from the Spacing Modifier Letters block or the Combining Diacritical Marks block.
The Unicode Character Names Index is also useful. It lists the formal character names, alternative character names, and character group names alphabetically.
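As a rough illustration of the two points above, the following Python sketch builds a transcription segment from a base character plus marks and then reports the formal Unicode name of each character. It uses only the standard `unicodedata` module; the helper name `describe` and the chosen segment are illustrative assumptions, not part of any Unicode tooling.

```python
import unicodedata

# A nasalized, long open-mid vowel built from a base letter plus marks.
base = "\u025B"    # LATIN SMALL LETTER OPEN E (IPA Extensions block)
nasal = "\u0303"   # COMBINING TILDE (Combining Diacritical Marks block)
length = "\u02D0"  # MODIFIER LETTER TRIANGULAR COLON (Spacing Modifier Letters block)

segment = base + nasal + length

def describe(text):
    """Return (code point, formal name) pairs for each character in text.

    Hypothetical helper for illustration only.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

for code_point, name in describe(segment):
    print(code_point, name)

# The lookup also works in the other direction: formal name to character.
esh = unicodedata.lookup("LATIN SMALL LETTER ESH")
```

The names returned come from the version of the Unicode Character Database bundled with the Python interpreter, so very recently encoded characters may be reported as unnamed.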
When developing an orthography, keep in mind these suggestions, which will enable your language to be used with current software, keyboards, and fonts:
- Use characters already in Unicode. Creating a new symbol or letter that doesn't exist in Unicode will make it difficult to use off-the-shelf software and fonts.
- In general, try to use orthographic conventions that are already widely used. By doing so, you will have better access to keyboard input, to fonts that contain the characters, and to default behavior for searching and sorting. While it might be tempting to develop an orthography deliberately different from others, this may actually work against you, because it will cut off your group from tools they need to further their own language literacy and usage. A better approach is to express distinctiveness through logos, design elements, and styles. Alternatively, a group can adopt a distinctive font that declares its identity without preventing the use of email or other information systems.
- Avoid using symbols or punctuation as letters. Furthermore, don't use presentation forms and "letter-like" symbols that exist in Unicode.
- If you are using diacritics, use more common ones. If you are going to use barred letters, make use of the ones that are already encoded - they are there because they are already in significant use for other orthographies. Don't make up new ones.
- Don't use all capitals. Text set entirely in upper-case Latin letters is difficult to read.
- Be sure to check the properties of characters in Unicode before using them (e.g., don't pick a character from a right-to-left script if you are devising a left-to-right orthography; this will cause problems). Information on character properties is found in Chapter 4 of the Unicode Standard (available on the Unicode Consortium website) and in the Unicode Character Database.
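Several of the suggestions above can be checked mechanically. The sketch below, using Python's standard `unicodedata` module, flags candidate characters that are private-use or unassigned, that are punctuation or symbols, that belong to a right-to-left script, or that look like presentation forms or letter-like symbols. The function name `audit_character`, the warning strings, and the choice of decomposition tags are assumptions made for illustration; the decomposition test in particular is only a rough heuristic.

```python
import unicodedata

# Decomposition tags that typically mark presentation forms and
# letter-like symbols (an assumed heuristic, not an official list).
PRESENTATION_TAGS = ("<font>", "<initial>", "<medial>", "<final>", "<isolated>")

def audit_character(ch):
    """Return warnings about using ch as a letter in a left-to-right orthography."""
    warnings = []
    category = unicodedata.category(ch)   # general category, e.g. "Ll", "Po", "Co"
    bidi = unicodedata.bidirectional(ch)  # bidirectional class, e.g. "L", "R", "AL"

    if category in ("Cn", "Co"):
        warnings.append("unassigned or private-use code point")
    elif category[0] in ("P", "S"):
        warnings.append("punctuation or symbol, not a letter")
    if bidi in ("R", "AL"):
        warnings.append("right-to-left character")
    if unicodedata.decomposition(ch).startswith(PRESENTATION_TAGS):
        warnings.append("presentation form or letter-like symbol")
    return warnings

# Hypothetical candidates for a new orthography.
for candidate in ["\u0253", "!", "\u01C3", "\u05D0", "\u2113", "\uE000"]:
    print(f"U+{ord(candidate):04X}", audit_character(candidate) or "no warnings")
```

Note the contrast between `!` (U+0021, punctuation) and the visually similar click letter U+01C3, which is encoded as a letter and therefore passes the check.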
The Unicode Standard offers a huge array of encoded characters that can serve most linguists' needs. Because these characters are already in Unicode, which has been adopted by many software and font companies, they can be used in documents today. "Inventing" a new character is not recommended, because such non-standard characters cause problems for short-term and long-term accessibility (i.e., sending, receiving, and printing). Precomposed forms and ligatures are no longer eligible for encoding.
If you find you need a particular character that is not covered by Unicode, you are advised to work with the Script Encoding Initiative or directly with the Unicode Technical Committee to develop your proposal. Particularly helpful for proposals are copies of pages from books or journals that show a particular character in context (with the bibliographic information included). Though the full approval process can take several years, it will provide a means for others in the future to access the character in the international character encoding standard.
The Script Encoding Initiative, at the Department of Linguistics at the University of California at Berkeley, is dedicated to funding the development of script proposals. It aims to avoid extensive revision of proposals and extensive involvement of the Unicode Technical Committee.
General guidelines on how to submit a proposal can be found on the Unicode Consortium website.
A precomposed form is a single character that is made up of a series of characters. For example, kʷ encoded as one character would be a precomposed form, made up of a k and a modifier letter, ʷ. In the early days of Unicode, dynamically generating forms from a base character and a combining mark was difficult or impossible, and many such precomposed forms were included in older character sets. As a result, a number of these precomposed forms were added to Unicode. Current rendering engines and fonts, however, can create combinations of base character and combining mark dynamically. The position of the Unicode Technical Committee (UTC) is to rely on this productive method of composition and not to encode more precomposed forms.
Similarly, ligatures, which are two (or more) glyphs fused together, are not eligible for character encoding. In general, ligatures can be handled by a font or rendering engine. Six digraph ligatures are nonetheless included in the IPA Extensions block (U+02A3 to U+02A8). They were included because they are defined in the IPA for the transcription of the coronal affricates, and a transcriber can choose them in order to convey a semantic distinction about the phonetic status of the affricate.
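The relationship between precomposed forms and dynamically composed sequences can be seen with Unicode normalization. The sketch below uses Python's standard `unicodedata.normalize`; the helper `same_text` and the example strings are illustrative assumptions. It shows that an older precomposed character such as é is canonically equivalent to its base-plus-mark sequence, while sequences with no precomposed character remain sequences.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT, two code points

# The raw strings differ, but normalization maps one onto the other.
composed_again = unicodedata.normalize("NFC", decomposed)
taken_apart = unicodedata.normalize("NFD", precomposed)

# Sequences without a precomposed character are left alone by NFC.
nasal_open_e = "\u025B\u0303"   # open e + COMBINING TILDE
labialized_k = "k\u02B7"        # k + MODIFIER LETTER SMALL W

def same_text(a, b):
    """Compare two strings after canonical normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # raw comparison
print(same_text(precomposed, decomposed))   # comparison after normalization
print(len(unicodedata.normalize("NFC", nasal_open_e)))
print(len(unicodedata.normalize("NFC", labialized_k)))
```

Normalizing to one form before searching, sorting, or comparing avoids mismatches between text typed with precomposed characters and text typed with combining marks.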
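The difference between a ligature kept only for compatibility and the six IPA digraphs can be observed in the character data. In this hedged Python sketch (standard `unicodedata` only; variable names are illustrative), the typographic ligature U+FB01 dissolves into two letters under compatibility normalization, whereas the IPA digraph at U+02A7 has no decomposition and is treated as a letter in its own right.

```python
import unicodedata

# The six digraph ligatures at U+02A3 through U+02A8.
digraphs = [chr(cp) for cp in range(0x02A3, 0x02A9)]
digraph_names = [unicodedata.name(ch) for ch in digraphs]

tesh = "\u02A7"          # IPA digraph for the voiceless postalveolar affricate
fi_ligature = "\uFB01"   # typographic ligature kept for compatibility

# Compatibility normalization splits the typographic ligature...
fi_normalized = unicodedata.normalize("NFKC", fi_ligature)
# ...but leaves the IPA digraph untouched.
tesh_normalized = unicodedata.normalize("NFKC", tesh)

# A transcriber may instead write the affricate as a two-letter sequence.
sequence = "t\u0283"     # t + LATIN SMALL LETTER ESH

for ch, name in zip(digraphs, digraph_names):
    print(f"U+{ord(ch):04X}", name)
print(fi_normalized, tesh_normalized == tesh, sequence != tesh)
```

Because the digraph and the two-letter sequence are distinct strings, a search for one will not find the other unless the application maps between them explicitly.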