Unicode Code-points

Introduction

In Unicode, each character receives a "code-point" that is unique and remains the same regardless of computer platform, language or program. This number is stored in the computer and is used to refer to the character. Unicode code-points are conventionally written as 'U+' followed by a hexadecimal value; valid code points range from U+0000 to U+10FFFF. For example, the Unicode code point for the Latin capital letter "A" is U+0041.
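As a brief illustration (a minimal Python sketch, not part of the original page), the code point of a character can be inspected directly and printed in the conventional U+ form:

    # Print the Unicode code point of a few characters in the conventional U+XXXX form.
    for ch in ["A", "é", "ŋ"]:
        print(ch, "-> U+%04X" % ord(ch))

    # A -> U+0041
    # é -> U+00E9
    # ŋ -> U+014B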

Version 4.0.0 of the Unicode Standard, developed by the Unicode Consortium, assigns a unique identifier to each of 96,382 characters (increased from 95,156 in version 3.2), covering the scripts of the world's principal written languages and many mathematical and other symbols. Most importantly for linguists, it includes the International Phonetic Alphabet.

Unicode assigns a unique number for every character (which can appear in three different encoding forms) and has space for encoding over one million characters. Most other character encodings are based on an 8-bit byte, which only provides space for 256 characters. This is a problem for languages that have more than 256 characters, which often have text encodings that are still based on the 8-bit byte but use additional, often complicated schemes to manage the sequences of bytes to represent the characters.

Varieties of Unicode

One of the more unusual things about Unicode is that it comes in three varieties (technically, "encoding forms"): UTF-8, UTF-16, and UTF-32, which differ by the size of the code unit manipulated by a computer when handling characters. Each of the encoding forms can represent exactly the same repertoire of characters, so they are all equivalent in that regard, but they are used in different computational contexts. UTF-32 is not widely encountered by end users and is not widely used for archiving data, though it is quite prevalent as a computational form used internally on Unix systems.
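As an informal illustration (a Python sketch, not from the original article), the same text can be encoded in all three forms; the repertoire of characters is identical, only the size of the code units and the resulting byte counts differ:

    text = "Unicode"

    # Encode the same string in each of the three encoding forms.
    # Explicit byte orders ("-le") are used so that no byte-order mark is written.
    for form in ("utf-8", "utf-16-le", "utf-32-le"):
        print(form, len(text.encode(form)), "bytes")

    # utf-8      7 bytes  (one byte per ASCII-range character)
    # utf-16-le 14 bytes  (one 16-bit code unit per character)
    # utf-32-le 28 bytes  (one 32-bit code unit per character)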

The two Unicode forms you will most often run into are UTF-16 (which is the form used in almost all current Windows software, including Word) and UTF-8 (which is the form most prevalent in databases and for Web use). The LINGUIST and E-MELD sites are implemented in UTF-8.

UTF-8 was designed as a means of providing a form of Unicode that would work with existing software that handles strings in terms of 8-bit characters. It is defined in such a way that the first 128 Unicode characters are encoded exactly as they are in ASCII, so that any ASCII text is also valid UTF-8. ASCII, however, uses a single byte per character, and a single byte cannot represent the tens of thousands of characters needed for a worldwide character encoding. UTF-8 therefore adopts the strategy of previously existing large character sets, using a varying number of bytes to represent a single character: between one and a maximum of four per character. UTF-16 also uses a variable-width strategy, using either one or two 16-bit code units to represent each character.
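To make the variable-width behaviour concrete (a hedged Python sketch, not part of the original article), here is how many bytes a few characters occupy in each form:

    # How many bytes each character needs in UTF-8 (1-4 bytes)
    # and in UTF-16 (one or two 16-bit code units, i.e. 2 or 4 bytes).
    samples = {
        "A":  "LATIN CAPITAL LETTER A",
        "é":  "LATIN SMALL LETTER E WITH ACUTE",
        "ʃ":  "LATIN SMALL LETTER ESH (IPA)",
        "中": "a CJK character",
        "𐐷": "DESERET SMALL LETTER EW (outside the Basic Multilingual Plane)",
    }
    for ch, name in samples.items():
        u8 = len(ch.encode("utf-8"))
        u16 = len(ch.encode("utf-16-le"))
        print(f"U+{ord(ch):04X} {name}: {u8} bytes in UTF-8, {u16} bytes in UTF-16")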

For a machine, conversion between these two kinds of Unicode is easy, though for a human being it is more problematic. If a Unicode character appears as garbage when you move it into another Unicode file, it is probably because one file is UTF-16 and the other UTF-8. Most of the time, however, this kind of conversion will be done for you automatically, by your browser, for example.
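For instance (a minimal Python sketch, assuming a file whose encoding is already known; the file names are hypothetical), converting between the two forms is a mechanical decode-and-re-encode step:

    # Convert a UTF-16 encoded file to UTF-8.
    with open("notes-utf16.txt", encoding="utf-16") as src:
        text = src.read()       # decode the UTF-16 bytes into characters
    with open("notes-utf8.txt", "w", encoding="utf-8") as dst:
        dst.write(text)         # write the same characters back out as UTF-8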

Converting encodings

Conversion between Unicode and either ASCII or the ISO 8859 character sets is usually easy. The first 128 Unicode code points (U+0000 to U+007F) have exactly the same values as the corresponding ASCII code points. This makes conversion between ASCII and Unicode straightforward: to get UTF-16 from ASCII, take the 7-bit ASCII value and pad it with zeros until you reach 16 bits. Similarly, the first 256 characters (U+0000 to U+00FF) are exactly the same as the characters with the same code points in ISO 8859-1 (Latin 1). Again, to convert from Latin 1 to UTF-16, take the 8-bit Latin-1 byte and add another byte containing all zeros so that you reach 16 bits.
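As a worked check (a Python sketch, not from the original page; it assumes little-endian UTF-16, so the zero byte follows each Latin-1 byte):

    # Build UTF-16 (little-endian) by hand from Latin-1 bytes:
    # each 8-bit Latin-1 value becomes that byte followed by a zero byte.
    latin1_bytes = "café".encode("latin-1")                 # b'caf\xe9'
    by_hand = b"".join(bytes([b, 0x00]) for b in latin1_bytes)

    # The library conversion gives exactly the same result.
    assert by_hand == "café".encode("utf-16-le")
    print(by_hand.hex(" "))                                 # 63 00 61 00 66 00 e9 00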

Other character conversions are much more difficult. There are numerous utilities for converting one character set to another. The Unix utility iconv will convert character sets between ISO 8859-X and Unicode, and the C3 system, developed by the Trans-European Research and Education Networking Association (TERENA), will convert between European character sets. But such conversion facilities are rarely of use to linguists, since they are designed for the conversion of standard character sets into other standard character sets. They will thus convert Cyrillic into Latin, for example, or ISO 8859-X into Unicode. But linguists have in the past represented IPA either by using the non-Unicode encodings defined by such fonts as IPAKiel or the SIL font suite, or with the (X)SAMPA alphabet in ASCII. Many have simply used arbitrary characters they themselves selected. These are hard to convert into Unicode, simply because of their arbitrariness.
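For the standard-to-standard case the conversion really is mechanical: decode the legacy bytes into Unicode, then re-encode them in the target form, which is what a tool such as iconv does. A small Python sketch (an illustration only, with a made-up two-letter example):

    # Convert Cyrillic text from the legacy ISO 8859-5 encoding to UTF-8.
    legacy = bytes([0xD4, 0xD0])                 # "да" in ISO 8859-5
    text = legacy.decode("iso-8859-5")           # -> internal Unicode string (U+0434, U+0430)
    print(text, text.encode("utf-8").hex(" "))   # да d0 b4 d0 b0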

The best comprehensive character conversion facility existing so far is one produced by SIL, called TECkit. However, this is a complex piece of software, and it requires some skill to use. It can also be modified to incorporate new mappings, but this is not easy to do. If you are interested in trying to do this yourself, there is a useful tutorial on the SIL site. For the ordinary user, however, it is probably easier just to use a Unicode-aware piece of word-processing software like Word 2000 or XP, and globally replace characters by hand. The E-MELD project is currently developing a utility which will allow you to do simple mappings from one character set to Unicode, but as yet it is not ready for general use.
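By way of illustration (a deliberately simplified Python sketch; the legacy characters and the mapping table are hypothetical stand-ins for whatever an ad hoc font actually used), such a hand-built mapping amounts to a table of replacements applied character by character:

    # A hypothetical mapping from characters an ad hoc legacy font displayed
    # as IPA symbols to the proper Unicode IPA code points.
    legacy_to_unicode = {
        "S": "\u0283",   # shown as esh in the legacy font  -> ʃ (U+0283)
        "N": "\u014B",   # shown as eng                     -> ŋ (U+014B)
        "?": "\u0294",   # shown as a glottal stop          -> ʔ (U+0294)
    }

    def convert(text):
        """Replace each legacy character with its Unicode equivalent."""
        return "".join(legacy_to_unicode.get(ch, ch) for ch in text)

    print(convert("baN?a"))   # -> baŋʔa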

The content of this page was developed from Deborah Anderson's presentation at the 2003 E-MELD workshop.
