OCR or Keyboard?
Digitization of language documentation often involves conversion of paper records, such as lexicons, grammars, or narratives, to an electronic format. Scanning a document creates only an image of the page; to create a text file that can be edited, OCR (Optical Character Recognition) software is needed. OCR software reads the page image and interprets the shapes on it as characters, saving them as text in an encoding that the computer can search or index, such as Unicode or ASCII.
However, OCR is only useful for certain kinds of documents; for others, it is better to type in the data with the keyboard. Here are the questions to ask before deciding whether to OCR or to keyboard your documents.
Is the text handwritten?
- Current OCR applications work only on printed or typewritten text. Even the neatest handwritten text is too inconsistent to be read by OCR.
Is the text cleanly printed?
- Extraneous dots and other markings will interfere with the OCR application, leading to a higher error rate.
Does it contain non-standard characters, handwritten annotations, or modifications on typewritten text?
- Most OCR applications were written for Roman orthographies, so characters outside the standard set are likely to be misread, and, as noted above, they cannot interpret handwritten text.
Are you prepared to spend time proofreading the output of the OCR application?
- OCR is not 100% accurate, even when run on the cleanest possible text. A certain amount of inaccuracy may be acceptable for some purposes; for example, if the page images are to be placed online and the OCR output is used only in the background for full-text searching, occasional errors may not matter. But in most cases, careful error correction will be needed to ensure accuracy.
In summary, OCR can be a valuable tool for the digitization of cleanly printed documents using standard Roman orthographies, but it requires careful proofreading. When in doubt, it may be better to enter text through the keyboard.
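The questions above can be read as a simple checklist. The following is a minimal sketch in Python, not part of any OCR tool: the names `Document` and `recommend`, the field names, and the strict "any doubt means keyboard" ordering are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Answers to the checklist questions for one document (hypothetical fields)."""
    handwritten: bool               # Is the text handwritten?
    cleanly_printed: bool           # Is the text free of stray dots and markings?
    nonstandard_or_annotated: bool  # Non-standard characters or handwritten additions?
    will_proofread: bool            # Is time available to proofread the OCR output?
    search_only: bool = False       # Output used only for background full-text search?


def recommend(doc: Document) -> str:
    """Return "ocr" or "keyboard" by applying the questions in order."""
    if doc.handwritten:
        return "keyboard"
    if not doc.cleanly_printed:
        return "keyboard"
    if doc.nonstandard_or_annotated:
        return "keyboard"
    # Unproofread output is assumed acceptable only for background searching.
    if not doc.will_proofread and not doc.search_only:
        return "keyboard"
    return "ocr"


if __name__ == "__main__":
    typed_lexicon = Document(
        handwritten=False,
        cleanly_printed=True,
        nonstandard_or_annotated=False,
        will_proofread=True,
    )
    print(recommend(typed_lexicon))
```

In practice the answers are rarely this clear-cut; a document with only a few annotations, for instance, might still be worth running through OCR and correcting by hand.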