Documentation in Danger
- The Domesday Project
- The Digital Dark Age
- Deterioration of Storage Media
- Hardware Obsolescence
- Software Obsolescence
- Lack of Intelligibility
As early as 1996, a special US Task Force on Digital Archiving drew attention to the fact that changes in coding, formats, software programs, and operating systems have made much valuable data inaccessible to modern computers, and therefore useless (Garrett and Waters, 1996).
A case in point is the BBC Domesday Project, a cultural documentation project created in 1986, in which "a vast archive of material was collected, which included some 200,000 photographs, 24,000 maps, 8,000 data sets, and 60 minutes of moving pictures" (Brown, 2003). The material was collected and stored using the era's most modern technology. However, by 2000 the huge data store had become inaccessible, because many parts of its complex hardware/software combination were incompatible with modern computers. Ironically, as one commentator noted, "after over nine centuries, the original Domesday Book can still be consulted. . . [but] the modern multimedia digital equivalent was unreadable after a mere decade and a half" (Brown, 2003).
Such examples have led some archivists to warn of an impending 'Digital Dark Age,' proclaiming that "due to the relentless obsolescence of digital formats and platforms . . . there has never been a time of such drastic and irretrievable information loss as right now" (Brand, 1999).
The long-term accessibility and usability of EL documentation is threatened by multiple dangers: physical deterioration of both digital and non-digital storage media; obsolescence of computer hardware and operating systems, which renders the storage media inaccessible; obsolescence of software, which renders file formats unreadable; and undocumented annotation, which may render the content uninterpretable.
Paper, audiotapes, videotapes, and computer diskettes are all prone to degradation and destruction. Moreover, many field notes and grammars currently reside on individual computers, vulnerable to disk crashes as well as file corruption. Some older notes and grammars still exist only in the form of notebooks and file cards. Because language data has historically been difficult to publish commercially, it may be stored negligently or even abandoned once the research based on it has been completed. Hence best practice involves the digitization of most non-digital material. But even digitized material is not impervious to threat.
Even though the physical media may endure, their future accessibility is threatened by the rapid pace of hardware obsolescence. As Simons notes, in the past 25 years alone, removable media on personal computers have evolved from 8" floppies, through 5.25" floppies, 3.5" floppies, Zip drives, CD-Rs, and DVD-Rs; to these we can now add memory sticks, or flash drives. Each change has the advantage of producing media capable of holding ever greater amounts of information, but each advance threatens to make the earlier media obsolete and the information they contain inaccessible. Even if 8" floppies survive today in pristine condition, computers that can read them are almost non-existent.
Software has evolved with similar rapidity, with vendors changing file formats and functionality with each version. In the last 20 years, for example, there has been a plethora of formats for displaying textual material, whether in print or on the Internet: WordStar, WordPerfect, MS Word 1.0 through Word 2003, LaTeX, PostScript, RTF, HyperCard, SGML, HTML 1.0 through 4.0, XHTML, XML, Hyper-G's HTF, Adobe's PDF, and countless more. Though some of these are open standards, i.e., publicly described so that new tools can be created to read them, most are proprietary formats readable only by the specific software that created them. Just as almost no one can now read an 8" floppy, almost no one can read a document created with MS Word 1.0. Even current versions of Word cannot access such files; indeed, they cannot access some types of material composed in Word 5.0.
It is not enough to make important EL documentation accessible to future generations of hardware and software; it must also be intelligible to a human being. Thus the markup terminology used to describe linguistic structures must also be widely interpretable, and this, in turn, requires that the terminology be thoroughly documented or systematically related to a well-documented standard.
Currently, there are many different terminology sets in common use. Although the use of a single terminology set is an unattainable (and probably undesirable) goal, the lack of uniformity presents serious difficulties for both humans and machines. The documentation may be difficult to interpret in its own right, since users must first learn new terminology before they can understand a new body of data; and if two bodies of data have incompatible markup, the datasets will be difficult to compare.
Moreover, a multiplicity of terminology sets impedes computational retrieval and analysis of the data. It inhibits machine searching for similar linguistic structures, since no search engine can be expected to "know" that differently named entities are equivalent. And it hinders the development of general tools for language analysis, since each dataset requires different software.
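The retrieval problem can be illustrated with a minimal sketch. The entries, tag names, and mapping table below are all hypothetical, invented for illustration: two datasets tag nouns differently ("N" versus "Noun"), so a literal search over one dataset's tag misses the other's, while a mapping from each local tag to a shared, documented concept lets one query cover both.

```python
# Hypothetical glossed entries from two datasets using different terminology.
dataset_a = [{"form": "perro", "pos": "N"}]
dataset_b = [{"form": "chien", "pos": "Noun"}]

def find_naive(entries, tag):
    """Literal tag match: fails across terminology sets."""
    return [e["form"] for e in entries if e["pos"] == tag]

# Searching dataset B for "N" finds nothing, though it contains a noun.
assert find_naive(dataset_b, "N") == []

# A documented mapping from each local tag to a shared concept
# (an assumed, illustrative table, not a real standard's vocabulary).
TO_STANDARD = {"N": "noun", "Noun": "noun", "V": "verb", "Verb": "verb"}

def find_by_concept(entries, concept):
    """Match via the shared concept rather than the local tag name."""
    return [e["form"] for e in entries if TO_STANDARD.get(e["pos"]) == concept]

# One query now retrieves equivalent structures from both datasets.
assert find_by_concept(dataset_a + dataset_b, "noun") == ["perro", "chien"]
```

This is the motivation for relating each project's terminology to a well-documented standard rather than hoping every tool learns every tag set.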
BP in a Nutshell
What are Best Practices?
Why Follow BP?
Community Start Page
Linguist Start Page
Archivist Start Page