SIL Electronic Working Papers 2006-003, March 2006
Copyright © 2006 Gary F. Simons and SIL International
All rights reserved.

An expanded version of a paper originally presented at the:
     EMELD Symposium on “Endangered Data vs. Enduring Practice,”
     Linguistic Society of America annual meeting
     8-11 January 2004, Boston, MA

Ensuring that digital data last
The priority of archival form over working form and presentation form

Gary F. Simons
SIL International



  1. The paradox of development
  2. What’s a linguist to do?
  3. Characteristics of an enduring format
  4. Three levels of archival practice
  5. Illustrating levels of archival practice
  6. As rock solid as ASCII
  7. Conclusion



One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records in history are those that were carved into stone by the ancients. By contrast, digital word processing is our most advanced writing technology to date, but it is also the most ephemeral. Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years. Unless linguists take special measures to counter this, their digital records of endangered languages are in danger of dying out before the languages themselves.

A linguist must do two things in order to ensure that digital data endure: (1) the materials must be put into an enduring file format, and (2) the materials must be deposited with an archive that will make a practice of migrating them to new storage media as needed. The paper addresses the first of these issues. Most projects tend to focus on the working form of data (that is, the form in which the materials are stored as they are worked on from day to day) and the presentation form (the form in which the materials will be presented to the public). But these forms are closely tied to particular pieces of software and thus tend to become obsolete when the software does. The paper thus argues for the priority of the archival form (a form that is self-documenting and software-independent) as the object of language documentation. Many file formats for textual data are discussed and illustrated, with the ultimate conclusion that descriptive XML markup represents best current practice for the archival form.

1. The paradox of development

One of the great ironies of writing technology is that as technologies for writing become more advanced, the products of writing become less durable. The most enduring written records from antiquity are those that were carved into stone or pressed into kiln-baked clay tablets. Writing on vellum and papyrus was a great advance in that the process was faster and the resulting product was much less bulky; but it was also a step backwards on the durability scale since the medium could be destroyed by fire or by water or even by microbes. With the modern use of paper, writing has advanced further, but it has become less durable yet, as the chemicals used in the manufacture of paper can cause the medium to deteriorate from within, even in the best of storage conditions.

To complete the trend, digital word processing, which is our most advanced writing technology to date, is also the most ephemeral.  Whereas ink on acid-free paper will endure for centuries, the longevity of digital storage media is an order of magnitude shorter. The industry’s early answer to long-term digital storage was magnetic tape, but this has proved to have a life expectancy of only 10 to 20 years (Van Bogart 1995). The current answer, CD-R, fares better but is still ephemeral from an archival point of view.  Manufacturers report that CD-R discs should have a life expectancy of 100 to 200 years, but independent tests conducted at the National Institute of Standards and Technology found the life expectancy of the CD-R discs they tested to be 30 years (Byers 2003:13). The CD-RW medium is significantly less stable; the manufacturers predict a life expectancy of only 25 years. If the lab testing on CD-Rs is any indication, the actual life expectancy is probably more like 5 to 10 years. Byers (2003) gives an excellent description of how CD and DVD technologies work and how the media deteriorate over time.

But the problem is even worse than this, because the hardware devices that read these media become obsolete long before the media reach the end of their life expectancy. For instance, in the last 25 years we have seen removable media on personal computers advance from 8-inch floppies to 5.25-inch floppies to 3.5-inch floppies to Zip drives to CD-Rs to DVD-Rs. Unless one is diligent about migrating all of one’s legacy data to new media each time a new technology takes hold, those data will soon become trapped on media that no available hardware can read.

And the problem is worse yet, because software is changing, too.  Though software technology is not advancing as quickly as hardware technology, the effect of software change is more devastating since the migration strategy that works for keeping data files accessible on the latest media cannot ensure that the files remain usable.  This is because the functionality associated with those files is tied to particular software, and when the hardware that ran the needed software ceases to be available, then the functionality associated with those files ceases to exist.  The fact that software vendors may change the file formats and functionality with each new version of software only exacerbates the problem. 

When the results of our word processing are entrusted to the proprietary formats of a single software vendor, then we are completely at the mercy of that vendor as to whether our work will survive into the future. For instance, the author has a number of books and articles that were produced in the 1980s with Microsoft Word and its stylesheet feature (Simons 1989). The data files have been faithfully migrated over the years so that they remain readable today. However, current versions of Word no longer support stylesheets or the particular file format, with the result that the documents can no longer be rendered.  The text stream can still be retrieved with any plain text editor since the characters are encoded with the ASCII standard, but the formatting and layout are encoded in a proprietary binary format and thus are completely lost in the absence of software that understands that format.

The phenomenon of digital data loss has become so prevalent that many are beginning to warn of an impending “digital dark age”—the idea that historians of the future will look back to our present age as another Dark Ages since so much important information documenting our current civilization is recorded digitally and will have vanished (Bergeron 2002; Deegan and Tanner 2002). The popular press has chronicled many high-profile cases of digital data loss (Stepanek 1998; McKie and Thorpe 2002).  A recent Associated Press story quotes a technologist in the MIT library to relate a state of affairs that hits closer to home for the typical academic (Jesdanun 2003):

Every now and then, a faculty member would come in in tears having some boxes of completely unreadable tapes—they've lost their life's work.

The bottom line is that in these days of short-lived computer media, hardware, and software, linguists need to be particularly careful about the way they use digital technologies lest their work be lost within a decade or two.  In the absence of such diligence, our digital data records are even more endangered than the languages we are seeking to document.

2. What’s a linguist to do?

So what can a linguist do to ensure that digital documentation and description will endure long into the future?  The answer at one level is simple; there are just two things to do:

  1. Put the materials into an enduring file format.
  2. Deposit the materials with an archive that will make a practice of migrating them to new storage media as needed.

But at a deeper level, neither of these issues is simple. Fuller treatments of the issues involved in linguistic data archiving can be found in Bird and Simons (2003) and EMELD (2005). The remainder of this paper addresses just the first of these issues.

When considering file formats, we can contrast three classes of forms by their functions.

Working form
The form in which information is stored as it is created and edited.
Presentation form
The form in which information is presented to the public.
Archival form
The form in which information is stored for access long into the future.

Armed with these definitions we can address the fundamental problem.  It turns out, for reasons that are explained below, that popular working forms (like Microsoft Word and database applications) are not suitable as archival forms; neither are popular presentation forms (like dynamic web pages). Unfortunately, linguists tend to focus on working form and presentation form when they think about using digital technologies; instead, they must look beyond these forms to the archival form if they want to create work that will endure.

3. Characteristics of an enduring format

What makes a file format suitable for archiving? To just keep copying a file onto the latest storage medium is not good enough since the information in that file will be effectively lost when software for reading it is no longer available. As noted above, for instance, Microsoft Word files that are twenty years old can no longer be opened by modern versions of that program. How can we evaluate a present-day file format to predict if it will be readable well into the future? The following maxim will help:

An archival format should provide LOTS,

where LOTS is an acronym for:

An archival format should be lossless. That is, it should not lose any of the information that was in the original form. For instance, many compression formats (like JPEG and MP3) achieve significant savings in file size by sacrificing some of the original information. Such formats should be avoided in linguistic archiving. Another way that loss can occur is in analog-to-digital conversion of recordings and images. Audio sampling rates under 44 kHz lose the highest frequency components of the speech signal (Ladefoged 2003:26). Image scanning resolutions under 300 dots per inch give noticeable deterioration of image quality (MATRIX 2002). See EMELD (2005) for practical recommendations on audio sampling and image scanning.

An archival format should be open.  That is, the specification of the format should be openly documented and accessible to the public at large, rather than being the proprietary secret of a company or an individual.  In the case of a proprietary format, the information will be lost when the associated proprietary software is lost.  However, in the case of an open format, the information can still be interpreted in the future (even when associated software no longer works) as long as a copy of the open documentation can be found.  In the worst case, new software could be written to those specifications.

An archival format should be transparent. That is, if a scholar of the future were to open an archived file, the manner in which the information is encoded should be easy to understand and interpret, rather than requiring an act of genius to decipher. The first priority is that the format be open; then when choosing among possible open formats, a transparent format is to be preferred. For instance, a plain text file is transparent because there is a one-to-one correspondence between the stored numerical values and the characters they represent. Similarly, an audio WAV file is transparent because there is a one-to-one correspondence between the stored numerical values and the amplitudes of the sound wave they represent. By contrast, formats like ZIP and MP3 require the execution of a non-trivial algorithm to transform the sequence of stored numerical values into the characters or amplitudes they represent. This could prove an insurmountable obstacle to future users who no longer have access to the right software tools.

An archival format should be supported by multiple software suppliers. That is, the potential user of the archived resource should have many options when choosing software for reading and manipulating the resource. When there is only one possible software supplier, then the longevity of the resource is tied to the whims and fortunes of that single source. By contrast, one can enhance the staying power of an archival form by choosing a format that is supported by many software vendors. Ideally, the range of suppliers will include open source projects, so that potential users will be able to read the resource with no- or low-cost tools and so that the source code for the software will remain openly available to the community beyond the life of its creator.
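
The contrast between a transparent and a non-transparent encoding can be seen in a few lines of Python. This is only an illustrative sketch: the sample string is arbitrary, and the zlib module (which implements the DEFLATE method commonly used inside ZIP files) stands in here for compressed formats in general.

```python
import zlib

text = "An archival format should provide LOTS."

# Transparent: each stored byte is simply the code of one character,
# so the text can be recovered with nothing more than a code chart.
plain = text.encode("ascii")
decoded_by_hand = "".join(chr(value) for value in plain)

# Not transparent: the stored bytes have no one-to-one relation to the
# characters; the DEFLATE algorithm must be run to get the text back.
packed = zlib.compress(plain)
restored = zlib.decompress(packed).decode("ascii")

print(list(plain[:5]))   # character codes of "An ar": [65, 110, 32, 97, 114]
print(list(packed[:5]))  # values that mean nothing without the algorithm
print(decoded_by_hand == text, restored == text)
```

Both round trips are lossless, so the compressed form fails only the transparency test. A lossy format such as MP3 would in addition be unable to return the original values at all.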

Table 1 summarizes the LOTS characteristics of some common file formats. The most enduring formats (and thus the best for archiving) are those that have a plus in every column.

Table 1: LOTS characteristics of common file formats

Format   Lossless   Open   Transparent   Suppliers
MP3                 +                    +
WAV      +          +      +             +
JPEG                +                    +
GIF      +          +                    +
TIFF     +          +      +             +
BMP      +          +      +             +
PDF      +          +                    +
ZIP      +          +                    +
TXT      +          +      +             +
RTF      +          +      +             +
HTML     +          +      +             +
CSV      +          +      +             +
XML      +          +      +             +

4. Three levels of archival practice

With these criteria for what makes a file format enduring in place, it is now possible to define and illustrate three levels of practice with respect to digital archiving: unacceptable practice, acceptable practice, and best practice.

First, there is unacceptable practice in which the archived form of the resource is not in an enduring format. Many a grant proposal has promised that the results of the research would be “archived” on a web site, but this is not archiving at all since it does not involve placing an enduring form of the results under the care of an institution that is committed to preserving it for access long into the future. A web site typically provides an ephemeral presentation form of language resources.  This is particularly so in the case of a site that generates displays dynamically from information stored in a database.  In such a resource, successful display of the presentation form depends on the interoperation of many components—the hardware, the operating system, the web server, the application server, the database server, the programming language interpreter.  When any one of these components is upgraded there is the potential of breaking the language resource; over time, as components of a web site are periodically upgraded, obsolescence of the dynamic language resource is inevitable.  This is why the long term view of archiving gives priority to the archival form over the presentation form—a language documentation project should produce an enduring archival form of its results that succeeding generations will be able to transform for presentation on the web using the latest presentation technologies.

Another unacceptable practice is to archive a working form that is in a non-transparent, proprietary format that is understood only by the software of a specific vendor. This includes, for instance, the native formats of favorite commercial tools like the members of the Microsoft Office suite: the DOC format of a Word document, the XLS format of an Excel spreadsheet, the PPT format of a PowerPoint presentation, or the MDB format of an Access database. These formats are unacceptable for archiving because their information content is not decipherable without the aid of the proprietary single-source tool. (The non-transparency of such a format will be illustrated in the next section.) Equally problematic are non-commercial formats that are developed by a particular project and supported only by software developed by that project. In either case, the information will cease to be available at the point in the future when the existing computer systems are no longer able to run a version of the software that understands the format.

Next, there is acceptable practice in which the archived form of the resource is in an enduring format. Formats listed with four plusses in Table 1 qualify as such. An archive should not accept materials in a non-enduring format, unless they first convert them to an enduring format or ask the linguist to do so.  Fortunately, commercial tools with proprietary formats typically have export functions.  For instance, a Microsoft Word document should never be archived in DOC format; instead, the Save As function should be used to convert the document to RTF or XML format, both of which preserve all of the text content and formatting in a transparent plain text format that future generations would be able to interpret. Some archives may also choose to accept formats that lack only the transparency feature. For instance, PDF (in spite of the fact that it was developed by a commercial company) has an openly published specification that is supported by the software of many vendors (including open source projects).  Thus many archives feel confident in preserving materials in that format.

The good news about archiving a format like PDF or RTF is that a snapshot of how the linguist presented the information will persist well into the future. However, the bad news is that it is a dead-end format since it enshrines just one particular presentation of the information; the information is not repurposeable. It is not in a form that can be loaded into another program for fresh analysis. It is not in a form that can be used to create other ways of presenting the same information or even to create derived information.

Finally, there is best practice in which the archived form is not only an enduring format but is also one that preserves the structure of the information in such a way that it can be repurposed.  Saving a database or spreadsheet in a plain text format like CSV is one example of such a format. So are systems of markup like the “Standard Format” system used in Shoebox or the hierarchical tags used in XML.  If the markup follows standards that have been used in many other language resources then those resources are not only repurposeable, but are interoperable as well.  That is, it is possible for a software service to operate in a uniform way over a whole collection of resources, such as to perform cross-linguistic searching.  (See, for instance, Simons 2005.)
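
To make the notion of repurposing concrete, the following minimal sketch in Python reads records in a backslash-coded style like that of Shoebox's Standard Format and re-expresses them as CSV. The field markers (\lx, \ps, \ge) follow a common Shoebox convention, but the two records and the function name parse_standard_format are invented for illustration, and the parser assumes the simplest possible case: one line per field, a blank line between records, and no repeated fields.

```python
import csv
import io

# Invented sample data; not taken from any real lexicon.
SAMPLE = """\\lx aba
\\ps n
\\ge placeholder gloss one

\\lx abi
\\ps v
\\ge placeholder gloss two
"""

def parse_standard_format(text):
    """Return a list of dicts, one per blank-line-separated record."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            marker, _, value = line.partition(" ")
            record[marker.lstrip("\\")] = value.strip()
        records.append(record)
    return records

records = parse_standard_format(SAMPLE)

# Because the structure is explicit in the markup, the same data can be
# rewritten in another enduring format by a short script.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["lx", "ps", "ge"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = out.getvalue()
print(csv_text)
```

Nothing comparable is possible with a PDF or RTF snapshot, where the boundaries between fields would have to be guessed from the formatting.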

Markup refers to the means by which the structure of information in a stream of characters is represented. For instance, in a dictionary the markup has to do with identifying the entries and the various parts within each entry. In a seminal work on the subject of markup, Coombs, Renear, and DeRose (1987) define and illustrate various approaches to markup—punctuational, presentational, procedural, descriptive—and give compelling arguments for the advantages of descriptive markup. (These arguments are summarized in Simons 1989 and section 6 of Simons 1998a.) Descriptive markup identifies what each text element is (that is, what conceptual class it is a member of), rather than specifying its display formatting or how it is to be processed.  Also known as generalized markup, this approach entered the mainstream through SGML (Standard Generalized Markup Language), which was adopted as ISO 8879 in 1986.  The Text Encoding Initiative’s Guidelines for the Encoding and Interchange of Machine-Readable Texts (Sperberg-McQueen and Burnard 1994) was a landmark application of SGML within a number of scholarly disciplines including linguistics.  Since that time a simplified version of SGML called XML (for Extensible Markup Language) has been promulgated as a recommendation of the World Wide Web Consortium (Bray, Paoli, and Sperberg-McQueen 1998), along with a host of related standards including XSL (for Extensible Stylesheet Language) which defines a functional programming language for processing XML data sources (Quin 2005).  The result of W3C adoption has been that XML is now supported by hundreds of software suppliers and has become ubiquitous as the standard mechanism for information interchange on the Internet.  
When the EMELD project (Electronic Metastructures for Endangered Language Data) convened its first workshop on the need for standards in digital language archiving, the company of invited experts easily reached consensus that XML descriptive markup constituted best current practice for the archival form of digital language data (EMELD 2001).

5. Illustrating levels of archival practice

In this section, the three levels of archival practice are illustrated with a sample from a dictionary of Sikaiana, a Polynesian language of Solomon Islands.  The sample consists of the first ten entries of the dictionary compiled by William Donner (1987). Table 2 illustrates many possible formats in which this information might be submitted to an archive. Click on “View through Acrobat” in Table 2 to see the presentational form of the sample that is most likely to work on any computer.

The “Actual file contents” links in Table 2 open web pages that show what the file actually contains when viewed through a plain text editor.  This is exactly what the archive users of the future will see if they lack the specific software tool for which the format was designed. The links in the final column show what the file looks like if viewed through its associated software tool. (Note that these links will have the desired effect only if your system is configured to associate the file extensions with a software tool that can interpret the format.)

Table 2: Examples of various file formats as archival forms

Unacceptable: Neither open nor transparent
  DOC        Actual file contents   View through Word
Acceptable: Open, but presentational and not transparent
  PDF        Actual file contents   View through Acrobat
More acceptable: Open and transparent, but presentational
  TXT        Actual file contents   View as plain text
  RTF        Actual file contents   View through Word
  WordML     Actual file contents   View through Word; View through XML viewer
  Word HTM   Actual file contents   View through browser
Even more acceptable: Open and more transparent, but presentational
  XHTML      Actual file contents   View through browser; View through XML viewer
Best practice: Open, transparent, and descriptive
  XML        Actual file contents   View through XML viewer

The DOC format of Microsoft Word illustrates unacceptable practice. Since it is a proprietary format (rather than an open one), we cannot assume that software of the future will be able to interpret it. Viewing the actual file contents demonstrates that the format is not transparent, with the result that future generations would not be able to easily extract the information from the file.

The PDF format illustrates acceptable practice. The actual file contents are similarly non-transparent, which means that the file contents would be opaque to future generations who wanted to extract information for further processing. However, the fact that the specification of the file format is open and is widely supported by both commercial and non-commercial software vendors means that this presentational form is likely to survive long into the future.

More acceptable practice is to archive a presentation form that is both open and transparent. The simplest such form is to save the presentation as a TXT file (e.g. use the Plain Text format with the Save As command of Microsoft Word). This preserves the stream of characters (including the punctuational markup), but loses the formatting which in some cases signals important information (such as the use of superscripting to distinguish homophones or the use of italics to distinguish Sikaiana words from English words).  Saving a Word document in Rich Text Format (RTF) preserves all the formatting information in the DOC file in a plain text representation that is openly documented (Microsoft 2004).  The format is idiosyncratic and verbose, with the result that the sample RTF file is four times the size of the sample TXT file.  However, it would be possible for future generations to write a script that would extract the useful information. Even more verbose is WordML (for Word Markup Language); it is an XML vocabulary developed by Microsoft (2005) for representing everything within a Word document.  It is created by choosing “XML” as the format with the Save As command of Microsoft Word. The bad news is that the WordML file is three times the size of the RTF file, but the good news is that it is a well-formed XML document that future generations could process with standard XML tools (like XSLT) to extract the useful information. Another acceptable format that preserves formatting is the HTML format that can be saved by Word; use the “Web Page, Filtered” format to eliminate most of the Word-specific formatting information.  This format is much less verbose than WordML and can be rendered by any Web browser.  Note that even though it looks like XML tagging, it is not a well-formed XML file.

Even more acceptable than the preceding formats is XHTML, a format that applies the discipline of XML well-formedness to the HTML markup vocabulary used by Web browsers. The result is a format that can both be rendered by any browser and be processed with standard XML tools. Microsoft Word does not produce this format. XHTML is an ideal target for a situation in which the working form of the dictionary is a database and the presentation form will be dynamically generated. The “Actual file contents” link shows that the format uses transparent presentational markup to identify paragraphs (with the <p> tag), anchors that support hyperlinking (with the <a> tag), boldface (with the <b> tag), and italics (with the <i> tag). Clearly, future generations could easily decipher this markup to recreate the presentation.
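
The practical value of well-formedness can be sketched in Python with the standard library's XML parser. The two fragments below are invented for illustration (they are not taken from the sample files), and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented fragments; the entry content is a placeholder.
xhtml_fragment = '<p><a name="e1"/><b>aba</b> <i>n.</i> placeholder gloss</p>'
html_fragment = '<p><b>aba</b> <i>n.</i> placeholder gloss<br>'  # tags left open

def is_well_formed(markup):
    """Return True if a standard XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(xhtml_fragment))  # the XHTML-style fragment parses
print(is_well_formed(html_fragment))   # the HTML-style fragment is rejected

# Once parsed, the presentational tags can be read off mechanically.
paragraph = ET.fromstring(xhtml_fragment)
bold_text = [b.text for b in paragraph.iter("b")]
print(bold_text)
```

A browser would display both fragments, but only the first can be handed to generic XML tools without a special-purpose HTML parser.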

But best practice would be to generate an archival form based on descriptive XML markup from the dictionary database, or even to create and maintain the dictionary as an XML document using a Document Type Definition (DTD) with an XML editor like XMLSpy Home Edition or the <oXygen/> XML editor. The DTD used in this sample is explained in Simons (1998b) and is based on the Text Encoding Initiative’s DTD for print dictionaries (Sperberg-McQueen and Burnard 1994).

In descriptive markup, the markup tags describe not the presentation formatting, but the structure and function of the information elements. Clicking on the “Actual file contents” link we see that the tags specifically identify entries, homophone numbers, headword forms, etymologies, senses, parts of speech, grammatical notes, definitions, usage notes, examples, translations, cross-references, Sikaiana words in definitions and notes, and more. Not only are formatting concepts like paragraph, bold, italic, and superscript completely absent from the underlying information, so are the elements of punctuational markup (like the parentheses around etymologies, the brackets around grammatical information, the comma after an example and the single quotes around its gloss).  These represent display choices of the publication designer and not fundamental information content. The real information is embodied in the descriptive markup tags that identify the function of each part of the entry. It is clear that future generations (even though they lack our current working tools) will be able to see and understand all of the information that was in our dictionary database. They will then be able to transform it into whatever form is needed to load it into their working tools.  They will be able to reuse the information to create new and up-to-date presentation forms, to serve as the starting point for more work on the same language, or to explore in cross-linguistic searching applications.
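
The division of labor between descriptive markup and presentation can be sketched as follows. The entry, its element names, and the render function are all invented for illustration; they are not the DTD or the data of the Sikaiana sample. The point is only that the parentheses, brackets, comma, and quotation marks appear nowhere in the stored entry and are supplied at the moment of presentation.

```python
import xml.etree.ElementTree as ET

# Invented entry with illustrative element names.
ENTRY = """
<entry>
  <hom>1</hom>
  <form>aba</form>
  <etym>PPN *apa</etym>
  <gram>n</gram>
  <def>placeholder definition</def>
  <eg><text>aba abi</text><tr>placeholder translation</tr></eg>
</entry>
"""

def render(entry):
    """Build one print-style line; all punctuation is added here."""
    parts = [
        # A print layout would set the homophone number as a superscript.
        entry.findtext("form") + entry.findtext("hom"),
        "(" + entry.findtext("etym") + ")",
        "[" + entry.findtext("gram") + "]",
        entry.findtext("def") + ".",
    ]
    for eg in entry.findall("eg"):
        parts.append(eg.findtext("text") + ", '" + eg.findtext("tr") + "'")
    return " ".join(parts)

line = render(ET.fromstring(ENTRY))
print(line)
# aba1 (PPN *apa) [n] placeholder definition. aba abi, 'placeholder translation'
```

A different publication design would change only the render function; the archived entry would stay exactly as it is.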

Table 3 offers a demonstration of how descriptive markup supports repurposing. It shows three different presentation forms of the best practice archival form from the end of Table 2. The first presentation in Table 3 shows all of the information in the Sikaiana dictionary entries, the second presentation shows a simplified form with just headwords and definitions, and the third presentation gives an etymological view in which the Proto-Polynesian etymon is the headword. In all three cases, the source information is read from sample-best.xml (shown in the last row of Table 2).  “Actual file contents” shows the raw XML input that uses a general entity declaration and reference (&data;) to pull in the archival data file and an <?xml-stylesheet?> processing instruction to attach an XSL stylesheet.  The only thing that is different in the three inputs is the stylesheet file, which can be seen by clicking on “Stylesheet”.  (If you do not have an application assigned to the XSL extension, choose a plain text editor like Notepad or an XML viewer like Internet Explorer.)  Clicking on “View through browser” will cause your web browser to dynamically generate the presentation form.  This works in Internet Explorer and other browsers that are fully compliant with the XML and XSL standards.  If that does not work in your browser, click on “Transformed to HTML” to see a static form of the presentation that has been pregenerated and saved.

Table 3: Demonstration of descriptive markup and repurposing

Presentation   Input                  Output
Full content   Actual file contents   View through browser; Transformed to HTML
Simplified     Actual file contents   View through browser; Transformed to HTML
Etymologies    Actual file contents   View through browser; Transformed to HTML

Simons, Olson, and Frank (2006) and Frank and Simons (2006) give further examples of archiving language documentation in an XML descriptive markup and then using an XSL stylesheet to produce a presentation form for the web.
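
The same kind of repurposing can be imitated in a few lines of Python for readers who want to experiment without an XSL processor. This sketch substitutes ordinary Python functions for the XSL stylesheets used in Table 3, and the three entries, their element names, and the function names are invented placeholders rather than the actual sample data.

```python
import xml.etree.ElementTree as ET

# Invented data with illustrative element names (not the sample's DTD).
DICTIONARY = """
<dictionary>
  <entry><form>aba</form><etym>*apa</etym><def>first placeholder</def></entry>
  <entry><form>abi</form><etym>*api</etym><def>second placeholder</def></entry>
  <entry><form>abu</form><etym>*apa</etym><def>third placeholder</def></entry>
</dictionary>
"""

root = ET.fromstring(DICTIONARY)

def simplified_view(root):
    """Headwords and definitions only."""
    return [f"{e.findtext('form')}: {e.findtext('def')}" for e in root.iter("entry")]

def etymological_view(root):
    """Group headwords under their etymon, sorted by etymon."""
    index = {}
    for e in root.iter("entry"):
        index.setdefault(e.findtext("etym"), []).append(e.findtext("form"))
    return [f"{etymon}: {', '.join(forms)}" for etymon, forms in sorted(index.items())]

simple = simplified_view(root)
etymological = etymological_view(root)
print(simple)
print(etymological)
```

One archival source yields both views; neither view could be derived from the other.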

6. As rock solid as ASCII

After giving this pitch for descriptive XML markup as the best practice for creating enduring results, a common response from critically thinking linguists has gone something like this: “How do we know that XML isn’t just another one of those ephemeral file formats?” To which my reply is something like, “I don’t think we need to worry about this—XML is as rock solid as ASCII.”  I liken it to the New Testament parable of the foolish man who built his house upon the sand versus the wise man who built his house upon the rock (Matthew 7:24–27). Archiving our language documentation and description in proprietary binary formats is like building our legacy on the shifting sands of someone else’s whim and fortune, while archiving in descriptive XML builds on the solid foundation of ASCII.

In digital storage, all information is ultimately reduced to a sequence of binary numbers. So how do we encode writing? ASCII (or the American Standard Code for Information Interchange) is the standard that says, for instance, that the number 65 will be used in a digital data stream to represent a capital A, 66 will represent B, and so on. In the early years of computers, there was no standard for how to represent writing on a computer. Each hardware manufacturer developed a different scheme; it soon became apparent that the resulting impediments to hardware interoperability and information sharing were not in anyone’s best interest. The leading manufacturers in the US got together and in 1963 ASCII was adopted as the common standard (Brandel 1999). It was elevated to the status of international standard, as ISO 646, in 1972. Though ASCII started life as a standard for encoding information written in English, it has come to transcend that single language and become the standardization of the Roman alphabet.  It remains unchanged as the lower half of all the character sets in the ISO 8859 family of standards adopted in 1987 for handling all the major languages of Europe. And it remains unchanged as the bottom code page in ISO 10646 (also known as Unicode)—the industry standard for character encoding that can now represent all major writing systems of the world (Unicode Consortium 2003).
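
These claims are easy to check in Python. The sketch assumes UTF-8 as the Unicode encoding form, which is a choice made here for illustration; the paragraph above speaks of Unicode code values in general.

```python
letters = "AB"
codes = [ord(ch) for ch in letters]
print(codes)  # [65, 66]

# Text confined to the ASCII range is stored as the very same bytes
# under ASCII, ISO 8859-1, and the UTF-8 encoding of Unicode.
word = "Sikaiana"
as_ascii = word.encode("ascii")
as_latin1 = word.encode("iso-8859-1")
as_utf8 = word.encode("utf-8")
same_bytes = as_ascii == as_latin1 == as_utf8
print(same_bytes)  # True
```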

Four decades after its adoption, the ASCII character code lies at the heart of the global information society. It is the Information Age’s version of the Roman alphabet.  Just as the graphic forms of the Roman capital letters were standardized for all time when they were carved into stone millennia ago by the ancients, so I believe that historians of the future will look back to the adoption of ASCII as the moment in time when the digital form of that alphabet was similarly carved into stone. It is reasonable to assume a similar degree of durability.

The default character set for XML is Unicode, which is itself fundamentally built on ASCII. All of the metacharacters that have a special function within XML are confined to the ASCII character set. Furthermore, there is a standard transformation by which any Unicode character can be represented as a character reference, which transparently (though more verbosely) encodes the Unicode character as a sequence of ASCII characters. Thus, as long as there exist plain text editors that can view any ASCII data stream, the capacity of future generations to open and interpret XML data files will persist.
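As an illustration of the character-reference transformation, here is a minimal sketch using only the Python standard library. The element name `form`, the helper `to_ascii_xml`, and the sample string are hypothetical and chosen for the example. The string contains the letter eng (U+014B), which lies outside ASCII. It is written out as the ASCII sequence `&#331;`, and a standard XML parser restores the original character.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ascii_xml(text: str) -> bytes:
    """Return text as pure-ASCII XML character data.

    Markup characters (&, <, >) are escaped first; every remaining
    non-ASCII character becomes a numeric character reference.
    """
    return escape(text).encode("ascii", errors="xmlcharrefreplace")


# Hypothetical sample: LATIN SMALL LETTER ENG (U+014B) followed by "a".
word = "\u014ba"

# Wrap the encoded text in a hypothetical descriptive tag.
element = b"<form>" + to_ascii_xml(word) + b"</form>"
print(element)

# Any conforming XML parser turns the reference back into the character.
restored = ET.fromstring(element).text
print(restored == word)
```

Every byte in the resulting stream is below 128, so the stream can be inspected in any plain text editor that handles ASCII.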

The genius of XML is that it can use an ASCII data stream to overcome the two main limitations of ASCII. First, because it incorporates Unicode, XML can encode text in virtually any language, not just English text. Second, because it supports user-defined descriptive tags, XML can transparently encode the structure of information, not just the stream of characters. The designers of ASCII tried to address the structure of information by defining so-called control characters for “field separator,” “record separator,” and the like. But these characters have not stood the test of time; they provided too simplistic a view of information structure and have simply fallen into disuse. Descriptive markup with angle-bracketed tags (first in SGML and now in XML) has emerged as the ASCII-based solution that is standing the test of time.
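The contrast between the two approaches can be sketched as follows. This is an illustrative example only: the records are placeholder values, and the tag names `wordlist`, `entry`, `form`, and `gloss` are hypothetical. The sketch uses ASCII's unit separator (code 31) and record separator (code 30) for the control-character version.

```python
import xml.etree.ElementTree as ET

US = "\x1f"  # ASCII unit separator (code 31), used here between fields
RS = "\x1e"  # ASCII record separator (code 30), used here between records

# Placeholder records (hypothetical values, not real lexical data).
records = [("form1", "gloss1"), ("form2", "gloss2")]

# Control-character encoding: structure is positional and invisible.
control_stream = RS.join(US.join(fields) for fields in records)


def parse_control(stream):
    """Recover the records; nothing in the stream says what each field means."""
    return [tuple(rec.split(US)) for rec in stream.split(RS)]


# Descriptive markup: every piece of information is named in the stream.
root = ET.Element("wordlist")
for form, gloss in records:
    entry = ET.SubElement(root, "entry")
    ET.SubElement(entry, "form").text = form
    ET.SubElement(entry, "gloss").text = gloss
xml_stream = ET.tostring(root, encoding="unicode")


def parse_xml(stream):
    """Recover the records by asking for each field by name."""
    return [(e.findtext("form"), e.findtext("gloss"))
            for e in ET.fromstring(stream).iter("entry")]


print(repr(control_stream))
print(xml_stream)
```

Both streams carry the same records, but only the XML version tells a future reader which field is the form and which is the gloss. The control-character version depends on outside documentation of the field order.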

Another indicator of the near-certain longevity of XML is the answer to another question from critically thinking linguists, namely, “Is XML really practical, or is it just another nice theory?” Again, the answer to the latter is a resounding “No!” XML has become part of the fabric of the global information infrastructure. It is the centerpiece of a whole family of open standards from the World Wide Web Consortium that have been embraced by all the major software vendors (like Microsoft, IBM, Sun, and Oracle). On top of this, hundreds of small vendors and open-source projects have developed tools that implement the XML family of standards. In short, XML is fueling a level of information interchange and reuse that is unprecedented in history.

7. Conclusion

In conclusion, I want to go beyond the question of what a single linguist should do, to ask “What’s linguistics to do?”  At present, the individual linguist who embraces these concerns is more like the lone prophet who is bucking the tide than like the professional who is flowing in the mainstream. For enduring practice to really take hold, the linguistics community as a whole needs to recognize the fleeting value of digital working forms and presentation forms and to encourage practices that will result in enduring archival forms.  Three steps in this direction would be:

Only by taking steps like these can we ensure that our digital data will truly endure.

References

Bergeron, Bryan.  2002.  Dark ages II: When the digital data die.  Upper Saddle River, NJ: Prentice-Hall.

Bird, Steven and Gary Simons. 2003. Seven dimensions of portability for language documentation and description. Language 79:557–582. Online preprint:

Brand, Stewart.  1998.  Written on the wind.  Civilization Magazine, November 1998.  Online:

Brandel, Mary. 1999.  1963: ASCII debuts.  Computerworld, 12 April 1999. Online:,11280,35241,00.html

Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998. World Wide Web Consortium. Online:

Byers, Fred R. 2003. Care and handling of CDs and DVDs: A guide for librarians and archivists. Washington, DC: Council on Library and Information Resources and Gaithersburg, MD: National Institute of Standards and Technology. Online: and

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. Markup systems and the future of scholarly text processing. Communications of the ACM 30(11):933–947. Online:

Deegan, Marilyn and Simon Tanner.  2002.  The digital dark ages.  Library and Information Update, May 2002 issue.  Online:

Donner, William. 1987. Sikaiana Vocabulary: Na male ma na talatala o Sikaiana. Honiara, Solomon Islands: published by the author through a grant from the South Pacific Cultures Fund of the Australian government. 267 pp.

EMELD. 2001. Working group reports and recommendations. Workshop on the Digitization of Language Data: The Need for Standards. Santa Barbara, California, 21-24 June 2001.  Online:

EMELD. 2005. EMELD School of Best Practices in Digital Language Documentation. Electronic Metastructures for Endangered Language Documentation project, Eastern Michigan University.  Online:

Frank, Paul S. and Gary F. Simons. 2006. Sáliba wordlist project: A case study in best practices for archival documentation of an endangered language. To appear in SIL Electronic Working Papers.

Jesdanun, Anick.  2003.  Coming soon: A digital dark age?  Associated Press, New York, 21 January 2003.  Online:

Ladefoged, Peter. 2003. Phonetic data analysis: An introduction to fieldwork and instrumental techniques. Oxford: Blackwell.

LSA. 2005. 2005 Annual Meeting. LSA Bulletin, number 187 (March 2005). Online:

MATRIX. 2002. Digital imaging for archival preservation and online presentation: Best practices. MATRIX: The Center for Humane Arts, Letters and Social Sciences, Michigan State University. Online:

McKie, Robin and Vanessa Thorpe.  2002.  Digital Domesday Book lasts 15 years, not 1000.  The Observer, 3 March 2002.  Online:,6903,661093,00.html

Microsoft. 2004. Word 2003: Rich Text Format (RTF) specification, version 1.8. Microsoft Corporation.  Online:

Microsoft. 2005. Office 2003: XML reference schemas. Microsoft Corporation. Online:

NSF. 2005. Documenting Endangered Languages (DEL): An interagency partnership. Program Solicitation NSF 05-590. Online:

Quin, Liam (ed.). 2005. The Extensible Stylesheet Language Family (XSL). World Wide Web Consortium. Online:

Simons, Gary F. 1989. Chapter 1, “Introduction,” and chapter 2, “A generic style sheet for academic publishing.” In Laptop Publishing for the Field Linguist: An approach based on Microsoft Word, edited by Priscilla M. Kew and Gary F. Simons.  Occasional Publications in Academic Computing, number 14.  Dallas, TX: Summer Institute of Linguistics. 

Simons, Gary F. 1998a. The nature of linguistic data and the requirements of a computing environment for linguistic research. In Using Computers in Linguistics: a practical guide, John M. Lawler and Helen Aristar Dry (eds.). London and New York: Routledge. Pages 10-25. Online preprint:

Simons, Gary F.  1998b.  Using architectural processing to derive small, problem-specific XML applications from large, widely-used SGML applications. SIL Electronic Working Papers 1998-006.  Online:

Simons, Gary F.  2005.  Beyond the brink: Realizing interoperation through an RDF database.  EMELD Workshop on Morphosyntactic Annotation and Terminology: Linguistic Ontologies and Data Categories for Linguistic Resources. Cambridge, MA, 1-3 July 2005.  Online:

Simons, Gary F., Kenneth S. Olson, and Paul S. Frank. 2006.  Ngbugu digital wordlist: Archival, accessible, and adequate documentation.  To appear in Linguistic Discovery.

Sperberg-McQueen, C. M. and Lou Burnard. 1994. Guidelines for the encoding and interchange of machine-readable texts.  Chicago and Oxford: Text Encoding Initiative. Online:

Stepanek, Marcia.  1998.  Data storage: from digits to dust.  Business Week, 20 April 1998.  Online:

Unicode Consortium.  2003.  The Unicode Standard, Version 4.0.  Boston, MA: Addison-Wesley.  Online:

Van Bogart, John W. C. 1995. Magnetic tape storage and handling: A guide for libraries and archives. Washington, DC: Commission on Preservation and Access and St. Paul, MN: National Media Laboratory. Online: