This workpaper helped to launch the OLAC effort. It was originally posted as a workpaper of the TalkBank project but has since been removed.

Developing an infrastructure for online linguistic archives

Gary Simons, SIL International
12 June 2000

Problem statement

A number of institutions are in the process of developing online archives for documenting so-called "low density" languages. These include:

Many of these efforts are described in Steven Bird's Linguistic Exploration page, as are a number of projects that focus on documenting a single language. In addition to the above projects, which are located in the US, I can think of the following that are elsewhere:

Though we might wish for an ideal world in which all language documentation was archived in a consistent manner in a single archive, it seems clear that the world we live in requires that any institution be able to host an archive of the research it has pursued or sponsored. But developing the infrastructure for such an archive (including delivery mechanism, formatting standards, and supporting software) is a huge task that is beyond the capacity of any one of these institutions to accomplish on its own. And once a host of institutions have set up online archives, the individual linguist who wants to find all the resources that might be available on a particular language faces two further problems: knowing what is held in each of these archives, and determining how the different archives have named or classified that language.

Thus, as we embark on the age of online electronic archiving of language documentation, the linguistics community is facing two big problems:

The major components of a solution

Much of the effort needs to be distributed:

But in order for this distributed approach to dissemination and development to work, the basic components of infrastructure need to be centralized. Specifically,

All of these centralized functions need not be at the same site, as long as the linguistic community knows where to go for each of the centralized functions. At this point, I would propose the following three sites for these centralized functions:

Defining standards for metadescription

The most basic requirement of building a useful worldwide archive of linguistic documentation is to have a standard for metadescription of the resources, so that there can be a single catalog in which all materials in the distributed collection are consistently cataloged. This is because the most fundamental requirement of an archive is that the user be able to locate the material that may be of interest. Consistency in format of the resources themselves is not nearly as important as consistency in format of the metadescriptions. For instance, in the card catalog of a library, all of the catalog entries have a consistent format--even though the resources they catalog may be of very different formats--and this catalog is the key to using the library.

In the same way, the key to a distributed digital library of documentation on languages of the world will be a single and consistent catalog. The metadescriptions in that catalog will make it possible for users to find resources of interest. The catalog would store a URL to the resource wherever it was posted on the web.
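As a rough illustration of the catalog idea described above, the sketch below models a metadescription as a small record and filters a catalog by language. The field names (`title`, `language`, `data_type`, `data_format`, `url`), the `find_by_language` helper, and the example.org URLs are all assumptions made for illustration; no metadescription standard has been defined at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadescription:
    """One catalog entry. All field names are illustrative assumptions."""
    title: str
    language: str     # language name as recorded by the cataloger
    data_type: str    # e.g. "word list" or "lexicon"
    data_format: str  # e.g. "XML" or "plain text"
    url: str          # where the resource itself is posted


def find_by_language(catalog, language):
    """Return the entries whose language name matches, ignoring case."""
    wanted = language.casefold()
    return [entry for entry in catalog if entry.language.casefold() == wanted]


# Hypothetical entries describing resources held by two different archives.
catalog = [
    Metadescription("Comparative word list", "Language A", "word list",
                    "XML", "http://archive-one.example.org/a-wordlist.xml"),
    Metadescription("Field recordings", "Language A", "annotated speech",
                    "WAV with XML annotation",
                    "http://archive-two.example.org/a-recordings/"),
    Metadescription("Grammar sketch", "Language B", "description",
                    "plain text", "http://archive-one.example.org/b-grammar.txt"),
]

for entry in find_by_language(catalog, "language a"):
    print(entry.title, "->", entry.url)
```

Note that the lookup only works if every archive records the language name in the same way, which is exactly the naming and classification problem raised in the problem statement.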

This suggests the following activities as essential first steps in developing the infrastructure for online linguistic archives:

Establishing best practice for data

In order not to exclude the wealth of electronic language resources that already exists, the entry level of the archive should not exclude data on the basis of format. At a minimum, if a resource has a metadescription that conforms to the standard, then it should be archivable regardless of format. The metadescription should include a description of the format as a further aid to the user who may be interested in consulting it. However, attention to data must not end there. Two factors are critical: longevity and best practice.

We must be concerned with the longevity of archived data. That is, data need to be stored in formats that will still be readable far into the future. The least long-lived formats are proprietary binary formats that can only be read using a single vendor's licensed software. For instance, ten-year-old Microsoft Word files can no longer be read by current versions of the program. The most long-lived formats are those that are independent of vendor and proprietary software. Plain ASCII text files are an example of a format that has not lost readability over the past forty years, and XML promises to have the same kind of longevity (with the added bonus that it uses the worldwide Unicode character set and explicitly encodes document structure).
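To make the point about XML concrete, the following sketch reads a tiny XML fragment using only the Python standard library. The element and attribute names (`lexicon`, `entry`, `form`, `gloss`) and the wordforms are invented for illustration and are not a proposed standard. The point is only that the document structure and the non-ASCII characters can be recovered without any vendor-specific software.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the letters "ŋ" and "ɔ" are outside ASCII
# but are ordinary Unicode characters.
SAMPLE = """
<lexicon language="Example">
  <entry><form>kaŋu</form><gloss>stone</gloss></entry>
  <entry><form>mɔli</form><gloss>bird</gloss></entry>
</lexicon>
"""


def read_entries(xml_text):
    """Return (form, gloss) pairs from the hypothetical lexicon markup."""
    # Without an XML declaration the parser assumes UTF-8, so encode first.
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    return [(entry.findtext("form"), entry.findtext("gloss"))
            for entry in root.iter("entry")]


for form, gloss in read_entries(SAMPLE):
    print(form, "=", gloss)
```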

We must also establish guidelines for what our community considers to be best practice. Best practice will always entail a long-lived format, but it must go beyond that to recommend the best way to use long-lived formats. For instance, it is not enough to say, "Use an XML format"; we must go further to recommend a specific tagging system that is defined by an XML DTD or Schema. There are two main reasons for this:

This suggests the following activities as further steps in developing the infrastructure for online linguistic archives:

A framework for the repository

Building the software resources needed to create and use an online linguistic archive is too big a job for any one institution. The approach should therefore be to do it as a community. The key to making this work is having a single, open, online repository of all the formatting standards and everything that has been implemented to support those standards.

A mature repository could hold hundreds of resources; thus it is important to develop a means for organizing it. I propose that we view it as a two-dimensional framework based on data type and function, which would look something like this:

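Data type        | Store | Display | Query | Convert | Create
-----------------+-------+---------+-------+---------+-------
Word list        |       |         |       |         |
Writing system   |       |         |       |         |
Annotated speech |       |         |       |         |
Field notes      |       |         |       |         |

Functions (columns): Store = formats for storing data; Display = stylesheets for displaying data; Query = procedures for querying data; Convert = procedures for converting data; Create = programs for creating data.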

The idea is that any resource to be deposited in the repository needs to be placed in one of these cells, based on what function it performs for what type of data. The index page of the implemented repository could be a table like this, with a link in each cell jumping to a page of relevant resources. This framework would also give us a way of talking about who is doing what and of organizing further work. That is, we could categorize the work of a developer by identifying which cell the result would go into, and we could look for cells with no contents to know where it is important to do future work.
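The framework can be pictured as a grid keyed by data type and function. The sketch below is one possible way to model it; the `Repository` class, its method names, and the deposited file names are hypothetical, and the data types listed are only the rows that appear in the table above.

```python
from collections import defaultdict

FUNCTIONS = ["Store", "Display", "Query", "Convert", "Create"]
DATA_TYPES = ["Word list", "Writing system", "Annotated speech", "Field notes"]


class Repository:
    """A two-dimensional index of resources: one cell per (data type, function)."""

    def __init__(self, data_types, functions):
        self.data_types = list(data_types)
        self.functions = list(functions)
        self._cells = defaultdict(list)

    def deposit(self, data_type, function, resource):
        """File a resource in the cell for its data type and function."""
        if data_type not in self.data_types:
            raise ValueError(f"unknown data type: {data_type}")
        if function not in self.functions:
            raise ValueError(f"unknown function: {function}")
        self._cells[(data_type, function)].append(resource)

    def resources(self, data_type, function):
        """List what has been deposited in one cell."""
        return list(self._cells.get((data_type, function), []))

    def empty_cells(self):
        """Cells with no contents, i.e. where future work is needed."""
        return [(d, f) for d in self.data_types for f in self.functions
                if not self._cells.get((d, f))]


repo = Repository(DATA_TYPES, FUNCTIONS)
repo.deposit("Word list", "Store", "wordlist.dtd (hypothetical)")
repo.deposit("Word list", "Display", "wordlist.xsl (hypothetical)")
print(len(repo.empty_cells()), "cells still have no resources")
```

Listing the empty cells corresponds to the use suggested above: looking for cells with no contents to see where future work matters most.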

The rows (the horizontal dimension) are the types of data that we will archive. The following are some of the basic types we must deal with:

Metadescription
The description of what is in a data resource that can be used in the online catalog as an aid to finding the resource.
Word list
A list of wordforms in the language indexed by reference glosses (for example, a Swadesh word list). This is not just a simplified lexicon; unlike a lexicon, the indexing against a list of universal reference glosses provides a data structure for cross-linguistic comparison.
Writing system
A description of the writing system used to express text in the language.
Annotated speech
Samples of speech (in audio or video recording) that are annotated for transcription and various kinds of analysis. Interlinear text can be treated as a special case of annotated speech in which the base recording is absent.
Lexicon
A listing of the lexical items in a language with descriptions of their phonological form, morphosyntactic function, and semantics.
Field notes
The initial observations a linguist makes in the field.
Description
Any work of prose that describes some aspect of the language (for instance, a grammar sketch or a workpaper on phonology).
Common
This row is for resources that are common to all data types (like fonts or character-code conversion tables).
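The word list entry above notes that indexing against shared reference glosses gives a data structure for cross-linguistic comparison. A minimal sketch of that idea follows, with each word list modeled as a mapping from gloss to wordform. The `align_word_lists` helper is hypothetical, and Spanish and Italian are used only because their forms are widely known; the same structure would apply to any pair of languages.

```python
def align_word_lists(list_a, list_b):
    """Pair up the wordforms of two languages that share a reference gloss."""
    shared = sorted(set(list_a) & set(list_b))
    return [(gloss, list_a[gloss], list_b[gloss]) for gloss in shared]


# Each word list maps a reference gloss to the wordform in that language.
spanish = {"water": "agua", "fire": "fuego", "sun": "sol"}
italian = {"water": "acqua", "fire": "fuoco", "moon": "luna"}

for gloss, form_a, form_b in align_word_lists(spanish, italian):
    print(f"{gloss}: {form_a} / {form_b}")
```

Glosses that appear in only one list ("sun" and "moon" here) drop out of the comparison, which is why a common list of reference glosses matters.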

The columns (the vertical dimension) are the functions we want to perform with archived data. The main functions are:

Store
In order to store the data, we must establish its format. This is where DTDs (or XML Schemas) and accompanying coding manuals would go. A single format might be nominated as best practice, while other formats that are in use could still be listed and documented.
Display
The function that will most often be invoked by users of an archive is to display the data. To assist with this, this column in the repository framework will store stylesheets (especially XSL) designed for displaying linguistic data. ActiveX components or Java applets for achieving special linguistic displays would also fit here.
Query
The other thing that archive users will most often do is query the data. Users would query metadescriptions to find resources, so that is the cell where querying of the archive catalog belongs. But once the resources of interest have been found, users will want to perform queries on the details of the content. The XML query language under development by the W3C will go a long way toward providing this functionality, but cells in this column could still contain sample queries on the various types of data or tools (such as dynamic web pages) that provide simplified interfaces for building queries on a particular data type.
Convert
An important function to be performed by archives and by contributors is to convert data from the format in which it was developed to the format prescribed as best practice. A key issue in this respect will be conversion from (often arbitrary) 8-bit character coding to Unicode; a general tool for character conversion along with a host of conversion tables for specific character sets would go in the Common row. The other key issue will be conversion from one style of markup into another; for this, there could be programs or XSLT scripts specific to a particular data type.
Create
Creating the data in the first place is the function most on the mind of the researcher, but unlike the other four functions, it is not in fact a function in which the archive itself participates. Thus, it could be argued that this function should not really be in focus for a project to develop the infrastructure for an online archive. It is a problem of vital concern to the community, however, and the repository should thus manage this function within its framework so that tools for creating data can be made known to the community as they become available. An intended outcome of this effort should be that groups who have already developed tools to create data would add support for interchanging data in the best practice format. In this way, their tools would be able to participate fully as producers and consumers of archive data.
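The character conversion step mentioned under the Convert function can be sketched as a general routine driven by a per-character-set table. The table below is a made-up legacy encoding in which three high bytes stand for phonetic letters; the `to_unicode` function and the `LEGACY_TO_UNICODE` table are assumptions for illustration, not an existing tool or a real character set.

```python
# Hypothetical conversion table for one legacy 8-bit character set.
LEGACY_TO_UNICODE = {
    0x80: "\u014b",  # LATIN SMALL LETTER ENG
    0x81: "\u0254",  # LATIN SMALL LETTER OPEN O
    0x82: "\u025b",  # LATIN SMALL LETTER OPEN E
}


def to_unicode(raw, table):
    """Convert legacy 8-bit bytes to a Unicode string using a conversion table.

    Bytes listed in the table are replaced by their Unicode equivalents,
    other bytes below 0x80 are treated as ASCII, and anything else is
    reported rather than silently guessed.
    """
    chars = []
    for byte in raw:
        if byte in table:
            chars.append(table[byte])
        elif byte < 0x80:
            chars.append(chr(byte))
        else:
            raise ValueError(f"no mapping for byte 0x{byte:02X}")
    return "".join(chars)


print(to_unicode(b"ka\x80u m\x81li", LEGACY_TO_UNICODE))
```

Keeping the routine general and the tables separate matches the division suggested above: one conversion tool plus a collection of tables, one per character set.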