SIL Electronic Working Papers 1997-003, June 1997
J. Albert Bickford
All rights reserved.

A Rich Model for Presenting Interlinear Text[1]

J. Albert Bickford

Traditional presentation of interlinear text
A rich text model


This paper examines the advantages of presenting interlinear texts using a multi-line model. The particular one discussed here includes three types of transcription, glosses and notes in two languages, glosses and translations at three levels, and a citation form for each inflected word.

This model has many advantages over traditional three-line models. It addresses a broader audience of scholars and native speakers, provides clearer representation of the semantic structure, and links directly with published dictionaries. The greater number of lines enables editors and others who don't know the language to assist the linguist by checking for consistency between lines.

If texts from different languages are all prepared according to this model, comparison between them is much easier. Such a collection is currently being assembled, consisting of texts from languages in Mexico and the Southwest U.S.A., with accompanying explanatory materials.

Traditional presentation of interlinear text

In linguistic studies, interlinear text materials have traditionally been presented with just two lines of annotation for each line of text. For example, (1) has a line of text with morpheme breaks, a second line with aligned morpheme glosses, and a third line with a free translation.

(1) Traditional model for interlinear text

t- o- amatoX    / 'amák ko- t- o- atni  =ma  =X...
                / fire  3Ob-Rl-UO-touch DSRl UT

He was making fire, he was making smoke signals...

The simplicity of this presentation was largely dictated by available technology. Until computer software became available to maintain vertical alignment and do much of the glossing semi-automatically, any attempt to provide more information on more lines was so tedious and time consuming that few people attempted it.

A Rich Text Model

With specialized text glossing software, however, it is much easier to include many lines in the text model (the specification of which lines are included and what information each contains), and thus provide much more information about the text and the language[2]. This article illustrates the benefits of doing so, by discussing one particular text model that has been developed over the past ten years for use in a public archive of glossed texts and other related materials from languages of Mexico and the Southwestern United States[3]. This model is illustrated in (2).

(2) A rich model for interlinear text

\po Toomatox           hamac_cötootnimax...
\ew                    he.was.signaling.with.smoke
\sw encendía.lumbre    señalaba.con.humo
\cf coomatox           cootni

\to tóomatoX           'amák_kwtóotnimaX  
\mr t- o- amatoX       'amák  ko- t- o- atni  =ma  =X  
\em                    fire   3Ob-Rl-UO-touch DSRl UT
\sm Rl-OI-encender     lumbre 3Ob-Rl-OI-tocar SDRl TI  

\et He was making fire, he was making smoke signals...  
\st Encendía lumbre, señalaba con humo...  
\en /amatoX/ signifies making fire by rubbing two sticks together.  
\sn /amatoX/ significa encendiendo lumbre por frotación de dos palitos.

Each line begins with a short marker identifying it; these are explained in (3)[4].

(3) Markers used in (2)

\po Practical Orthography
\ew English glosses for each Word
\sw Spanish glosses for each Word
\cf Citation Form (for crossreference to published dictionaries)
\to Technical Orthography (in standard phonetic symbols, but showing surface contrast only)
\mr Morphemic Representation (abstracting away from morphophonemic variation)
\em English glosses for each Morpheme
\sm Spanish glosses for each Morpheme
\et English (free or semi-literal) Translation
\st Spanish (free or semi-literal) Translation
\en English Notes
\sn Spanish Notes
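
As a rough illustration of how lines marked in this way can be handled by software, the following Python sketch reads a block such as (2) into a mapping from marker to line content. It is only a sketch under simple assumptions (one marker per line, marker separated from content by a single space); the name parse_block is hypothetical and does not come from IT, Shoebox, or LinguaLinks. The sample lines are copied from (2).

```python
def parse_block(text):
    """Map each marker (e.g. 'po') to the text that follows it on its line."""
    record = {}
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # skip blank lines and anything without a marker
        # Split at the first space only, so spacing inside the line survives.
        marker, _, content = line[1:].partition(" ")
        record[marker] = content.rstrip()
    return record


SAMPLE = "\n".join([
    r"\po Toomatox           hamac_cötootnimax...",
    r"\sw encendía.lumbre    señalaba.con.humo",
    r"\cf coomatox           cootni",
    r"\et He was making fire, he was making smoke signals...",
])

record = parse_block(SAMPLE)
print(record["cf"].split())
```

Because the spacing inside each line is left untouched, the vertical alignment between lines is preserved when the lines are printed again one above the other.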

At first, this many lines may be confusing or perhaps even intimidating. It is quite a change from the very limited traditional models like (1). However, a person does not need to deal with all lines at once. Software can control which lines are displayed for a particular purpose; the rest can be hidden[5]. Moreover, experience has shown that people quickly become familiar with the different lines and their purposes, and are able to work with them effectively even while looking at all of them at once.
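
The idea of showing only some lines for a given purpose can be sketched in a few lines of Python. The view names and the choice of markers in each view are hypothetical, chosen only for illustration; they are not settings from any of the programs mentioned in this article. The line contents are taken from (2).

```python
RECORD = {
    "po": "Toomatox           hamac_cötootnimax...",
    "sw": "encendía.lumbre    señalaba.con.humo",
    "cf": "coomatox           cootni",
    "mr": "t- o- amatoX       'amák  ko- t- o- atni  =ma  =X",
    "sm": "Rl-OI-encender     lumbre 3Ob-Rl-OI-tocar SDRl TI",
    "st": "Encendía lumbre, señalaba con humo...",
}

# Hypothetical views: which markers to show, and in what order.
VIEWS = {
    "spanish_reader": ["po", "sw", "st"],
    "linguist": ["po", "mr", "sm", "st"],
}


def render(record, view):
    """Return only the lines that belong to the requested view."""
    wanted = VIEWS[view]
    return "\n".join(f"\\{m} {record[m]}" for m in wanted if m in record)


print(render(RECORD, "spanish_reader"))
```

A reader who knows Spanish but has no linguistic training would then see only the practical orthography, the Spanish word glosses, and the Spanish translation, while the remaining lines stay in the record untouched.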


With the great number of possible lines that could conceivably be included in a rich text model, the specific ones used should be chosen carefully with a specific audience and purpose in mind. To see how goals influence the design of a text model, consider the goals for this model, as listed in (4).

(4) Goals for the text model in (3)

a. To faithfully represent as much information as possible about the texts and the language, as part of a permanent and publicly-accessible archive of language data
b. To facilitate the comparison of different languages in the archive
c. To facilitate use of the texts with other published materials
d. To address the interests and needs of linguists, anthropologists, folklorists, and native speakers
e. To make the texts accessible to people who know either English or Spanish (but not necessarily both)
f. To do all this within practical constraints of time, money, and available knowledge

Each line in the text model addresses specific needs of one or more groups in the target audience (4d and 4e). A smaller text model would have required many compromises, making it harder for one or more groups to use the materials. A larger text model was thus required by the diversity of the audience that we were preparing materials for.

However, once the extra lines were added, we discovered extra benefits from this particular combination of lines. I discuss below both the original reasons and the unexpected benefits which combine to make this an extremely useful model for presenting interlinear text.

The text is provided in three transcriptions. One is in the practical orthography used by native speakers (the \po line). It is included both for their benefit and because existing transcriptions of texts almost always are in the practical orthography. The second transcription line is a technical one using standard phonetic symbols (the \to line); this line is intended for professional linguists and makes comparison between languages easier. These two are distinct from a more abstract representation of the morphological structure, in the \mr line. (For guidelines on representing morphological structure and morpheme glosses, see Lehmann 1982.)

To see why it was important to have lines both with and without morpheme breaks, compare the \to and \mr lines in (2). Without knowing the phonological analysis, one could not pronounce the data accurately based solely on the \mr line. A 3-line model like (1), in which the data is represented only once, contains an awkward compromise: by representing the morphemes consistently, it omits direct representation of the actual pronunciation. In other words, some of the most basic facts are hidden. Alternatively, a 3-line model could preserve the surface facts of pronunciation, but at the expense of representing morphemes inconsistently by means of their surface variants. By including both types of transcription, we can represent both phonological and morphological facts without compromising either.

The word glosses, in the \ew and \sw lines, are included especially for the benefit of non-linguists, for whom morpheme glosses may be largely useless or even unintelligible. For such an audience, it is important to make the word glosses very readable; in materials that have been prepared according to this model, they contain no specialized linguistic terminology and consist entirely of translation equivalents which are adjusted to reflect the context, according to the rules of the target language (English or Spanish).

For example, in (2), the English word glosses use the masculine pronoun 'he', even though the Seri forms are neutral as to gender, because in this context the words refer to a male. The same words in a different context might be glossed with 'she' or 'it'. Thus, in word glosses, some precision can (and I believe, should) be sacrificed in order to achieve smooth readability; those who want precision can always look at the morpheme glosses (\em and \sm).

The remaining lines require little comment. The model includes glosses, translations, and notes in two target languages, English and Spanish. Indeed, this accounts for much of the bulk in the model, but it is of course important to make the texts in the archive useful to scholars and native speakers in both the United States and Mexico. The \cf, or citation form, line provides an indication of where to find additional information about each word in a published dictionary, and thus facilitates study of texts and dictionary together[6]. Finally, the \en and \sn lines provide a place to include all sorts of miscellaneous information, the sort of information that would normally be included in footnotes in conventional publication.
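
Because the \cf line pairs each inflected word with its citation form, an index of word forms grouped by citation form can be derived from the \po and \cf lines alone. The Python sketch below shows one way this might be done. It assumes that the two lines contain the same number of whitespace-separated items, and that a lone "/" is a break marker rather than a word (an assumption about the examples, not a documented convention); the function name is hypothetical. The data come from (2) and (5).

```python
from collections import defaultdict

RECORDS = [
    {"po": "Toomatox hamac_cötootnimax...",
     "cf": "coomatox cootni"},
    {"po": "hacx hant tahcniiixo / yoque cmam quih.",
     "cf": "hacx hant cacníiix / teeque cmam quih"},
]


def index_by_citation_form(records):
    """Group the word forms found in the texts under their citation forms."""
    index = defaultdict(set)
    for rec in records:
        forms = rec["po"].split()
        citations = rec["cf"].split()
        for form, citation in zip(forms, citations):
            if citation == "/":
                continue  # assumed break marker, not a word
            # Drop sentence punctuation and capitalization from the text form.
            index[citation].add(form.strip(".").lower())
    return index


index = index_by_citation_form(RECORDS)
for citation in sorted(index):
    print(citation, sorted(index[citation]))
```

With a larger collection of texts, each citation form would accumulate the different inflected forms attested for it, which is the kind of listing that helps a reader move between text and dictionary.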


Minor variations on this model can be made for special needs of different languages, and there are some lines that could conceivably be added. However, this basic model has served well to meet the goals in (4). In addition, because of its richness, it has provided some unanticipated benefits.

Many of these benefits came about because of including word glosses as well as morpheme glosses. At first, the reason for providing word glosses in the archive was a bit snobbish: it was a concession to the needs of non-linguists. Then, surprisingly, as linguists preparing the texts, we found we were relying on them quite heavily ourselves. They provide a helpful bridge between the highly analytical morpheme glosses and the free translation. For example, compare (1) and (2); in (2) it is easier to understand how the meaning of the whole sentence is derived from the meanings of the parts. This is because the mental task of relating the morphemes to the meaning of the whole is broken into two stages. The word glosses show how each morpheme contributes to the meaning of the whole word, and the free translation shows how each word contributes to the meaning of the sentence. Including word glosses thus makes the reader's job much easier in morphologically complex languages.

Further, this double layer of interlinear glosses can be exploited to give a clearer picture of the semantic structure of the language. When a word has more than one sense, the word gloss can be based on the sense in context, while the morpheme gloss can be based on the core meaning of the word. For example, look in (5) at the word cmam. The stem mam has two senses, 'ripe' and 'cooked'; the extended sense, 'cooked', is the one used here, and so it appears in the word gloss. The core meaning, 'ripe', is shown in the morpheme gloss.

(5) Using multiple glosses to highlight extended senses

\po hacx        hant   tahcniiixo           / yoque        cmam      quih.  
\ew somewhere   land   it.was.poured        /              cooked    the
\sw algún.lugar tierra fue.derramado        / se.dice      cocido    el  
\cf hacx        hant   cacníiix             / teeque       cmam      quih  

\to 'ákX        'ánt   ta'kníiiXo           / yókæ         kmám      k(i)'
\mr 'akX        'ant   t- aa'-akníiiX -o    / yo-ka-ææãSRõ k- mam    k' 
\em somewhere   earth  Rl-Pv- pour    -AdvS / Dt-US-say    SN-ripe   DefU 
\sm algún.lugar tierra Rl-Pv- derramar-SAdv / Dt-SI-decir  NS-maduro DefI 

\et cooked food was dumped out (there was so much). 
\st comida cocida se tiraba (había tanto). 

Similarly, when two or more words combine to form an idiom chunk, the word gloss can represent the idiomatic meaning, and the morpheme glosses can represent the literal meaning of the pieces that make up the idiom. For example, in (2), the second clause uses an idiom which means 'to make smoke signals'. The literal meaning, 'to touch fire', is apparent in the morpheme glosses, while the idiomatic meaning is in the word glosses. By having both word and morpheme glosses, the semantic structure is clearer than if there were just one level of glossing.

As it turns out, word glosses also provide a number of practical benefits while preparing the materials. First, word glosses can be provided more easily by native speakers, allowing them to participate more actively in the preparation of the materials than if only morpheme glosses were included in the model. Word glosses provide a vehicle for drawing out many of their intuitions about the language without requiring that they have formal linguistic training. Second, a morphemic analysis may be difficult to provide in early stages of analysis of a language, but word glosses are much easier. Thus, a person can still reap the benefits of preparing and studying glossed texts before the morphological analysis is well developed. This is especially important in languages with complex morphologies, or when a morphemic analysis cannot be provided because of other practical problems (such as the amount of time the preparer has available to work on the project). In such cases word glosses can still be provided by themselves, without morpheme glosses, as a way of presenting and preserving some information about the language, without being stymied by information that is not available.

Finally, having glosses at several levels, and in two languages, is a great help to editors and reviewers. Any time there is an inconsistency between two lines, this indicates a possible problem. For example, in a draft of the text it may not be obvious how the meanings of the morphemes combine to form the meanings of the words. This may be a clue that a better gloss could be found for one or more morphemes, or for the word as a whole. Or, certain important information may have been accidentally left out of the free translation; this is usually easier to spot by reading the word glosses than by decoding the morpheme glosses. Or, the English and Spanish versions of the same gloss or note may disagree. By comparing lines with each other, editors have a way of spotting and alerting the preparers to problems that they may have missed because of being too familiar with the language and the texts. If the editors had only a couple of lines to look at, they wouldn't notice the problems either, because they don't know the language. So, a richer text model gives editors and reviewers the tools they need to do their job, which is to help the preparers of the texts provide as complete, clear, and accurate a representation of the texts as possible.
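
One very simple kind of consistency check between lines can even be done mechanically: every word-level line should contain the same number of items as the text line it annotates. The Python sketch below illustrates this under the assumption, suggested by the examples in this article, that multi-word glosses are joined with periods so that each word or gloss is a single whitespace-separated item. The function name is hypothetical, and the defect in the sample draft (a missing citation form) was introduced deliberately for the demonstration.

```python
# Markers whose lines are aligned word by word with the text.
WORD_LEVEL = ("po", "ew", "sw", "cf", "to")


def word_count_mismatches(record, reference="po"):
    """List word-level lines whose item count differs from the reference."""
    expected = len(record[reference].split())
    problems = []
    for marker in WORD_LEVEL:
        if marker == reference or marker not in record:
            continue
        found = len(record[marker].split())
        if found != expected:
            problems.append((marker, found, expected))
    return problems


draft = {
    "po": "Toomatox hamac_cötootnimax...",
    "sw": "encendía.lumbre señalaba.con.humo",
    "cf": "coomatox",  # second citation form deliberately omitted for the demo
}

print(word_count_mismatches(draft))  # [('cf', 1, 2)]
```

A check like this catches only gross omissions; judging whether a gloss is accurate, or whether the English and Spanish versions agree, still depends on a human editor comparing the lines.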


A rich model such as in (2) can thus provide a number of benefits, which are summarized in (6).

(6) Advantages of (2) over (1)

a. Provides a fuller representation of phonological and semantic structure, without compromising the representation of morphological structure
b. Addresses the needs and interests of a wide audience, including nonlinguists (provided that one can selectively hide lines that are not of interest to a particular audience)
c. Allows nonlinguists (especially native speakers) to help prepare the texts
d. Allows editors and reviewers who do not know the language to play a more active and helpful role in preparing the texts

Current technological advances in text glossing software have thus made it much easier to present analyzed text in a rich model than was formerly possible even with simpler models. We can hope that this capability will result in a larger and more useful corpus of interlinear texts in many more languages than is available now. I trust that this article has stimulated useful thought to this end and will encourage others to take advantage of the tools available.


1 This article incorporates material presented at the Summer Linguistics Institute of the Linguistic Society of America, June, 1989, and the Linguistic Association of the Southwest (LASSO), October 1992, both in Tucson AZ. The comments and suggestions of the audience at both these presentations are gratefully acknowledged. I am also thankful for the assistance of my colleagues in the Summer Institute of Linguistics as they have tested this model on their own language data. Steve Marlett deserves special appreciation, for he has worked actively with me in developing the text model described here, and provided helpful comments on the paper. ('We' in this article refers especially to Marlett and myself, as well as a number of other colleagues including Doris Bartholomew, Burt Bascom, Margaret Daly, Ben Elson, and Chuck Speck.) Also, Steve Marlett and Becky Moser produced one of the first submissions to the archive, in Seri (Sonora, Mexico); the examples in this paper are based on their work. (Some modifications in the symbols used for transcription have been made in order to present the paper using standard fonts available to current HTML browsers.) I also wish to express appreciation for the encouragement and support of colleagues at the institutions that intend to sponsor and maintain the archive, especially Terry Langendoen and Jane Hill (U of AZ); Paulette Levy, Karen Dakin, and Veronica Vazquez (UNAM); and Zarina Estrada (UniSon).

2 Two such programs which I and my SIL colleagues in Mexico have used extensively are IT (Simons and Versaw 1987, reviewed in Society for the Study of the Indigenous Languages of the Americas 1988) and Shoebox (Wimbish 1990, Davis and Wimbish 1993). MS-DOS versions of these programs are available from the Academic Bookstore, Summer Institute of Linguistics, 7500 W. Camp Wisdom Rd., Dallas TX 75236, 972/708-7400. A Macintosh version of IT (Simons and Thomson 1988) is available from Linguist's Software, P.O. Box 580, Edmonds WA 98020-0580, 206/775-1130. A Windows/Macintosh version of Shoebox is available from SIL's Web site. A third software package that has recently been developed by SIL is LinguaLinks, which integrates text glossing with dictionary production and general description of a language. For typeset output of interlinear texts, we have used ITF (Kew and McConnel 1990), a package of TeX macros available from SIL in Dallas; these produce very high-quality output but also are very demanding in expertise and time. For current information on all these programs, go to the computing section on SIL's Web site at

3 The archive has been tentatively titled Archivo de Textos en Lenguas Indígenas de México (ATLIM). As of May 1997, a preliminary agreement has been signed calling for cooperative sponsorship by the University of Arizona, the Universidad de Sonora, and the Universidad Nacional Autónoma de México; negotiations for a final agreement are nearly complete. The contents of the archive will be almost entirely in electronic form distributed on the Internet. Plans call for SIL to be the principal contributor for the first few years; materials have begun to be prepared in about 45 languages, with about 10 languages in advanced stages of preparation. As part of its mission, the archive will encourage and promote the production and publication of glossed texts (using a rich model like the one described here) and other associated materials by offering training and editorial assistance to anyone who wants to prepare materials for submission to the archive. Those who are interested in doing so should contact Albert Bickford, SIL, P.O. Box 8987, Catalina AZ 85738-0987, USA. Phone: (520) 825-1229. Email:

4 The use of markers in this form is an artifact of the software that has been used to prepare the texts; I have retained them for this article for convenience in labeling the lines. Newer software, such as LinguaLinks, makes such markers unnecessary. LinguaLinks also is able to display a text in a variety of views, with different labeling schemes for the lines as well as choices for which lines to display and how they should be arranged visually.

5 See endnote 4.

6 The \cf field also can serve an important role in the lexical database that is developed as a by-product of glossing the texts, by providing a way of identifying the different word forms of a given lexeme. This can be exploited for further study of the lexical database itself.


Date created: 13-Mar-1997
Last modified: 4-Jun-1997

