A Practical Guide

Chapter 1

The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research

Gary F. Simons
Summer Institute of Linguistics

Online Appendix: Using Text Encoding




Using Text Encoding to Represent Linguistic Data

One means of representing linguistic data (along with the linguist's analysis of it) is to use a special markup language to encode the information in text files. SGML and XML are the most widely used markup languages, and the TEI is an example of a widely accepted encoding scheme developed specifically for linguistic and literary data. The following is a glossary containing these and other key terms related to text encoding. Basic definitions are supplemented with pointers to further information resources.


encoding

The manner in which information is represented in computer data files. Text encoding refers specifically to the way in which the structural (and even interpretative) information in a text is encoded. (See also character encoding.)


markup

Codes added to the stream of an encoded text to signal structure, formatting, or processing commands.

generalized markup

The discipline of using markup codes in a text to describe the function or purpose of the elements in the text, rather than their formatting.
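The contrast between describing function and describing formatting can be sketched in a few lines of Python. The sample sentence and the element names (mentioned, gloss) are invented for illustration and are not taken from any particular encoding scheme; the point is only that a program can retrieve cited forms and glosses separately when the markup names their function, but not when both are simply marked as italic.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same sentence marked up two ways.
# Formatting-oriented markup: both words are just "italic".
PRESENTATIONAL = "<p>The word <i>kaikai</i> means <i>food</i>.</p>"
# Generalized markup: each element type names the function of its content.
DESCRIPTIVE = (
    "<p>The word <mentioned>kaikai</mentioned> "
    "means <gloss>food</gloss>.</p>"
)


def texts_of(xml_source, element_type):
    """Return the text content of every element of the given type."""
    root = ET.fromstring(xml_source)
    return [el.text for el in root.iter(element_type)]


# With formatting markup, a cited form and a gloss are indistinguishable.
italic = texts_of(PRESENTATIONAL, "i")

# With generalized markup, a program can ask for exactly what it needs.
cited_forms = texts_of(DESCRIPTIVE, "mentioned")
glosses = texts_of(DESCRIPTIVE, "gloss")
```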

A seminal paper on the discipline of generalized markup is:

style sheet

A separate file that is used with a document containing generalized markup to declare how each generalized text element is to be formatted for display.
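As a rough sketch of the idea, a style sheet can be modeled as a table that maps each element type to a formatting rule. The dictionary entry, the element names, and the two rule tables below are hypothetical; real style sheet languages are far richer. The sketch shows only that one encoded document can be displayed in different ways without changing its markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry encoded with generalized markup.
DOCUMENT = (
    "<entry><headword>kaikai</headword>"
    "<pos>n</pos><gloss>food</gloss></entry>"
)

# Two hypothetical style sheets: one formatting rule per element type.
PRINT_STYLE = {"headword": "*{}*", "pos": "({})", "gloss": "'{}'"}
PLAIN_STYLE = {"headword": "{}:", "pos": "[{}]", "gloss": "{}"}


def render(xml_source, style_sheet):
    """Format each child element according to the rule for its type."""
    root = ET.fromstring(xml_source)
    parts = []
    for child in root:
        # Element types without a rule are passed through unformatted.
        template = style_sheet.get(child.tag, "{}")
        parts.append(template.format(child.text or ""))
    return " ".join(parts)


print_view = render(DOCUMENT, PRINT_STYLE)  # "*kaikai* (n) 'food'"
plain_view = render(DOCUMENT, PLAIN_STYLE)  # "kaikai: [n] food"
```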


SGML (for Standard Generalized Markup Language) is a method for generalized markup that has been adopted by ISO (the International Organization for Standardization) and has consequently gained widespread use in the world of computing. The most widely known application of SGML is HTML (the HyperText Markup Language).

XML (for Extensible Markup Language) is a simplified subset of SGML that has been developed by the W3C (World Wide Web Consortium) for information interchange on the Web.
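One practical consequence of XML's simplification is that every start tag must have a matching end tag and elements must nest properly, so a parser can check a document without consulting a DTD. The following sketch, which assumes Python's standard xml.etree.ElementTree module as the parser, tests whether a string is well-formed XML in this sense. The helper name is_well_formed is introduced here only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source):
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


closed = is_well_formed("<p>Some text.</p>")        # every tag is closed
unclosed = is_well_formed("<p>Some text.")          # end tag is missing
crossed = is_well_formed("<p><hi>text</p></hi>")    # elements overlap
```

Under these assumptions, only the first string is accepted; the other two are rejected by the parser.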

Some introductions to SGML:

Some key XML resources:

For pointers to virtually any resource related to SGML or XML, see The SGML/XML Web Page by Robin Cover. Some highlights:

Here is a glossary of some key SGML (and XML) terms:

DTD

Document Type Definition. The definition of the markup rules for an SGML document.

tag

A string of characters inserted into a text file to represent a markup code. In SGML, each text element is delimited by a start tag of the form <type> and an end tag of the form </type>.

element

In an SGML file, a single entity delimited by a start tag and an end tag. For instance, a paragraph element might be delimited by <p> and </p>.

attribute

In SGML, a qualifier within the opening tag for an element which specifies a value for some named property of that element. For instance, a chapter tag might use the attribute n to encode the chapter number, as in <chapter n="3">.
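Tags, elements, and attributes can be seen at work in a short sketch that parses the chapter example given above. The paragraph contents are invented, and Python's standard xml.etree.ElementTree module (an XML parser, not a full SGML parser) is assumed as the tool.

```python
import xml.etree.ElementTree as ET

# A chapter element with an attribute n, containing two paragraph elements.
# The paragraph text is invented for illustration.
SOURCE = (
    '<chapter n="3">'
    "<p>First paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</chapter>"
)

chapter = ET.fromstring(SOURCE)

# The element type is the name that appears in the start and end tags.
element_type = chapter.tag

# The attribute n holds the chapter number (as a string).
chapter_number = chapter.get("n")

# Each p element is delimited by <p> and </p>.
paragraphs = [p.text for p in chapter.findall("p")]
```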


TEI

Text Encoding Initiative. A joint effort of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics to develop SGML-based guidelines for encoding texts and their analysis.
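To suggest how a text and its analysis can be encoded together, the following sketch marks up a two-word sentence so that each word carries its lemma and word class. The element and attribute names (s, w, lemma, type) are modeled only loosely on TEI-style markup and should be treated as assumptions made for illustration; the fragment is not a validated TEI document, and the Guidelines remain the authority on the actual tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment: a sentence (s) made up of words (w),
# with the linguist's analysis recorded in attributes on each word.
FRAGMENT = """
<s n="1">
  <w lemma="dog" type="noun">dogs</w>
  <w lemma="bark" type="verb">barked</w>
</s>
""".strip()

sentence = ET.fromstring(FRAGMENT)

# The running text is recovered by reading the content of the w elements.
text = " ".join(w.text for w in sentence.iter("w"))

# The analysis is recovered by reading their attributes.
analysis = [(w.text, w.get("lemma"), w.get("type")) for w in sentence.iter("w")]
```

Because the text and the analysis live in the same file, a program can extract either one, or align the two, without any information being lost.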

See The TEI Home Page. Some highlights:

The TEI Guidelines are published in print, on CD-ROM, and online:

  • Sperberg-McQueen, C. M. and Lou Burnard. (1994) Guidelines for the encoding and interchange of machine-readable texts. Chicago and Oxford: Text Encoding Initiative.

Of special interest to linguists are:


This page is part of an online appendix for the book Using Computers in Linguistics: A Practical Guide, edited by John M. Lawler and Helen Aristar Dry (Routledge, 1998).

Last modified: January 8, 1999