One means of representing linguistic data (along with the linguist's analysis of it) is to use a special markup language to encode the information in text files. SGML and XML are the most widely used markup languages, and the TEI is an example of a widely accepted encoding scheme developed specifically for linguistic and literary data. The following is a glossary containing these and other key terms related to text encoding. Basic definitions are supplemented with pointers to further information resources.


encoding

The manner in which information is represented in computer data files. Text encoding refers specifically to the way in which the structural (and even interpretative) information in a text is encoded. (See also character encoding.)


markup

Codes added to the stream of an encoded text to signal structure, formatting, or processing commands.

generalized markup

The discipline of using markup codes in a text to describe the function or purpose of the elements in the text, rather than their formatting.

style sheet

A separate file that is used with a document containing generalized markup to declare how each generalized text element is to be formatted for display.
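
The division of labor between generalized markup and a style sheet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (`doc`, `title`, `p`), the `STYLE_SHEET` dictionary, and the `render` function are hypothetical and do not correspond to any real style sheet language. The document says what each element is; the separate mapping says how each element type should look.

```python
import xml.etree.ElementTree as ET

# Generalized markup: the tags name the function of each element, not its look.
DOCUMENT = "<doc><title>Text Encoding</title><p>Markup signals structure.</p></doc>"

# Hypothetical "style sheet": one display rule per element type.
STYLE_SHEET = {
    "title": lambda text: text.upper(),   # titles are shown in capitals
    "p": lambda text: "    " + text,      # paragraphs are indented
}

def render(xml_text, style_sheet):
    """Format each child element with the rule declared for its type."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root:
        # Element types without a rule are passed through unchanged.
        rule = style_sheet.get(element.tag, lambda text: text)
        lines.append(rule(element.text or ""))
    return "\n".join(lines)

print(render(DOCUMENT, STYLE_SHEET))
```

Swapping in a different `STYLE_SHEET` changes the display without touching the document itself, which is the point of keeping the two separate.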


SGML

SGML (for Standard Generalized Markup Language) is a method for generalized markup that has been adopted by ISO (the International Organization for Standardization) and has consequently gained widespread use in the world of computing. The most widely known application of SGML is HTML (the HyperText Markup Language).

XML

XML (for Extensible Markup Language) is a simplified subset of SGML that has been developed by the W3C (World Wide Web Consortium) for information interchange on the Web.
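
One of XML's simplifications is that every start tag must have a matching end tag, whereas SGML allows some tags to be omitted. The following minimal sketch uses Python's standard `xml.etree.ElementTree` parser to check this; the helper name `is_well_formed` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>Closed paragraph.</p>"))
print(is_well_formed("<p>Unclosed paragraph."))
```

The first string parses; the second is rejected because its start tag is never closed.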

DTD

Document Type Definition. The definition of the markup rules for an SGML document.

tag

A string of characters inserted into a text file to represent a markup code. In SGML, each text element is delimited by a start tag of the form <type> and an end tag of the form </type>.

element

In an SGML file, a single entity delimited by a start tag and an end tag. For instance, a paragraph element might be delimited by <p> and </p>.

attribute

In SGML, a qualifier within the opening tag for an element which specifies a value for some named property of that element. For instance, a chapter tag might use the attribute n to encode the chapter number, as in <chapter n="3">.
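
The relationship between tags, elements, and attributes can be seen by parsing a small sample. This is a sketch under stated assumptions: the sample text is invented, and it is parsed with Python's standard XML parser (`xml.etree.ElementTree`) rather than a full SGML parser, which works here because the sample is also well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one chapter element containing two paragraph elements.
SAMPLE = '<chapter n="3"><p>First paragraph.</p><p>Second paragraph.</p></chapter>'

chapter = ET.fromstring(SAMPLE)

element_type = chapter.tag                            # the type named in <chapter ...>
chapter_number = chapter.attrib["n"]                  # value of the attribute n
paragraphs = [p.text for p in chapter.findall("p")]   # text of each <p> element

print(element_type, chapter_number, paragraphs)
```

Note that the attribute value comes back as the string "3", not a number; interpreting it is left to the application.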


TEI

Text Encoding Initiative. A joint effort of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics to develop SGML-based guidelines for the encoding and analysis of texts.

