The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research
Online Appendix: Using Text Encoding
Using Text Encoding to Represent Linguistic Data
One means of representing linguistic data (along with the linguist's analysis of it) is to use a special markup language to encode the information in text files. SGML and XML are the most widely used markup languages, and the TEI is an example of a widely accepted encoding scheme developed specifically for linguistic and literary data. The following is a glossary containing these and other key terms related to text encoding. Basic definitions are supplemented with pointers to further information resources.
Encoding is the manner in which information is represented in computer data files. Text encoding refers specifically to the way in which the structural (and even interpretative) information in a text is encoded. (See also character encoding.)
Markup consists of codes added to the stream of an encoded text to signal structure, formatting, or processing commands.
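To make this definition concrete, the following minimal Python sketch treats a short encoded sentence as a stream of text plus codes and then separates the two. The element names (`sentence`, `word`) and the `pos` attribute are invented for illustration and are not taken from any particular encoding scheme.

```python
import xml.etree.ElementTree as ET

# A short encoded text: the angle-bracket codes are the markup,
# everything else is the text itself. Element and attribute names
# are hypothetical, chosen only for illustration.
encoded = (
    '<sentence>'
    '<word pos="det">the</word> '
    '<word pos="noun">dog</word> '
    '<word pos="verb">barks</word>'
    '</sentence>'
)

root = ET.fromstring(encoded)

# Recover the plain text by discarding the markup.
plain_text = "".join(root.itertext())

# Recover the structure that the markup signals.
structure = [(w.text, w.get("pos")) for w in root.iter("word")]

print(plain_text)
print(structure)
```

The same file thus yields both the running text and the analysis layered on top of it, which is the point of keeping the codes distinct from the content.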
A seminal paper on the discipline of generalized markup is:
SGML (for Standard Generalized Markup Language) is a method for generalized markup that has been adopted by ISO (the International Organization for Standardization) and has consequently gained widespread use in the world of computing. The most widely known application of SGML is HTML (the HyperText Markup Language).
XML (for Extensible Markup Language) is a simplified subset of SGML that has been developed by the W3C (World Wide Web Consortium) for information interchange on the Web.
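As a rough illustration of XML used for information interchange, the sketch below serializes a small lexical entry to XML and reads it back. It uses Python's standard `xml.etree.ElementTree` module; the entry and its field names (`headword`, `pos`, `gloss`) are hypothetical and do not follow any published scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexical entry; the field names are illustrative only.
entry = {"headword": "perro", "pos": "noun", "gloss": "dog"}


def entry_to_xml(data):
    """Serialize a flat dictionary as a small XML element."""
    elem = ET.Element("entry")
    for field, value in data.items():
        child = ET.SubElement(elem, field)
        child.text = value
    return ET.tostring(elem, encoding="unicode")


def xml_to_entry(xml_text):
    """Parse the XML text back into a dictionary."""
    elem = ET.fromstring(xml_text)
    return {child.tag: child.text for child in elem}


xml_text = entry_to_xml(entry)
restored = xml_to_entry(xml_text)

print(xml_text)
print(restored == entry)
```

Because the result is an ordinary text file with explicit structure, any program with an XML parser can read the entry without knowing how the originating software stored it.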
Some introductions to SGML:
Some key XML resources:
For pointers to virtually any resource related to SGML or XML, see The SGML/XML Web Page by Robin Cover.
Here is a glossary of some key SGML (and XML) terms:
TEI (the Text Encoding Initiative) is a joint effort of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics to develop SGML-based guidelines for encoding texts and their analysis.
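The following sketch suggests what encoding both a text and its analysis can look like in practice. It assumes the TEI convention of `s` for a sentence-like unit and `w` for a word; the attribute values are illustrative assumptions, and the fragment is deliberately simplified rather than a valid TEI document (it omits the header and the enclosing text structure).

```python
import xml.etree.ElementTree as ET

# A simplified fragment in the style of TEI analytic markup:
# <s> marks a sentence-like unit and <w> marks a word.
# Attribute values are illustrative assumptions, and the fragment
# omits the header and surrounding structure a real TEI document needs.
fragment = """
<s>
  <w type="pronoun" lemma="she">She</w>
  <w type="verb" lemma="sing">sings</w>
</s>
"""

sentence = ET.fromstring(fragment.strip())

# Collect each word form together with its analysis.
analysis = [
    (w.text, w.get("lemma"), w.get("type"))
    for w in sentence.findall("w")
]

for form, lemma, category in analysis:
    print(f"{form}\t{lemma}\t{category}")
```

Here the word forms are the text, while the lemma and category attributes carry the linguist's analysis alongside it.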
See The TEI Home Page.
Of special interest to linguists are:
This page is part of an online appendix for the book Using Computers in Linguistics: A Practical Guide, edited by John M. Lawler and Helen Aristar Dry (Routledge, 1998).
Last modified: January 8, 1999