Using Computers in Linguistics: A Practical Guide

Chapter 1

The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research

Gary F. Simons
Summer Institute of Linguistics

Online Appendix: Multilingual Computing



Multilingual Computing and the Problem of Character Encoding and Rendering

The first of the six requirements is:

The data are multilingual, so the computing environment must be able to keep track of what language each datum is in, and then display and process it accordingly.
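One way to picture this requirement is a record that carries its language alongside its text. The following is only a minimal sketch in Python: the `Datum` class, the `by_language` helper, and the two-letter language tags are hypothetical names chosen for illustration, not part of any system described in the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    """A piece of text paired with the language it is in (hypothetical type)."""
    text: str      # the linguistic datum itself
    language: str  # illustrative language tag, e.g. "en" or "fr"


def by_language(data):
    """Group texts by language tag so each group can be handled by its own rules."""
    groups = {}
    for datum in data:
        groups.setdefault(datum.language, []).append(datum.text)
    return groups


# The same spelling can belong to different languages, so the tag matters.
entry = [Datum("chat", "fr"), Datum("cat", "en"), Datum("chat", "en")]
groups = by_language(entry)
print(groups)  # {'fr': ['chat'], 'en': ['cat', 'chat']}
```

Because the tag travels with each datum, display and processing code can choose a font, keyboard, or sort order per language rather than per file.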

This appendix first gives pointers to some general resources on this topic, then gives resources relating to the fundamental problem of character encoding and rendering.

General resources

About multilingual computing in general:

Resources for software developers:

On developing a truly multilingual World Wide Web:

Character encoding and rendering

Fundamental to the problem of multilingual computing is the problem of character encoding and rendering. Below is a glossary of key terms discussed in this chapter of the book; basic definitions are supplemented with pointers to further information resources. (Adobe's Type Technology Forum has a glossary of about 30 terms; Apple Computer's Inside Macintosh: Text contains a richer glossary of over 300 terms.)

ASCII

American Standard Code for Information Interchange. A standard character set that maps character codes 0 through 127 onto control functions, punctuation marks, digits, upper-case letters, lower-case letters, and other symbols.

ASCII file

A data file that contains only character codes in the range 0 to 127 and in which all the codes are to be interpreted by their significance in the ASCII standard.
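The range test in this definition can be sketched in a few lines of Python. The function name `is_ascii_data` is hypothetical, and the sketch checks only the first half of the definition (that every code lies between 0 and 127); whether the codes are meant to be interpreted as ASCII cannot be read off the bytes themselves.

```python
def is_ascii_data(data: bytes) -> bool:
    """Return True if every byte is a character code in the range 0 to 127."""
    # Iterating over a bytes object yields integers from 0 to 255.
    return all(code <= 127 for code in data)


plain = b"Hello, world\n"
accented = "caf\u00e9".encode("latin-1")  # the final byte is 233, outside the ASCII range

print(is_ascii_data(plain))     # True
print(is_ascii_data(accented))  # False
```

To test a file on disk, its contents could be read in binary mode, for example with `open(path, "rb").read()`, and passed to the same function.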

base character

A character to which an overstriking diacritic is added.

character

The minimal unit of encoding for text files. A character typically corresponds to a single graphic sign of a writing system, like a letter of the alphabet or a punctuation mark.

character code

A numerical code in a data file which represents a particular character in text. For instance, in the ASCII standard, 65 represents upper-case A.
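The correspondence between characters and codes can be inspected directly. This small sketch uses Python's built-in `ord` and `chr`; the sample word is an arbitrary choice for illustration.

```python
code = ord("A")   # character to character code
char = chr(65)    # character code back to character

# Encoding a word as ASCII yields one code per character.
codes = list("Cat".encode("ascii"))

print(code)   # 65
print(char)   # A
print(codes)  # [67, 97, 116]
```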

character set

The full set of character codes used for encoding a particular language or writing system (also known as a coded character set).

Some sources that discuss concepts and terminology:

These sources describe the contents of particular character sets:

collating sequence

The sorting order for all the characters in a character set.
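A collating sequence need not follow the numeric order of the character codes. The sketch below, in Python, contrasts sorting by code with sorting by an explicit sequence; the ordering `aAbB...zZ` and the names `COLLATION`, `RANK`, and `collation_key` are assumptions made for illustration, not a standard for any language.

```python
import string

words = ["banana", "Zebra", "apple", "Apple"]

# Sorting by character code puts all upper-case letters (65-90)
# before all lower-case letters (97-122).
by_code = sorted(words)

# A hypothetical collating sequence that interleaves the cases: aAbB...zZ
COLLATION = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)
RANK = {ch: position for position, ch in enumerate(COLLATION)}


def collation_key(word):
    """Map each character to its position in the collating sequence.

    Characters missing from COLLATION raise KeyError in this sketch.
    """
    return [RANK[ch] for ch in word]


by_collation = sorted(words, key=collation_key)

print(by_code)       # ['Apple', 'Zebra', 'apple', 'banana']
print(by_collation)  # ['apple', 'Apple', 'banana', 'Zebra']
```

Since different languages order the same characters differently, a multilingual environment would need one such sequence per language rather than a single global one.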

composite character

A single character which is a composite of two or more other characters. For instance, à is a composite of a (the base character) and ` (a diacritic).
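The relationship between a composite character and its parts can be illustrated with Python's standard `unicodedata` module. This is a sketch under the assumption that Unicode normalization is an acceptable stand-in for the general idea; note that the diacritic Unicode pairs with a base character is the combining grave accent (U+0300), which is distinct from the free-standing ASCII grave accent character.

```python
import unicodedata

composite = "\u00e0"  # à stored as one precomposed character

# NFD normalization splits a composite into base character plus diacritic(s).
decomposed = unicodedata.normalize("NFD", composite)
parts = [unicodedata.name(ch) for ch in decomposed]

# NFC normalization puts the pieces back together.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(composite), len(decomposed))  # 1 2
print(parts)  # ['LATIN SMALL LETTER A', 'COMBINING GRAVE ACCENT']
print(recomposed == composite)  # True
```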

diacritic

A small mark (such as an accent mark) added above, below, before, or after a base character to modify its pronunciation or significance.

encoding

The manner in which information is represented in computer data files. Character encoding refers specifically to the codes used to represent characters. (See also text encoding.)

font

A collection of bitmaps or outlines which supply the graphic rendering of every character in a character set.

font system

A subcomponent of an operating system which gives all programs and data files access to multiple fonts for rendering characters.

For instance,

rendering

The process of converting a stream of encoded characters (that is, character codes) to their correct graphic appearance on a terminal or printer.

The seminal work on encoding versus rendering is:

  • Becker, Joseph D. (1984) ‘Multilingual word processing.’ Scientific American, 251(1):96-107.
  • See also An operational model for characters and glyphs, an ISO/IEC technical report that seeks to develop a formal framework for the process of rendering coded characters as glyph representations.

special character

A character that is not available in one of the character sets already supported on a computer system.

Unicode

A character set which attempts to include every character from all the major writing systems of the world. As described here, it uses two bytes (16 bits) to encode each character. In version 2.0, the version current when this chapter was written, the Unicode standard contains 38,885 distinct coded characters from 25 scripts (including the International Phonetic Alphabet). Later versions of the standard extend the code space beyond 16 bits.
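The two-byte representation can be observed for individual characters. The sketch below uses Python's `utf-16-be` codec, which is an assumption of this example rather than something the chapter prescribes; the three sample characters (a plain letter, an accented letter, and an IPA symbol) all fall within the 16-bit range.

```python
# a, à, and ɑ (LATIN SMALL LETTER ALPHA, used in the IPA)
text = "a\u00e0\u0251"

# Encode each character separately as big-endian 16-bit units.
encoded = [(f"U+{ord(ch):04X}", ch.encode("utf-16-be")) for ch in text]

for label, data in encoded:
    print(label, len(data), data.hex())
# U+0061 2 0061
# U+00E0 2 00e0
# U+0251 2 0251
```

Each character occupies exactly two bytes, and the bytes are simply the character code written as a 16-bit number.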

See:

WorldScript

A subcomponent of the Macintosh operating system (version 7.1 and later) which gives programs access to script interface systems for multiple non-Roman writing systems.

Some relevant publications:

  • Appendix A: Built-in Script Support of Apple Computer's Inside Macintosh: Text describes the Roman script system, WorldScript I, and WorldScript II.
  • See also The Script Manager from Apple Computer's Inside Macintosh: Text.
  • Davis, Mark E. (1987) ‘The Macintosh script system.’ Newsletter for Asian and Middle Eastern Languages on Computer, 2(1&2):9-24.



This page is part of an online appendix for the book Using Computers in Linguistics: A Practical Guide, edited by John M. Lawler and Helen Aristar Dry (Routledge, 1998).

Last modified: January 7, 2000