A Computing Environment for Linguistic, Literary, and Anthropological Research

Technical Overview

by Gary Simons
Summer Institute of Linguistics
July 1988

Linguistic, literary, and anthropological research are essentially centered around textual data. Whether we are talking about analyzing the vocabulary and grammar of an undocumented language, discovering the stylistic traits of a particular author or genre, collating a set of manuscripts to produce a critical edition, consulting multiple sources to explain the meaning of a passage, or seeking to explain the world view of a people, there is a common thread which runs throughout. All of these activities depend on controlling large amounts of textual data which are interrelated in complex ways. The data involved are not only texts as primary data, but also secondary texts such as the investigator's observations about what is being studied. It is clear that computers should be a key tool in helping researchers to achieve control over these data. However, results in this respect have been disappointing thus far. The commercial marketplace has focused on the number-crunching needs of scientific computing and on the word-processing/financial/customer-database needs of business computing. The linguist, the literary scholar, and the anthropologist have been left to make do with tools ill-suited to cope with the complexities of their data or of their research task.

In this document, I propose the conceptual architecture of a computing environment designed[1] to meet the particular needs of linguists, literary scholars, and anthropologists. In short:

We need to process textual information which is

with a database manager that is

in a user environment which is

Note that the focus here is not on the particular application programs that linguists or literary scholars or anthropologists need to use, but on the computing environment which must be built in order to support those applications. This document discusses in turn each of the nine design features just listed and describes how CELLAR will support those features.

1. Multilingual

Every instance of textual information entered into a computer is expressed in a particular language. It may in turn contain components expressed in another language, which may in turn contain parts in another language, and so on (such as when an English essay quotes a paragraph in German which discusses some Greek words). This is a fundamental, but inadequately supported, property of textual data.

The standard conception of the multilingual data problem is as a special characters problem. This approach considers the multilingualism problem solved if the computer has a way to render (both on the screen and on the printer) all the characters needed in addition to the built-in Roman character set. In the MS-DOS environment, this is achieved by designing a single extended character set that includes all needed characters.

The Apple Macintosh environment has made a significant advance beyond this (one which Microsoft Windows and the forthcoming Presentation Manager seek to emulate): it conceives of the multilingual data problem as one of multiple fonts (Apple 1985). Data in different languages are represented in different fonts. This means that the total character inventory is no longer limited by the number of possible character codes (which is nowhere near enough for general-purpose multilingual computing). However, it also means that font switches in text are, for our purposes, ambiguous. They sometimes signal language switches; other times they signal type style switches within the same language.

Last year Apple introduced a further development which takes things even closer to the ideal. It is called the Script Manager (Apple 1987, Davis 1987). Following on the work of Joseph Becker at Xerox (1984, 1987) it distinguishes character encoding in computer storage from keyboarding and rendering on screen or printer. Each font is linked to a Script Interface System which defines how the script being stored is to be keyboarded and rendered. Unfortunately, defining a new script for the system is a matter of system-level programming.

Keyboarding and rendering are not the only language-dependent properties of textual data. The comparison operator which determines how two strings are related with respect to alphabetical order is language dependent. So are functions to find word boundaries or possible hyphenation points. Then there are language-specific conventions for formatting times, dates, and numbers. Most of these are handled in the Macintosh environment by the International Utilities package.

We propose to build on this conceptual foundation to create a language interface system (see figure 1) that will go all the way in supporting multilingual data processing. Our design is unique in proposing a language-centered approach to multilingual data. We believe that just as every piece of textual information is expressed in a particular language, every piece of textual information stored in a computer should be identified as to the language it is in. This is illustrated in the data portion of figure 1. Note the nesting of a Greek word within an English text shown there.

Figure 1 Language descriptions as system resources

[figure missing]

It is the language a particular datum is in (and not the font) which governs how it is entered on the keyboard and how it is to be interpreted for data processing, such as for sorting or for hyphenation. Font information should not be stored in the text at all, but should be generated only as needed for the sake of graphic rendering by means of "style sheets" (see below) applied to descriptive markup (Coombs, Renear, and DeRose 1987).

We propose to develop a definition of what it means for data to be in a language and to implement the relevant characteristics of a language as a declaration of its coding scheme, keyboarding conventions, rendering conventions, sorting conventions, hyphenation conventions, and more. Such a declaration we call a language interface definition. Figure 1 shows these language interface definitions as system resources that are available to every application program in a uniform way. When handed a datum, the language interface system determines the appropriate language-specific interface behavior. The language interface definition is to be expressed in a specially-designed high-level language (implemented as an interactive graphic interface) which the user can manipulate without being a programmer. CELLAR will include a compiler that translates the user's high level description of a language interface into the routines that would implement it.
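The idea of a language interface definition as a shared system resource can be illustrated with a minimal sketch. It is not the CELLAR design: the names `LanguageInterface`, `Datum`, `REGISTRY`, `alphabet_key`, and `sort_data` are hypothetical, and only one language-dependent behavior (alphabetical ordering) is declared. Each datum carries a language tag, and that tag, not a font, selects the behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LanguageInterface:
    """Hypothetical declaration of one language's conventions (sorting only)."""
    code: str
    sort_key: Callable[[str], tuple]


@dataclass(frozen=True)
class Datum:
    """A piece of text identified as to the language it is in."""
    text: str
    language: str


def alphabet_key(alphabet: List[str]) -> Callable[[str], tuple]:
    """Build a sort key from an ordered alphabet; letters may be digraphs."""
    longest_first = sorted(alphabet, key=len, reverse=True)
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word: str) -> tuple:
        lower, out, i = word.lower(), [], 0
        while i < len(lower):
            for letter in longest_first:
                if lower.startswith(letter, i):
                    out.append(rank[letter])
                    i += len(letter)
                    break
            else:
                # Characters outside the declared alphabet sort after it.
                out.append(len(rank) + ord(lower[i]))
                i += 1
        return tuple(out)

    return key


ENGLISH = list("abcdefghijklmnopqrstuvwxyz")
# Traditional Spanish ordering treats "ch" and "ll" as single letters.
SPANISH = list("abc") + ["ch"] + list("defghijkl") + ["ll"] + list("mnñopqrstuvwxyz")

# Language descriptions held in one place, available to every application.
REGISTRY: Dict[str, LanguageInterface] = {
    "en": LanguageInterface("en", alphabet_key(ENGLISH)),
    "es": LanguageInterface("es", alphabet_key(SPANISH)),
}


def sort_data(data: List[Datum]) -> List[Datum]:
    """Sort data of a single language by that language's own convention."""
    languages = {d.language for d in data}
    if len(languages) != 1:
        raise ValueError("all data must be in the same language")
    interface = REGISTRY[languages.pop()]
    return sorted(data, key=lambda d: interface.sort_key(d.text))


words = ["cuna", "chico", "dado", "llama", "luz"]
as_english = [d.text for d in sort_data([Datum(w, "en") for w in words])]
as_spanish = [d.text for d in sort_data([Datum(w, "es") for w in words])]
```

The same five strings come out in a different order depending only on the language tag. A fuller definition would add keyboarding, rendering, and hyphenation conventions to the same declaration.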

2. Structured

The textual information we deal with as linguists, literary scholars, and anthropologists is highly structured. A typical book is organized into front matter, a body, and back matter. The body might consist of chapters and sections and subsections. A text as analyzed by a linguist might consist of episodes, paragraphs, sentences, clauses, phrases, and words. Biblical texts are conventionally organized as books, chapters, and verses. A lexicon is composed of entries which in turn have subentries for different senses which in turn contain definitions and examples and so on. (Grimes (to appear) describes the full complexities of lexicon structure.) An outline of knowledge, such as the Lingua Descriptive Studies Questionnaire (Comrie and Smith 1977) or the Outline of Cultural Materials (Murdock 1961), is typically organized in terms of major divisions which are divided into minor divisions which may in turn be divided into even lower level divisions.

All of these examples have two things in common: the texts are made up of structured content elements, and these elements are hierarchically included within each other. In general, we may say that the textual information with which we work is comprised of hierarchies of structured content elements. The second part of figure 3 provides an illustration. (There is a slight complication which CELLAR will be able to account for. We must often maintain multiple hierarchical views of the same text. For instance, the Biblical scholar must view a text not only in terms of chapters and verses, but also in terms of pericopes, paragraphs, and sentences. See Barnard and others (1988) for a discussion of this problem and a possible solution.)

Currently popular software for personal computers falls far short of supporting this view of text. Word processors have a flat view of text, treating it as a sequence of characters. Advanced word processors support an explicit paragraph level, and some even support a higher level element called something like a division, but they do not support a user-defined hierarchy of m kinds of text elements occurring in n levels. This notion of levels of structured content elements is something very different from the notion of levels in an outline processor. There are two key differences. (1) In an outline processor, there are no element types as distinct from levels. For instance, a level one element is made up of level two elements, whereas in the kind of structuring we are advocating, a book element is made up of a front matter element, a body element, and a back matter element. (2) The hierarchical structure in an outline processor stops with the lowest level headings which are at least at a paragraph level. The kind of structuring we are advocating could go all the way down to the word, or to the phonemes that make up the words, or even to the phonological features assigned to the phonemes.

The conceptual basis for supporting texts as hierarchies of structured content elements has already been worked out in a recently adopted international standard called SGML (ISO 1986). SGML, for Standard Generalized Markup Language, is a standard for representing and interchanging highly structured data. Generalized markup is the notion of marking up a text by identifying its structure rather than the way it is to appear when printed (Coombs, Renear, and DeRose 1987). For instance, one would put a markup tag in the text to say, "The following is a section title," rather than putting typesetting codes to say, "The following should be 12 points, bold, Helvetica type and it should be centered." Each different type of content element gets marked by a different code. In the generalized markup approach, details of typesetting are specified in a separate document which can be called a "style sheet" (Johnson and Beach 1988). The style sheet declares the formatting parameters which are to be attached to each type of content element when it is output for display.
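The division of labor between generalized markup and a style sheet can be shown in a few lines. This is only an illustrative sketch: the element type names, the two style sheets, and the `render` function are assumptions made for the example, and the bracketed output stands in for real typesetting.

```python
from typing import Dict, List, Tuple

# Descriptively marked-up text: each piece is tagged with what it IS,
# not with how it should look. No font information is stored here.
document: List[Tuple[str, str]] = [
    ("section-title", "Multilingual"),
    ("paragraph", "Every text is expressed in a particular language."),
]

# Two hypothetical style sheets for the same document.
PRINT_STYLE: Dict[str, dict] = {
    "section-title": {"font": "Helvetica", "size": 12, "bold": True, "align": "center"},
    "paragraph": {"font": "Times", "size": 10, "bold": False, "align": "left"},
}
DRAFT_STYLE: Dict[str, dict] = {
    "section-title": {"font": "Courier", "size": 10, "bold": True, "align": "left"},
    "paragraph": {"font": "Courier", "size": 10, "bold": False, "align": "left"},
}


def render(doc: List[Tuple[str, str]], style_sheet: Dict[str, dict]) -> List[str]:
    """Attach formatting parameters to each element only at output time."""
    lines = []
    for element_type, content in doc:
        style = style_sheet[element_type]
        weight = "bold" if style["bold"] else "plain"
        lines.append(
            f"[{style['font']} {style['size']}pt {weight} {style['align']}] {content}"
        )
    return lines


print_view = render(document, PRINT_STYLE)
draft_view = render(document, DRAFT_STYLE)
```

Changing one entry in a style sheet reformats every element of that type, and the stored text itself is never touched.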

The separation of content and structure from display formatting has many advantages. (1) It allows authors to defer formatting decisions. (2) It ensures that formatting of a given element type will be consistent throughout. (3) It makes it possible to change formats globally by changing only a single description in the style sheet. (4) It allows the same document to be formatted in a number of different styles for different publishers or purposes. (5) It makes documents portable between systems. (6) Perhaps most important of all for our purposes, it makes possible computerized analysis and retrieval based on structural information in the text.

SGML has the backing of the Association of American Publishers (AAP 1986). A simplified form is taught in the recent University of Chicago Press Guide to Preparing Electronic Manuscripts (Chicago 1987). The impact is being felt in the upper levels of the publishing industry; it is only just starting to trickle down to the personal computer level. New SGML-based products for authoring and editing are beginning to appear (but at much higher prices than popular word processors).

A particularly important feature of SGML, which these new products support, is that every document begins with a prologue called a Document Type Definition. For instance, the DTD for a journal article describes what it means to be a well-structured journal-article document. It does this by declaring each of the content element types that can occur in a document of that type, and then defining the allowed structure of each in terms of what element types it can contain in which order and in what combinations. (Note that this is very different from a style sheet; it describes the allowable structure of a document, not how it should be formatted for display.) Thus, these new SGML-driven editors are structured editors which validate the structure of the text being created to ensure that it conforms to the Document Type Definition. These structured editors can also help the user by showing what element types are expected or possible in a given spot.

This new generation of text editors and formatters that are based on SGML certainly shows promise, but these programs will not meet our needs since the developers of those systems are not concerned with also providing a satisfactory solution to the multilingualism problem or the other issues listed below. However, the concepts of generalized markup have a central place in our proposal, and SGML will play an indispensable role as the standard for file interchange. As we develop our implementation of this aspect of CELLAR, the growing literature from computer science on the topic of structured editors will be very instructive (Shaw 1980; Fraser and Lopez 1981; Biggerstaff and others 1984; Caplinger 1985; Hood 1985; Furuta 1986, 1987; Kimura 1986).

3. Multidimensional

A conventional text editor program views text as a one-dimensional sequence of characters. An SGML-based editor adds a second dimension -- namely, the hierarchical structure of the text. But from the perspective of a linguist or literary scholar, text has many more dimensions than that (Simons 1987). The stream of speech which we represent as a one-dimensional sequence of characters has form and meaning in many simultaneous dimensions. The speech signal itself simultaneously comprises articulatory segments, pitch, timing, and intensity. A given stretch of speech can be simultaneously viewed in terms of its phonetic interpretation, its phonemic interpretation, its morphophonemic interpretation, its morphemic interpretation, or its lexemic interpretation. We may view its structure from a phonological perspective in terms of syllables, stress groups, and pause groups, or from a grammatical perspective in terms of morphemes, words, phrases, clauses, sentences, paragraphs, and even higher level units. The meaning of the text also has many dimensions and levels. There is the phonological meaning of devices like alliteration and rhyme. There is the lexical meaning of the morphemes, and of compounds and idioms which they form. There is the functional meaning carried by the constituents of a grammatical construction. In looking at the meaning of a whole utterance, there is the literal meaning versus the figurative, the denotative versus the connotative, the explicit versus the implicit. All of these dimensions, and more, lurk behind that one-dimensional sequence of characters which we have traditionally thought of as text.

There is an existing program which has this multidimensional view of text, namely, the interlinear text processing system called IT (Simons and Versaw 1988). In this system, the user defines the dimensions of annotation that are desired. The resulting template defines what is called a frame in AI programming. The program then steps through the analysis of the text, ensuring that for each unit of the text the frame is filled in with the dimensions of annotation specified.

IT has proven itself to be the best tool currently available for developing an analyzed corpus of text, but it falls far short of the ideal. It does not support an adequate model of multilingual information in the text and its annotations. It does not support an adequate model of the hierarchical structure of texts; it supports only one level of structure (called the unit) above the word level. It does not support an adequate model of data integration (see next point); it integrates the text corpus with a simplistic lexical database, but not with a true lexicon or with a grammar or an ethnography.

We propose to synthesize the fundamental insights of SGML and IT in order to support a user-defined hierarchical structure of text elements, in which each text element has a user-defined frame structure of multidimensional annotations.
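A small sketch may make this synthesis concrete. It assumes a hypothetical `TextElement` class in which every element has a type, a frame of named annotations, and constituent parts; the sample sentence and its annotation dimensions are invented for illustration and are not taken from IT or from the CELLAR design.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class TextElement:
    """A structured content element with a frame of annotations."""
    kind: str                                        # element type, e.g. "word"
    frame: Dict[str, str] = field(default_factory=dict)   # annotation dimensions
    parts: List["TextElement"] = field(default_factory=list)  # hierarchy

    def descendants(self, kind: str) -> Iterator["TextElement"]:
        """Yield all elements of the given type below this one, in text order."""
        for part in self.parts:
            if part.kind == kind:
                yield part
            yield from part.descendants(kind)


def tier(element: TextElement, kind: str, dimension: str) -> List[str]:
    """Read one annotation dimension across all elements of one type."""
    return [e.frame.get(dimension, "") for e in element.descendants(kind)]


sentence = TextElement(
    "sentence",
    {"free_translation": "The dogs barked."},
    [
        TextElement("word", {"form": "dogs", "category": "noun"}, [
            TextElement("morpheme", {"form": "dog", "gloss": "dog"}),
            TextElement("morpheme", {"form": "-s", "gloss": "PL"}),
        ]),
        TextElement("word", {"form": "barked", "category": "verb"}, [
            TextElement("morpheme", {"form": "bark", "gloss": "bark"}),
            TextElement("morpheme", {"form": "-ed", "gloss": "PAST"}),
        ]),
    ],
)

gloss_line = tier(sentence, "morpheme", "gloss")
category_line = tier(sentence, "word", "category")
```

Each level of the hierarchy has its own frame, so an interlinear display is just one of many possible readings of the structure: pick an element type and a dimension, and a line of annotation falls out.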

4. Integrated

The notion of hierarchies of frame-like text elements is still inadequate. It implies independence of text elements which happen to occur in separate hierarchies, or in different parts of the same hierarchy. But for the textual database on which linguistic, literary, or anthropological research is based, this is not so. Crosscutting the basic hierarchical organization of the elements in the database is a complex network of associations between them.

For instance, the words that occur in a text are comprised of morphemes. Those morphemes are defined and described in the lexicon. The relationship between the surface word form and the string of lexical morphemes which underlies it is described in the morphophonology. When a morpheme in an analyzed text is glossed to convey its sense of meaning, that gloss is really an attribute of one of the senses of meaning listed in the lexicon entry for that morpheme. The part of speech code for that use of the morpheme is another attribute of that same lexical subentry. The part of speech code itself is not ultimately the "property" of the lexicon. It is the grammar which enumerates and defines the possible parts of speech, and the use of a part of speech code in the lexicon is really a pointer to its description in the grammar. The examples which are given in the lexicon or the grammar relate back to the text from which they were taken. Cultural terms which are defined in the lexicon or cultural activities which are exemplified in the texts relate to the full analysis and description of such things in the ethnography. David Weber (1986) has discussed this network-like nature of the linguistic database in his description of a futuristic style of computer-based reference grammar.

Figure 2 illustrates some of the associative links that would be found in a database set up for linguistic field work. (Note that each link drawn in the diagram represents a kind of association of which there would be thousands of instances in a full database.) The text corpus serves as the source of examples for the ethnography, lexicon, and grammar. The flip side of this is that links from the texts to the ethnography and grammar are hypotheses about the analysis of things going on in the texts. These hypotheses as they relate to the lexicon have to do with the identification of categories and senses of meaning (by means of glosses) of forms occurring in the texts. The ethnography links to the lexicon for definitions of vernacular terms used in ethnographic description; the lexicon links to the ethnography for encyclopedic knowledge about words with cultural significance. The lexicon links to the grammar for the definition of grammatical categories and subcategories; the grammar links back to the lexicon for lists of the members of these categories and subcategories. The ethnography, lexicon, and grammar are full of cross-reference links between items within themselves.

Figure 2 Some associative links in a linguistic database

[figure missing]

This network of associations is part of the inherent knowledge structure of the phenomena we study. To maximize the usefulness of computing in our research, our computational model of the data must match this inherent structure. Being linked directly to relevant information elsewhere in the database has an obvious benefit in efficiency of retrieval. An even more fundamental benefit has to do with integrity of the data and quality of resulting work. Because the knowledge structures we deal with in research are networks of relationships, we can never make a hypothesis in one part of the database without affecting other hypotheses elsewhere in the database. (I am using hypothesis to cover any kind of guess from a simple one like the meaning of a single morpheme to a complex one like the strategy for identifying the referent of a pronoun.) Having the related information linked together makes it possible to immediately check the impact of a change in the database.

The addition of associative links to the data structure makes it possible to achieve the virtue of normalization, a concept which is well-known in relational database theory (Date 1981:237ff., Smith 1985). In a fully normalized database, a given piece of information exists only once in the database. Any use of that information is by reference to that single instance rather than by having a copy of it. For instance, the spelling of a part of speech code would ideally exist only once in a linguistic database (as an attribute of a part-of-speech entry in the grammar). Rather than repeating that code in the lexical entry for every morpheme with that part of speech, the lexical entries would point to the single piece of information in the grammatical part of the database. When the analyst decides to change the spelling of the abbreviation, all references are simultaneously updated, thus avoiding the ubiquitous database problem known as "update anomaly".
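The part of speech example can be sketched directly. The classes `PartOfSpeech` and `LexicalEntry` below are hypothetical; the point is only that the lexical entries hold a link to the single grammar entry rather than a copy of its abbreviation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartOfSpeech:
    """An entry in the grammar; the abbreviation is stored only here."""
    abbreviation: str
    description: str


@dataclass
class LexicalEntry:
    lemma: str
    part_of_speech: PartOfSpeech  # an associative link, not a copied string

    def citation(self) -> str:
        return f"{self.lemma} ({self.part_of_speech.abbreviation})"


noun = PartOfSpeech("n", "noun")
lexicon: List[LexicalEntry] = [
    LexicalEntry("dog", noun),
    LexicalEntry("house", noun),
]

before = [entry.citation() for entry in lexicon]

# The analyst changes the spelling of the abbreviation in one place.
noun.abbreviation = "N"

after = [entry.citation() for entry in lexicon]
```

Because no entry holds its own copy of the code, there is nothing to fall out of step when the abbreviation changes.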

The kind of associative links in the database which achieve the integration we are talking about are the links of hypertext (Conklin 1987). Indeed, "hypertext" has become a buzz word of late and the commercial software houses are scrambling to develop so-called hypertext products. However, nothing we have seen comes close to meeting the needs of the linguist, literary scholar, or anthropologist. The hypertext systems coming out provide for arbitrary associative links from one point in a text file to another. Some, like HyperCard, can provide the frame concept in which the database can be comprised of records (with fields of information) which are associatively linked. But we know of no commercial product that merges the concepts of hierarchical structure, multidimensional frames, and associative links. Nor do we know of anything being developed in research labs; one of our project members (Steve DeRose) attended the recent state-of-the-art conference on hypertext sponsored by the Association for Computing Machinery, and none of the papers there hinted at this line of development. Add to this the need for a computing environment which can adequately cope with the multilingual nature of our information, and the gulf between what is now available and what we are proposing to build becomes even greater.

The synthesis of these four characteristics -- multilingual, structured, multidimensional, and integrated -- in a single data management environment is not just a luxury for linguistic, literary, and anthropological research. It is a necessity, for these four concepts epitomize the nature of the information we are seeking to process.

In the remaining five points we turn our focus to desirable properties of the computing environment that have more to do with the nature of computing than the nature of the data.

5. Seamless

The data should be accessible from within a seamless environment. Current systems involve two kinds of seams: the seam between on-line data and off-line data, and the seam between files and directories.

On floppy-disk-based microcomputers the seam between on-line data and off-line data is always present. In fact, at any moment, the vast majority of one's data is off-line. In such an environment, the kind of data integration discussed above is impossible to achieve. Currently available hard-disk systems are already doing away with the on-line/off-line seam. In fact, current high-capacity hard-disks are probably sufficient to handle the data management of the typical field project in which all of the information is to be collected for the first time.

For research involving existing text corpora and extensive reference libraries, such as work in the field of Biblical studies, a conventional hard disk is too small. The growing technology of CD-ROM (a read-only optical storage device based on the Compact Disk technology used widely today for audio recording) holds promise as a low-cost means of putting a library of over 500M (approximately 200,000 pages) of text on-line. But advances in higher density magnetic media and in writable optical media continue relentlessly (Fujitani 1984, Hendley 1987). We propose to aim our development at the hard-disk environment for user-developed information and at the CD-ROM for prepackaged reference libraries, with confidence that as these media become too small for our purposes, the cutting-edge technologies will always be ahead of our capacity to fill them.

The second seam, that between files and directories, is a bit more subtle. Current computing environments store textual data in files. We commonly use text editors to find things, add things, delete things, and change things in these files. On top of the file system is built a hierarchical directory system which allows us to maintain order within the collection of potentially thousands of data files. We use operating system commands to find things, add things, delete things, and change things in directories. One inconvenience this produces is the necessity to learn different ways of performing what are generically the same operations. A more fundamental problem, however, is that the seam between directories and files is a boundary which search and match operations do not typically cross. This is a potentially serious shortcoming for analytical work.

Given the hierarchical view of textual data espoused above, the seam becomes even more artificial and troublesome. How does one decide which text element should correspond to the file, thereby relegating higher level elements to the realm of directories and lower level ones to the realm of markup within files? The right answer is obvious: there should be no distinction. The SGML-like view of a text as a hierarchy of structured content elements is easily extended to view a library of texts as having a higher level hierarchical structure. This approach obviates the distinction between directories and files. The total environment is unified under a single command structure and search space. Figure 3 illustrates the difference between the conventional approach with hierarchical directories (manipulated by operating system commands) and files (manipulated by text editors), and the proposed CELLAR approach with a single seamless hierarchical data environment (manipulated by a single user interface system).

Figure 3 The seamless approach

[figure missing]

The resulting environment is more like a text editor than an operating system; however, at the user's demand, it must be able to provide a view of text elements at any level as atoms, much like a directory system presents files and subdirectories as atomic elements. There is a technology which already does this in the text editing environment, namely, outline processing. What is called for is a hierarchical system in which all information ultimately stems from a single root, and which supports an "outline view" that can hide levels of detail which are temporarily irrelevant.
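The following sketch illustrates the two properties just described: a single search space that runs from the root of the library down to the smallest text element, and an outline view that presents elements below a chosen depth as atoms. The `Node` class, the `search` and `outline` functions, and the tiny sample library are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One element in a single hierarchy; there is no file/directory split."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def search(node: Node, term: str, path: Tuple[str, ...] = ()) -> Iterator[str]:
    """Find a term anywhere below a node; no boundary stops the search."""
    here = path + (node.label,)
    if term in node.text:
        yield "/".join(here)
    for child in node.children:
        yield from search(child, term, here)


def outline(node: Node, max_depth: int, depth: int = 0) -> List[str]:
    """Show the hierarchy down to max_depth, hiding deeper levels of detail."""
    lines = ["  " * depth + node.label]
    if depth < max_depth:
        for child in node.children:
            lines.extend(outline(child, max_depth, depth + 1))
    return lines


library = Node("library", children=[
    Node("Genesis", children=[
        Node("chapter 1", children=[
            Node("verse 1", "In the beginning God created the heavens and the earth."),
        ]),
    ]),
    Node("John", children=[
        Node("chapter 1", children=[
            Node("verse 1", "In the beginning was the Word."),
        ]),
    ]),
])

hits = list(search(library, "beginning"))
top_level = outline(library, max_depth=1)
```

One function serves for what would otherwise be a directory search followed by a search inside each file, and the same tree supplies both the "directory listing" and the text itself.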

6. Self-validating

In the above discussion of hierarchical structure, multidimensionality, and integration, I have already sown the seeds for the discussion of self-validation. Some examples have already been given which imply the concept. Here we tie these threads together to develop the point explicitly.

For textual information to be valid, it must have a valid hierarchical structure. For instance, if chapters are supposed to be made up of sections (and not vice versa), then a text having a chapter inside a section would not be valid. The SGML Document Type Definition discussed above is an example of an existing means for declaring what the valid structure of a document is. An SGML-based editor reads that definition and automatically validates any attempt to add to, delete from, or change the structure of the document to ensure that the result is always a valid document.

For textual information to be valid, its elements must have a valid frame structure for their annotations. For instance, if texts are supposed to be annotated by a genre attribute while the words in an analyzed text are supposed to be annotated by a grammatical category attribute, then a text annotated for grammatical category or a word annotated for genre would not be valid. The IT system described above has a simple form of validation like this. The user specifies an interlinear text model which declares to the system what the frame structure of the units of the analyzed text is to be. The interlinear text processor then enforces that model by ensuring that the output always conforms to the user-declared structure. It is impossible to produce an output which is structurally invalid.

For textual information to be valid, the associative links in it must be valid. For instance, if the "part of speech" attribute of a subentry in the lexicon is supposed to be a link to a "part of speech" entry in the grammar, then a lexical subentry in which the part of speech attribute was linked to a text example would not be valid. The interlinear text model used in the IT system also involves a simple form of link validation. In the model, the values of a particular annotation are declared to be taken from one of the mapping relationships or range sets stored in the lexical database. Only values in that mapping or range set are allowed to occur in the final analyzed text.

This notion of validation is related to the traditional notion of a schema in database theory (Date 1981:390, 407ff.). The database schema names all of the record types in the database, the fields which each contains, the type constraints on the values of each field, integrity tests for values of certain fields, the dependency relationships between different parts of the database, and so on. The database management system is always at work to ensure that all proposed transactions on the database result in a valid database.

This is what we propose to build for CELLAR. In our case there will be a schematic definition for each text element type. That definition will declare the names of the attributes (that is, the frame slots) for that type, and a validation constraint for values of each attribute. A validation constraint is a predicate which must always be true of a value assigned to that attribute. Where the value is supposed to be a sequence of constituent content elements, the validation constraint would specify a regular expression pattern over the subordinate content element types. Where the value is supposed to be an associative link, the validation constraint would specify the type of the referred-to element and possibly a further constraint on its attribute values.
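A minimal sketch of such a schema follows. It is an illustration under stated assumptions rather than the CELLAR design: the `SCHEMA` table, the `Element` class, the element type names, and the convention of writing a content pattern as a regular expression over space-terminated child type names are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    kind: str
    children: List["Element"] = field(default_factory=list)
    links: Dict[str, "Element"] = field(default_factory=dict)


# For each element type: a regular expression over the sequence of child
# types (each name followed by one space), and the required type of the
# element each link attribute must refer to.
SCHEMA: Dict[str, dict] = {
    "chapter": {"content": r"(title )(section )+", "links": {}},
    "section": {"content": r"(title )(paragraph )*", "links": {}},
    "title": {"content": r"", "links": {}},
    "paragraph": {"content": r"", "links": {}},
    "subentry": {"content": r"", "links": {"part_of_speech": "pos-entry"}},
    "pos-entry": {"content": r"", "links": {}},
}


def validate(element: Element) -> List[str]:
    """Return every violation found in the element and its descendants."""
    rule = SCHEMA.get(element.kind)
    if rule is None:
        return [f"unknown element type: {element.kind}"]
    errors = []
    sequence = "".join(child.kind + " " for child in element.children)
    if re.fullmatch(rule["content"], sequence) is None:
        errors.append(f"{element.kind}: invalid content [{sequence.strip()}]")
    for attribute, target_kind in rule["links"].items():
        target = element.links.get(attribute)
        if target is None or target.kind != target_kind:
            errors.append(f"{element.kind}: {attribute} must link to a {target_kind}")
    for child in element.children:
        errors.extend(validate(child))
    return errors


good_chapter = Element("chapter", [
    Element("title"),
    Element("section", [Element("title"), Element("paragraph")]),
])
# A chapter placed inside a section violates the section's content pattern.
bad_section = Element("section", [Element("title"), good_chapter])
# A part of speech attribute linked to the wrong type of element.
bad_subentry = Element("subentry", links={"part_of_speech": Element("paragraph")})
good_subentry = Element("subentry", links={"part_of_speech": Element("pos-entry")})
```

A system built along these lines would run the check on every attempted change and refuse any edit for which the list of violations is not empty.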

The key notion behind self-validation is that once the schema for a content element is defined, the system should be able to automatically validate every attempt to create or alter an instance of that element type. The problem gets particularly complex when the schema itself is allowed to change, for a single change could invalidate all existing instances of a certain element type. Fortunately, computer scientists are studying this problem and solutions exist (Lafue and Smith 1986, Morgenstern 1986, Shepherd and Kerschberg 1986, Zdonik 1986, Marek 1987).

7. Knowledge-based

A new discipline, called knowledge engineering, has emerged during the past two decades. It deals with studying the knowledge that an expert in a particular field has so that it can be expressed explicitly in a form that is usable by others, including computer systems. A knowledge-based expert system is a computer program which is endowed with enough of an expert's knowledge that it can solve problems (in its limited problem domain) like that expert would. (See Duda and Gaschnig 1981 for a readable introduction to expert systems, and Hayes-Roth and others 1983 for details on how to build one.) This new knowledge processing paradigm provides a better perspective on what it is that linguists, literary scholars, and anthropologists want to do with computers than does the conventional data processing paradigm. Let me explain.

We first need to distinguish between two major categories of knowledge: general knowledge and specific (or instance) knowledge. A linguist, for instance, has general knowledge about how languages typically work and about how one goes about doing research on a language. A linguist's idea of what information belongs in a lexicon entry is an example of general knowledge. Specific instance knowledge is the knowledge the linguist garners about a particular language. All of the facts stored in the lexicon entry for a particular word are examples of instance knowledge.

The database we have been considering above, which is multilingual, structured, multidimensional, and integrated, is the repository of the instance knowledge that the researcher is working with. It turns out that the rich data structure we have developed, involving hierarchy, frames, and associative links, looks more like the approaches used in artificial intelligence for knowledge representation than it does a conventional database. This should be no surprise; ultimately the researcher's task boils down to gathering, organizing, applying, and eventually creating knowledge. CELLAR thus draws on notions from the well-developed field of knowledge representation (see, for instance, IEEE 1983 or Brachman and Levesque 1985) in parts of its basic design.

Therefore, our conventional notion of a database is replaced by the notion of a store of instance knowledge. An even deeper change is that our conventional notion of data processing programs is replaced by the notion of general knowledge. Consequently, the traditional distinction between data and programs is unified under the single concept of knowledge. We speak of the store of instance and general knowledge as being the knowledge base.

Knowledge engineers talk about three kinds of general knowledge: descriptive, procedural, and heuristic. We will consider each of these in turn.

Descriptive knowledge is the knowledge an expert has about the general characteristics of the objects in the subject domain. These characteristics include names for the kinds of objects, names for their key parts, and knowledge about how different kinds of objects are related to each other. The kinds of objects correspond roughly to the high frequency common nouns that would be found in a textbook on the subject matter (Abbott 1983). For instance, a linguist has general, descriptive knowledge about what a lexicon is like. That knowledge about lexicons includes knowledge about the kinds of objects that are typically contained within lexicons, such as entries, lemmas, pronunciations, etymologies, subentries, senses, definitions, glosses, grammatical categories, examples, antonyms, and so on. Descriptive knowledge amounts to a definition of what each of these objects consists of. For instance, a lexicon entry could be defined as consisting of a lemma, a pronunciation, an etymology, and a set of subentries. This definition should further describe what kind of object each of the components is. For instance, the lemma is a character string from the language of interest, the pronunciation is a character string in an alphabet of phonetic symbols, and so on.

It turns out that we have already discussed this kind of definition in the preceding section. The schemata which are the basis of the self-validation of the knowledge base are in fact the general descriptive knowledge of an expert. Figure 4 illustrates some simplified descriptive knowledge about lexicons. The syntax used in the sample is meant to be suggestive only; CELLAR's descriptive knowledge component has not been fully worked out. In the sample, upper-case names are the names of element types and of their attributes. These are user definable. Note that strings must be identified as to which language they are from. "Names" is a special-purpose language which defines constraints on names in the CELLAR system. Note that the first attribute defined is the language interface system's name for the language of which this is a lexicon. The second attribute gives the correct name of the language in the script of the language.

Figure 4 Some descriptive knowledge

Element LEXICON has
    LANGUAGE:  string from "Names"
    NAME:      string from (LANGUAGE of self)
    CONTENTS:  sequence of ENTRY

Element ENTRY has
    LEMMA:          string from (LANGUAGE of parent)
    PRONUNCIATION:  string from "IPA"
    SENSES:         sequence of SUBENTRY

Element SUBENTRY has
    PART-OF-SPEECH:  reference to one of PARTS-OF-SPEECH in GRAMMAR of (LANGUAGE of parent of parent)
    DEFINITION:      string from "English"
    GLOSS:           string from "English"
    EXAMPLES:        sequence of

and so on ...

Using techniques from knowledge engineering, it is possible to elicit from the expert a definition of all the basic objects of the subject domain. Each defined object becomes an element type of the knowledge base, and each definition is expressed in terms of a schema which declares what it means to be a valid instance of the object. The computer now knows (in part) what it means for something to be a lexicon, or a lemma, or a pronunciation, or a subentry, or whatever. When using CELLAR, for instance to formulate a query, the user can use the common nouns of the subject domain, and the computer has an idea of what they mean. In fact, if the user is not an expert in the subject domain, it is possible that the computer will have internalized a better definition of the term than has the user. Because the descriptive knowledge component of CELLAR validates every user attempt to update the knowledge base, the non-expert user always creates instance knowledge which conforms to the expert's general definition of what that knowledge should be like. The result should be much higher quality work than when the non-expert user is given nothing more than a blank notebook or a word processor and asked to produce results.

Writing the schematic definition for an element type is a key form of programming in the CELLAR system. Traditional programs have a procedural and imperative style. They say, "Do this, then do this, then do this." That kind of programming is entirely inappropriate in this case. Rather, the schemata (as we have defined them above) are declarative programs. A schema says, "I declare that this must always be true, and I don't care how you, computer, determine and enforce this." It is the underlying computing environment that handles all of the messy computational details of ensuring that the declaration remains true at all times. The expert's descriptive knowledge is programmed into CELLAR by encoding it in schemata for the different types of objects in the subject domain. This formalized descriptive knowledge could just as well be the basis of the introductory chapter in a college textbook on the subject domain.
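The declarative style can be illustrated with a minimal Python sketch, offered under stated assumptions rather than as CELLAR's implementation. The schemata below restate part of Figure 4 as data that only says what must hold; a single generic `set` method enforces it on every create or alter. The names `string_from`, `sequence_of`, `SCHEMATA`, `Element`, and `ValidationError` are hypothetical. Strings are modeled as (text, language) pairs, and context-dependent constraints such as "LANGUAGE of parent" are left out.

```python
class ValidationError(Exception):
    """Raised when an update would violate a schema declaration."""

def string_from(language):
    """Declare: the value is a (text, language) pair in the given language."""
    def check(value):
        return (isinstance(value, tuple) and len(value) == 2
                and isinstance(value[0], str) and value[1] == language)
    return check

def sequence_of(type_name):
    """Declare: the value is a list of elements of the given type."""
    def check(value):
        return isinstance(value, list) and all(
            isinstance(v, Element) and v.type_name == type_name for v in value)
    return check

# The declarative part: what must always be true, not how to enforce it.
SCHEMATA = {
    "ENTRY": {
        "PRONUNCIATION": string_from("IPA"),
        "SENSES": sequence_of("SUBENTRY"),
    },
    "SUBENTRY": {
        "DEFINITION": string_from("English"),
        "GLOSS": string_from("English"),
    },
}

class Element:
    """An instance whose every update is checked against its schema."""

    def __init__(self, type_name, **values):
        self.type_name = type_name
        self.values = {}
        for attribute, value in values.items():
            self.set(attribute, value)

    def set(self, attribute, value):
        schema = SCHEMATA[self.type_name]
        if attribute not in schema:
            raise ValidationError(
                f"{self.type_name} has no attribute {attribute}")
        if not schema[attribute](value):
            raise ValidationError(
                f"invalid value for {self.type_name}.{attribute}")
        # Only a value that passed its declared constraint is stored.
        self.values[attribute] = value

sense = Element("SUBENTRY", GLOSS=("house", "English"))
entry = Element("ENTRY", PRONUNCIATION=("haus", "IPA"), SENSES=[sense])
```

In this sketch, an attempt such as `entry.set("PRONUNCIATION", ("haus", "English"))` raises `ValidationError` and leaves the stored value unchanged. The author of the schema never writes the checking loop.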

The second kind of general knowledge is procedural knowledge. This is knowledge about how to do things with the instance knowledge in the knowledge base. Procedural knowledge is what results when one answers the question, "How could I transform this knowledge already in the knowledge base into a new bit of (derived) knowledge?" (Cases involving guessing or inferring are excluded here; they are dealt with under heuristic knowledge.) For instance, it is procedural knowledge that enables an analyst to determine if two items co-occur, to find all instances of a pattern of interest, to compile an inventory of all the elements of a given type, to compute frequencies of occurrence, to make a consistent change in the transcription of data, to extract simplified representations of data of interest out of the complex network of information, and so on.
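A few of the operations just listed can be sketched as small side-effect-free Python functions. This is an illustration only: the function names are hypothetical, the word list is invented, and each character of a word is assumed to stand for one phone.

```python
from collections import Counter

def inventory(items):
    """Compile a sorted inventory of the distinct items."""
    return sorted(set(items))

def frequencies(items):
    """Compute how often each item occurs."""
    return Counter(items)

def co_occur(a, b, units):
    """True if a and b appear together in at least one unit (e.g. a word)."""
    return any(a in unit and b in unit for unit in units)

# Invented sample data: three words, one character per phone.
words = ["pata", "tapa", "kata"]
phones = [phone for word in words for phone in word]

phone_inventory = inventory(phones)      # distinct phones, sorted
phone_counts = frequencies(phones)       # occurrences of each phone
p_with_t = co_occur("p", "t", words)     # do p and t share a word?
p_with_k = co_occur("p", "k", words)     # do p and k share a word?
```

Each function derives new knowledge from data already present and changes nothing in the data it is given.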

There is a long tradition of representing procedural knowledge in programs written in conventional procedural programming languages. The study of programming languages has shown, however, that procedural programming is fraught with many traps for the unwary. We propose, instead, to use the functional programming paradigm (Backus 1978, Henderson 1980). It has the advantage of a side-effectless semantics, which means that it would be easier for the system to reason about the effect and the correctness of programs. A particular benefit of this will be that the generation of user-defined programs can be encapsulated in a structured editor which uses not only syntactic but also semantic knowledge about the programming language to help the user verify the program as it is being written (Dybvig and Smith 1985, Lindstrom 1986, Bidoit and others 1984, Yemini and Berry 1987).

As far as the CELLAR application is concerned, the "programs" referred to in the preceding paragraph are more like the queries of a database language. Specifically, procedural knowledge in CELLAR will be expressed in a functional query language (modeled after Buneman, Frankel, and Nikhil 1982, and Gonnet and Tompa 1987). This knowledge will not only be encoded in stand-alone functions which the user can execute on given data objects, but it will also be the means by which complex constraints in the schematic knowledge of the descriptive component are expressed. Figure 5 gives a sample of what a procedure in CELLAR's query language might look like. The example is a procedure for finding the longest sentence (or sentences) in a text.

Figure 5 Some procedural knowledge

longest_sentence(Text) means
    select(sentences_in(Text), (S such_that size(S)==M))
    where M is max(apply(sentences_in(Text),size)).
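For comparison, the procedure in Figure 5 might be rendered in Python roughly as follows. This is a sketch under assumptions, not CELLAR's query language. `sentences_in` is a stand-in that simply splits on periods, `size` is assumed to count words, and the guard for an empty text is an addition that Figure 5 does not address.

```python
def sentences_in(text):
    """Hypothetical stand-in: treat each period-terminated span as a sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]

def size(sentence):
    """Assumed measure of size: the number of whitespace-separated words."""
    return len(sentence.split())

def longest_sentence(text):
    """Select every sentence whose size equals the maximum size in the text."""
    sentences = sentences_in(text)
    if not sentences:
        return []  # added guard: max() of an empty sequence is undefined
    m = max(map(size, sentences))                 # M is max(apply(..., size))
    return [s for s in sentences if size(s) == m]  # select(..., size(S)==M)

result = longest_sentence("I came. I saw it. We won it.")
```

Here `result` holds both three-word sentences, which is why the figure speaks of the longest sentence "or sentences".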

The third kind of general knowledge is heuristic knowledge. Heuristics (from the Greek word for 'discover') are the bits of experiential knowledge an expert applies in the process of discovering the solution to a problem. Heuristics are not generally well-defined procedures; they are often not even well-justified. Rather, they are based on hunches, or on analogies to previously encountered situations. This kind of human problem solving has traditionally fallen beyond the capabilities of algorithm-based computing. This has changed in the last decade, however. A new technology called knowledge-based expert systems has broken this barrier and ushered in what many are calling the "second wave" of the information revolution. Whereas the first wave automated data processing, the second wave is beginning to automate decision making (Linden 1988:61).

Heuristic knowledge is typically represented in expert systems by means of if-then rules. They say something like, "If pattern X is present, then you can (probably) conclude Y." Or, "If situation X arises, then test the hypothesis that Y is the case." The body of heuristic knowledge is stored in a database rather than being written into programs. An inference engine applies the body of heuristic knowledge to the details of the situation at hand in order to reach a conclusion. Figure 6 gives a sample of some of the heuristic knowledge involved in the linguistic problem of determining whether two phones reflect the same or different phonemes.

Figure 6 Some heuristic knowledge

IF   Phone1 contrasts-in-identical-environment-with Phone2,
THEN Phone1 is-different-phoneme-than Phone2.

IF   Phone1 contrasts-in-analogous-environment-with Phone2,
THEN Phone1 is-different-phoneme-than Phone2.

IF   Phone1 occurs-in Word1
 AND Phone2 occurs-in Word2
 AND Phone1 is-different-than Phone2
 AND Word1 contrasts-with Word2
 AND Phone1 and Phone2 are-only-difference-between Word1 and Word2,
THEN Phone1 contrasts-in-identical-environment-with Phone2.

IF   Phone1 occurs-in Word1
 AND Phone2 occurs-in Word2
 AND Phone1 is-different-than Phone2
 AND Word1 contrasts-with Word2
 AND Phone1 and Phone2 are-in-analogous-environment-in Word1 and Word2,
THEN Phone1 contrasts-in-analogous-environment-with Phone2.

IF   Word1 has-different-form-than Word2
 AND Word1 has-different-meaning-than Word2,
THEN Word1 contrasts-with Word2.

IF   X is-position-of Phone1 in Word1
 AND Y is-position-of Phone2 in Word2
 AND Word1 before-position X  equals  Word2 before-position Y
 AND Word1 after-position X  equals  Word2 after-position Y,
THEN Phone1 and Phone2 are-only-difference-between Word1 and Word2.

IF   X is-position-of Phone1 in Word1
 AND Y is-position-of Phone2 in Word2
 AND Word1 before-position X  is-analogous-to Word2 before-position Y
 AND Word1 after-position X  is-analogous-to Word2 after-position Y,
THEN Phone1 and Phone2 are-in-analogous-environment-in Word1 and Word2.

Expert systems clone the knowledge of an expert in a particular problem domain, and thus make it possible for a computer user lacking such expertise to use that expert knowledge in solving a problem. Having an expert system on your computer can be like having an expert consultant at your side. Many problems in linguistic field work could be addressed in this way. An expert system could be used to answer questions like "Are X and Y variants of the same phoneme, or separate?" "How many emic levels of tone does this language have?" "How might this discourse particle be functioning?" "What strategy could I use to translate this term which is lacking in the target language?" "Is this language viable, or is it undergoing language death?" The approach taken in the implementation of expert systems is usually to work backwards from possible conclusions. Thus the user need not have assembled all the relevant facts of the situation. The expert system will perform an automated query on the knowledge base, or simply ask the user a question, in order to find the relevant aspects of the situation. It then branches to different sets of questions depending on which way the answers go.
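The strategy of working backwards from a conclusion can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a design for CELLAR's inference engine. The rules are ground (variable-free) paraphrases of some rules in Figure 6, the rule set is assumed to contain no cycles, and the names `prove`, `RULES`, and `ask_user` are hypothetical.

```python
def prove(goal, rules, facts, ask):
    """Try to establish `goal` by working backwards from it.

    rules: maps a conclusion to a list of alternative condition lists.
    facts: cache of goals already decided (updated in place).
    ask:   called for a goal that no rule concludes and no fact covers.
    """
    if goal in facts:
        return facts[goal]
    if goal in rules:
        # The goal holds if all conditions of at least one rule hold.
        result = any(all(prove(c, rules, facts, ask) for c in conditions)
                     for conditions in rules[goal])
    else:
        # No rule concludes this goal, so ask the user (or query the data).
        result = ask(goal)
    facts[goal] = result
    return result

# Ground paraphrase of part of Figure 6 (variables dropped for brevity).
RULES = {
    "different-phoneme": [["contrast-in-identical-environment"],
                          ["contrast-in-analogous-environment"]],
    "contrast-in-identical-environment": [
        ["phones-differ", "words-contrast", "only-difference"]],
    "contrast-in-analogous-environment": [
        ["phones-differ", "words-contrast", "analogous-environment"]],
    "words-contrast": [["different-form", "different-meaning"]],
}

# Invented answers standing in for an interactive user.
ANSWERS = {"phones-differ": True, "different-form": True,
           "different-meaning": True, "only-difference": False,
           "analogous-environment": True}
asked = []

def ask_user(question):
    asked.append(question)
    return ANSWERS[question]

conclusion = prove("different-phoneme", RULES, {}, ask_user)
```

With these invented answers the first rule fails and the second succeeds, so `conclusion` is true. Each question is asked at most once, and only the questions that bear on the goal are asked at all.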

The data storage mechanisms of CELLAR already provide the means for storing and organizing heuristic knowledge. The functional query component provides the means for testing the presence of conditions. All that is needed to incorporate an expert reasoning component in CELLAR is to implement an inference engine that is integrated with the data storage and query components.

Figure 7 summarizes the discussion by giving a block diagram of how the basic knowledge components of CELLAR interact.

Figure 7 CELLAR as a knowledge-based system

[figure missing]

8. Extensible

The CELLAR environment must be extensible. We recognize that it would never be possible to deliver a computing system for linguistic, literary, and anthropological research that would meet all the needs of all researchers. Clearly it would be impossible to anticipate all the kinds of information that would be stored and all the ways in which they would be manipulated. Even for kinds of information and manipulations that are anticipated, it would be impossible to package these in a way that would meet the precise needs or preferences of every researcher. The researcher must be able to modify and customize the configuration of the system. The researcher must also be able to add new kinds of information or new ways to manipulate the existing information.

CELLAR is not an application program; it is an extensible, high-level programming environment. An application program for doing X amounts to the descriptive, procedural, and heuristic knowledge about X that are entered into the system. The end user will have access to all of the knowledge installed by the experts who built a particular application. The end user will also have access to the application development tools which will make it possible to enter new general knowledge into the system, or to modify general knowledge which has already been defined.

9. Iconic

To complete the programming environment we need a front end for building the user interface to applications. Recent work in human factors research has shown that the event-driven WIMP (windows, icons, mouse, and pulldown menus) style of user interface is the easiest to learn and use. At the root of this is its iconicity. In general, an icon is a symbol that imitates something in real life. Linguists have picked up on this notion in analyzing linguistic signs. A symbol is iconic if there is a natural (as opposed to arbitrary) relationship between its form and its meaning (Givon 1984:30). Clearly, an iconic user interface should be easy to learn and to remember.

There are many iconic features in a WIMP style of interface. The graphical icons used to represent data objects and programs are just one such feature. The windows are icons for active tasks. Just like the papers on one's desktop, windows stack up and overlap as one task interrupts another. The mouse and the screen pointer which it controls are the most powerful icons in the system. The user selects a data object to manipulate, a task to reactivate, a procedure to run, or a menu option to choose, all by moving the mouse pointer to the on-screen representation of the item and pressing the button on the mouse to say, "This is the one I want." There is no command language to learn -- just the natural metaphor of pointing. Another iconic function performed with a mouse is called "dragging". Items are moved or copied by pointing to them, pressing the mouse button to select them, and then moving the mouse while the button is still pressed down. As long as the button stays down, the item is carried along on the screen with the mouse pointer. Files are deleted by dragging them into a garbage can icon. The options of a menu are displayed for view by dragging down on the menu's title -- an operation much like pulling down a window blind.

Underlying this iconic style of user interface is a programming strategy which is event-driven and object-oriented. In an event-driven program, the highest level is an event loop which continually monitors the console for the occurrence of events. The most important events in user interfaces are the pressing of keys on the keyboard, the movement of the mouse, and the pressing or lifting of the mouse button. The event loop looks for such an event to occur. In an object-oriented program, all of the program code is encapsulated within the definitions of the data objects. Rather than passing data to programs as in conventional programming, messages are sent to data objects to tell them, "Do such-and-such to yourself." If the object has a definition for how to do that, it responds appropriately; otherwise, an error is generated.

In an event-driven object-oriented program, whenever an event occurs, the event loop checks to see what screen object the pointer is on. The message that the event has occurred is then sent to the internal object which that screen object represents. The internal object checks its definition to see if it knows how to respond to that event. If so, it acts accordingly.
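The dispatch cycle just described can be sketched in Python as follows. This is an illustration under assumptions, not the Macintosh or CELLAR mechanism. A real event loop would poll the console continuously, whereas this one walks a pre-recorded list of events. Screen objects are plain rectangles, a later object in the list is assumed to lie on top of an earlier one, and an event for which the object has no handler is simply ignored. All names are hypothetical.

```python
class ScreenObject:
    """A rectangular screen object that carries its own event handlers."""

    def __init__(self, name, left, top, right, bottom):
        self.name = name
        self.bounds = (left, top, right, bottom)
        self.handlers = {}  # event kind -> script (a callable)

    def contains(self, x, y):
        left, top, right, bottom = self.bounds
        return left <= x <= right and top <= y <= bottom

    def on(self, kind, script):
        """Attach an event-handling script to this object."""
        self.handlers[kind] = script

    def receive(self, kind):
        """Respond to the message if this object knows how to."""
        script = self.handlers.get(kind)
        return script(self) if script is not None else None

def dispatch(event, screen_objects):
    """Send the event to the topmost object under the pointer, if any."""
    kind, x, y = event
    for obj in reversed(screen_objects):  # later objects are on top
        if obj.contains(x, y):
            return obj.receive(kind)
    return None

def event_loop(events, screen_objects):
    """Process each event in turn and collect the responses."""
    log = []
    for event in events:
        response = dispatch(event, screen_objects)
        if response is not None:
            log.append(response)
    return log

trash = ScreenObject("trash", 0, 0, 10, 10)
trash.on("mouse_up", lambda obj: obj.name + " emptied")
events = [("mouse_down", 5, 5), ("mouse_up", 5, 5), ("mouse_up", 50, 50)]
log = event_loop(events, [trash])
```

Only the second event produces a response. The first finds no handler for that kind of event on the object, and the third falls on no object at all.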

The opposite of event-driven programming is modal programming. This is the conventional style of programming employed in question/answer, menu, and command language styles of user interface. In such interfaces, the user is always in a mode -- that is, the user always has a limited number of keyboard responses that are valid, depending on the expectations of the particular program subroutine being executed at the time. If the user is in state X and wants to get into state Y (so as to perform a task allowed only in that mode), then the interface requires that the user navigate through the predefined modes to achieve a path from X to Y. This proves frustrating to users because what one part of the program conditions them to think of as natural is often disallowed in another part; it is always necessary to navigate to the right mode to get access to desired operations. In a nutshell, the hierarchical control that the programmer built into the subroutine calling structure controls how the user may interact with the program and data.

In the event-driven object-oriented paradigm, this hierarchy of modes disappears. Program actions are associated with screen objects and they are invoked by user-generated events happening on those screen objects. If the user is doing X and sees Y on the screen and wants to do Y next, then he simply moves the pointer to Y and generates the appropriate event.

Not only is this approach easier for the user, it can also be easier for the programmer. If the event loop processor is given to the programmer as the basic system monitor, then programming a user interface becomes a matter of linking data objects to screen objects and attaching event-handling scripts to them. We propose to make this aspect of programming in CELLAR just as user-accessible and extensible as the knowledge-based aspects discussed above.

The Macintosh computer's operating system has built-in support for the WIMP style of interface. The MacApp package for Macintosh software developers provides an event-driven, object-oriented framework out of which system programmers can build applications using this iconic style of user interface (Schmucker 1986, 1988). The concept of attaching event-handling scripts to objects represented on the screen is well developed in the HyperTalk programming language of the Macintosh HyperCard system (Goodman 1987, Shafer 1988). We will build on all of these ready-made solutions in building the user interface to CELLAR.


[1] This design has grown out of a three-day consultation on "The Computing Environment for Linguistic and Literary Research" held in October of 1987 at SIL's center in Dallas, TX. This consultation was convened in order to help me, in my capacity as director of SIL's academic computing department, begin developing plans for SIL's next major software development effort. I am therefore indebted to all who participated in the consultation. The full-time participants were: Alan Buseman, Mike Colburn, Robin Cover, Scott Deerwester, Steven DeRose, Steve Echerd, Joseph Grimes, Eugene Loos, Marc Rettig, Bruce Samuelson, and John Thomson. I am further indebted to John Thomson, Steve DeRose, and Robin Cover who have helped substantially in the further refinement of the design through both personal interaction and written comments on earlier drafts of this paper.


AAP. 1986. Standard for electronic manuscript preparation and markup. Electronic Manuscript Series. Washington, D.C.: Association of American Publishers.

-----. 1986. Author's guide to electronic manuscript preparation and markup. Electronic Manuscript Series. Washington, D.C.: Association of American Publishers.

Abbott, Russell J. 1983. Program design by informal English descriptions. Communications of the ACM 26(11):882-894.

Apple Computer. 1985. The Font Manager. In Inside Macintosh, volume 1, pages 215-240 (with updates in volume 4, pages 27-48, 1986). Reading, MA: Addison-Wesley.

-----. 1987. The Script Manager. In Inside Macintosh, volume 5, pages 1-29. Draft version published by Apple Programmers and Developers Association. To appear in Addison-Wesley series.

-----. 1987. Script Manager developer's package 10: release note. Apple Programmers and Developers Association, 1987. Includes "Testing with the Script Manager" (10 pp.), "Script Manager hints and recommendations" (15 pp.), "Kanji Talk 1.1 usage note" (24 pp.), "Arabic Interface System (AIS) 1.1 usage note" (19 pp.).

Backus, John. 1978. Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs. Communications of the ACM 21(8):613-641.

Barnard, David, Ron Hayter, Maria Karababa, George Logan, and John McFadden. 1988. SGML-based markup for literary texts: two problems and some solutions. Technical Report #204. Kingston, Ontario: Department of Computing and Information Science, Queen's University.

Becker, Joseph D. 1984. Multilingual word processing. Scientific American 251(1):96-107.

-----. 1987. Arabic word processing. Communications of the ACM 30(7):600-610.

Bidoit, M. and others. 1984. Exception handling: formal specification and systematic program construction. Proceedings of the 7th International Conference on Software Engineering, pages 18-29. IEEE.

Biggerstaff, Ted J., D. Mack Endres, and Ira R. Forman. 1984. TABLE: object-oriented editing of complex structures. Proceedings of the 7th International Conference on Software Engineering, pages 334-345. IEEE.

Brachman, Ronald J. and Hector J. Levesque. 1985. Readings in Knowledge Representation. Los Altos, CA: Morgan Kaufmann.

Buneman, Peter, Robert E. Frankel, and Rishiyur Nikhil. 1982. An implementation technique for database query languages. ACM Transactions on Database Systems 7(2):164-182.

Caplinger, Michael. 1985. Structured editor support for modularity and data abstraction. SIGPLAN Notices 20(7):140-147.

Chicago, University of. 1987. Chicago guide to preparing electronic manuscripts: for authors and publishers. Chicago: University of Chicago Press.

Comrie, Bernard and Norval Smith. 1977. Lingua Descriptive Studies: questionnaire. Lingua 42:1-72.

Conklin, Jeff. 1987. Hypertext: an introduction and survey. IEEE Computer, September 1987, pages 17-41.

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. Markup systems and the future of scholarly text processing. Communications of the ACM 30(11):933-947.

Date, C. J. 1981. An introduction to database systems (third edition). Reading, MA: Addison-Wesley.

Davis, Mark Edward. 1987. The Macintosh script system. Newsletter for Asian and Middle Eastern Languages on Computer 2(1&2):9-24.

Duda, Richard O. and John G. Gaschnig. 1981. Knowledge-based expert systems come of age. BYTE 6(9):238-281.

Dybvig, R. Kent and Bruce T. Smith. 1985. A semantic editor. SIGPLAN Notices 20(7):74-82.

Fraser, Christopher W. and A. A. Lopez. 1981. Editing data structures. ACM Transactions on Programming Languages and Systems 3(2):115-125.

Fujitani, Larry. 1984. Laser optical disk: the coming revolution in on-line storage. Communications of the ACM 27(6):546-554.

Furuta, Richard. 1986. An integrated, but not exact representation, editor/formatter. Ph.D. dissertation, University of Washington.

-----. 1987. Complexity in structured documents: user interface issues. Paper given at PROTEXT IV, Boston, October 1987. To appear in proceedings.

Givon, Talmy. 1984. Syntax: a functional-typological introduction. Amsterdam: John Benjamins.

Gonnet, Gaston and Frank W. Tompa. 1987. Mind your grammar: a new approach to modeling text. Technical report OED-87-01, University of Waterloo Centre for the New Oxford English Dictionary.

Goodman, Danny. 1987. The complete HyperCard handbook. Toronto: Bantam Books.

Grimes, Joseph E. To appear. Information dependencies in lexical subentries. To appear in a volume on the lexicon in computational linguistics, edited by Martha Evans.

Hayes-Roth, F., D. A. Waterman, and D. B. Lenat. 1983. Building expert systems. Reading, MA: Addison-Wesley.

Henderson, Peter. 1980. Functional programming: application and implementation. Englewood Cliffs, NJ: Prentice Hall.

Hendley, Tony. 1987. CD-ROM and optical publishing systems. Westport, CT: Meckler Publishing Corp.

Hood, Robert. 1985. Efficient abstractions for the implementation of structured editors. SIGPLAN Notices 20(7):171-178.

IEEE Computer Society. 1983. Special issue on knowledge representation. Computer 16(10).

ISO. 1986. Information processing -- text and office systems -- Standard Generalized Markup Language (SGML). ISO 8879-1986 (E). Geneva: International Organization for Standardization, and New York: American National Standards Institute.

Johnson, Jeff and Richard J. Beach. 1988. Styles in document editing systems. IEEE Computer, January 1988, pages 32-43.

Kimura, Gary D. 1986. A structure editor for abstract document objects. IEEE Transactions on Software Engineering SE-12(3):417-435.

Lafue, Gilles M. E. and Reid G. Smith. 1986. Implementation of a semantic integrity manager with a knowledge representation system. In Expert Database Systems: proceedings from the first international workshop, edited by Larry Kerschberg. Menlo Park, CA: Benjamin/Cummings. Pages 333-350.

Linden, Eugene. 1988. Putting knowledge to work. Time, March 28, 1988, pages 60-63.

Lindstrom, Gary. 1986. Static evaluation of functional programs. SIGPLAN Notices 21(7):196-206.

Marek, W. 1987. Completeness and consistency in knowledge base systems. In Proceedings from the First International Conference on Expert Database Systems, edited by Larry Kerschberg. Menlo Park, CA: Benjamin/Cummings. Pages 119-126.

Morgenstern, Matthew. 1986. The role of constraints in databases, expert systems, and knowledge representation. In Expert Database Systems: proceedings from the first international workshop, edited by Larry Kerschberg. Menlo Park, CA: Benjamin/Cummings. Pages 351-368.

Murdock, George P. and others. 1961. Outline of cultural materials (4th edition). New Haven, CT: Human Relations Area Files.

Schmucker, Kurt. 1986. Object-oriented programming for the Macintosh. Hasbrouck Heights, NJ: Hayden Books.

-----. 1986. MacApp: an application framework. BYTE 11(8):189-194.

-----. 1988. Using objects to package user interface functionality. Journal of Object-Oriented Programming 1(1):40-45.

Shafer, Dan. 1988. HyperTalk programming. Indianapolis, IN: Hayden Books.

Shaw, Alan C. 1980. A model for document preparation systems. Technical report 80-04-02, University of Washington, Department of Computer Science.

Shepherd, Allan and Larry Kerschberg. 1986. Constraint management in expert database systems. In Expert Database Systems: proceedings from the first international workshop, edited by Larry Kerschberg. Menlo Park, CA: Benjamin/Cummings. Pages 309-331.

Simons, Gary F. 1987. Multidimensional text glossing and annotation. Notes on Linguistics 39:53-60.

Simons, Gary F. and Larry Versaw. 1988. How to use IT: a guide to interlinear text processing (revised edition, version 1.1). Dallas, TX: Summer Institute of Linguistics.

Smith, Henry C. 1985. Database design: composing fully normalized tables from a rigorous dependency diagram. Communications of the ACM 28(8):826-838.

Weber, David. 1986. Reference grammars for the computational age. Notes on Linguistics 33:28-38.

Yemini, Shaula and Daniel M. Berry. 1987. An axiomatic treatment of exception handling in an expression-oriented language. ACM Transactions on Programming Languages and Systems 9(3):390-407.

Zdonik, Stanley B. 1986. Maintaining consistency in a database with changing types. SIGPLAN Notices 21(10):120-127.

Document date: 31-Oct-1995