SIL Electronic Working Papers 1997-008, December 1997
Copyright © 1997 Gary F. Simons and Summer Institute of Linguistics, Inc.
All rights reserved.
Gary F. Simons
Author's note: This paper was originally presented at SIL's General CARLA Conference, 14-15 November 1996, Waxhaw, NC. CARLA, for Computer-Assisted Related Language Adaptation, is the application of machine translation techniques between languages that are so closely related to each other that a literal translation can produce a useful first draft. Developing the DTD for parsed texts was part of my contribution as a member of a CARLA Design Team appointed by the SIL administration. I am deeply indebted to the two other members of this team, Andy Black and Bill Mann, for their key contribution in developing the conceptual framework on which the PTEXT DTD is based and in critically reviewing successive versions of it.
This paper describes a file format based on SGML that has been designed for the interchange of morphologically and syntactically parsed texts among natural language processing applications. A second purpose is to serve as an archival format for parsed text. After discussing the requirements for the file format, the paper describes and exemplifies the resulting format, named PTEXT (for "parsed text"). The full details are given in the commented SGML Document Type Definition (DTD) for the PTEXT format, which is supplied with the paper.
In considering the future development of CARLA software, the CARLA Design Team was concerned to place it in its broader context. That context is natural language processing (NLP). Adapting text from one language to another is just one way of applying natural language processing tools. Other applications include spelling checking, grammar checking, hyphenation, formal testing of linguistic analyses, and more. Many NLP tools have already been developed within SIL--for instance, AMPLE, STAMP, PC-KIMMO, PC-PATR, Hermit Crab, TonePars--and many more are likely to be developed in years to come. Developing a complete NLP application of the complexity of CARLA is a huge job. Rather than approaching this as a monolithic piece of software, it is more strategic to reuse smaller tools to perform specialized processes within the overall task. The CARLA Design Team concluded that a framework for developing future NLP applications needs a common conceptual model for the parsed texts on which the various processes operate, and a common file format for allowing these parsed texts to be interchanged freely among the processes.
The format that was designed for this purpose is named PTEXT, for "parsed text." Section 2 of this paper lists the requirements for this format that stem from the ontology, or essential nature, of parsed texts. Section 3 lists requirements that stem from the need to store parsed texts (such as fully analyzed source texts for translation) in an archive for eventual reuse. Section 4 gives an overview of the PTEXT format that was devised to meet these two sets of requirements. Section 5 gives an example of a parsed text in PTEXT format as it goes through three successive processes. Section 6 presents the full details of PTEXT by giving the commented DTD (document type definition) that implements PTEXT in SGML. Finally, section 7 concludes by discussing how the PTEXT formalism not only provides a format for interchange, but also defines a conceptual model for the information that NLP processes operate on.
This section lists requirements for parsed texts that have to do with the essential nature of parsed texts. The list does not attempt to be exhaustive; rather, it highlights points that are true of parsed text in general but which are not supported by the ANA file format that is already in use with AMPLE.
All of the above requirements are met by the parsed text interchange format proposed below.
This section lists requirements for parsed texts that stem from our need to archive them. When someone wants to do transfer and synthesis from a source text that was analyzed and archived by someone else 10 years earlier, then it is essential that the archived form of the parsed text be both self-documenting and self-contained. That is, the future user should not need to look any further than the parsed text itself to understand and use it. Out of this principle flow the following requirements:
All of the above requirements are met by the parsed text interchange format proposed below.
The PTEXT interchange format is implemented as an application of SGML, the Standard Generalized Markup Language. There are two basic reasons why SGML was used for this purpose as opposed to SIL's Standard Format convention. First, Standard Format does not have the expressive power to handle requirements like the need to have recursive hierarchical representations (such as for parse trees and feature structures) and the need to normalize information by pointing to a shared instance. Second, the Document Type Definition (DTD) of SGML (see section 6 below) not only provides formal documentation of the markup scheme, but also allows public domain parsers to validate the integrity of PTEXT instances. This is an indispensable capability for software developers and end users alike when the markup scheme is so rich.
As an introduction to the basic features of SGML markup, consider the following lexical entry for the English morpheme time which is marked up following the PTEXT scheme:
<lex id=lx0001 type=root cat=N>
  <form>time</>
  <gloss lng=SPN>tiempo</>
</lex>
In SGML, information is represented as a structure of hierarchically embedded elements. Each element is marked by a matching start tag and end tag, in this case <lex> and </lex>. An end tag can be abbreviated to just </> when the element contains no embedded elements; this is done for <form> and <gloss>. In addition to its content (that is, the material embedded between the start tag and the end tag), an element can be further enriched with attributes which are encoded within the start tag. In this case the lexical item declares attribute values for a unique identifier, a lexical type, and a syntactic category, while the gloss declares its language.
The next example gives an overview of the structure of a PTEXT document. The comments within it (delimited by <!-- and -->) explain the function of the top-level elements within a PTEXT. The one thing that requires further explanation is the opening !DOCTYPE declaration. This is an SGML keyword that declares the document type (in this case "ptext") and names the system file in which its document type definition (or DTD, see section 6) is to be found. The SGML parser uses this declaration to read the DTD and then validate the integrity of the document instance.
<!DOCTYPE ptext SYSTEM "ptext.dtd">
<ptext>
  <header>A comment about what is in this file</header>
  <pedigree>
    <!-- The history of processes that produced this file -->
  </pedigree>
  <declarations>
    <!-- Declarations of languages used, their case mappings,
         morphosyntactic categories, and type codes used for
         lexical items, glosses, and annotations -->
  </declarations>
  <lexicon>
    <!-- A complete list of all lexical items (e.g. morphemes,
         idioms) used in the analysis of this parsed text, along
         with gloss, type, category, feature structure, and so on -->
  </lexicon>
  <wordforms>
    <!-- A complete list of all the wordforms that occur in this
         parsed text along with their analyses -->
  </wordforms>
  <puncforms>
    <!-- A complete list of all the punctuation forms that occur
         in this parsed text -->
  </puncforms>
  <text>
    <!-- The text itself; represented as a sequence of segments
         (typically sentences) that contain an orthographic form,
         a phrase-structure analysis, and annotations
         (e.g. translations) -->
  </text>
</ptext>
One more feature of SGML markup needs to be explained before the examples which follow will make sense. This is the ID and IDREF mechanism for encoding pointers. In the DTD, certain attributes are declared as having ID (for "identifier") for their value type. This means that the value must be a string that uniquely identifies the element that the attribute is on. The SGML parser ensures that no two elements have the same ID string. In the PTEXT DTD, all attributes that take ID values are also named "id". For instance, here is a PTEXT syntactic category declaration with the unique ID of "N":
Other attributes are declared to have values of type IDREF (for "identifier reference"). This means that the attribute value must be a string that is the unique identifier of another element. The SGML parser ensures that all IDREF values are indeed the ID of an element elsewhere in the document. On the <lex> element, for instance, the cat attribute is defined to take an IDREF. Thus, the lexical item
<lex id=lx0001 type=root cat=N>
points to the <cat> element given above as the definition of its category. Similarly, "root" is a pointer to the definition of a lexical type. And in the following word structure analysis,
<ws id=ws0001 cat=N><m lex=lx0001></ws>
the morpheme element (<m>) points to the above lexical item as its content. Similarly, the following word element (<w>) in the text proper points to this word structure as its analysis (via the ana attribute).
<w form=wf001 ana=ws0001>
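The two checks that an SGML parser performs on these pointers can be sketched in Python. This is only an illustration, not part of PTEXT or of any SGML parser: the elements are given as plain (tag, attributes) pairs rather than parsed from a file, the sets ID_ATTRS and IDREF_ATTRS are assumptions modelled on the examples above, and the "lextype" and "wf" entries are hypothetical targets added so that every pointer resolves.

```python
# Assumption: which attribute names carry ID and IDREF values. The real
# assignments are made by the ATTLIST declarations in the PTEXT DTD.
ID_ATTRS = {"id"}
IDREF_ATTRS = {"cat", "type", "lex", "form", "ana"}


def check_pointers(elements):
    """Return error messages for duplicate IDs and dangling IDREFs.

    `elements` is a list of (tag, attributes) pairs in document order.
    """
    errors = []
    seen_ids = set()
    # First pass: every ID value must be unique within the document.
    for tag, attrs in elements:
        for name in sorted(attrs.keys() & ID_ATTRS):
            value = attrs[name]
            if value in seen_ids:
                errors.append(f"duplicate ID {value!r} on <{tag}>")
            seen_ids.add(value)
    # Second pass: every IDREF value must match some ID, wherever it occurs.
    for tag, attrs in elements:
        for name in sorted(attrs.keys() & IDREF_ATTRS):
            if attrs[name] not in seen_ids:
                errors.append(f"<{tag}> {name}={attrs[name]!r} matches no ID")
    return errors


# The elements quoted above, plus two hypothetical targets ("lextype", "wf")
# so that every pointer has something to point at.
elements = [
    ("cat", {"id": "N"}),
    ("lextype", {"id": "root"}),
    ("lex", {"id": "lx0001", "type": "root", "cat": "N"}),
    ("ws", {"id": "ws0001", "cat": "N"}),
    ("m", {"lex": "lx0001"}),
    ("wf", {"id": "wf001"}),
    ("w", {"form": "wf001", "ana": "ws0001"}),
]
print(check_pointers(elements))  # prints []
```

Collecting all IDs before checking any IDREF mirrors the fact that a pointer may refer to an element that appears later in the document.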
With this brief introduction to the mechanics of SGML markup, it is hoped that the reader will be able to work through the examples which follow in the next section. Consult the fully commented DTD in section 6 for a discussion of the meaning of each markup element.
This section gives an example of parsed texts that are encoded according to the PTEXT model. The example consists of a series of three SGML files that are encoded according to the PTEXT DTD. (The complete files are supplied here so that you may download them along with the DTD, which is given in section 6, and experiment with them.) The sample files represent successive stages in the analysis of the following input text:
\p Time flies.
The three programs through which this text is run are hypothetical. Note, however, that the integrity of the sample PTEXT files has been validated by running them through a public domain parser named SGMLS. Note, too, that some redundant sections of the files have been edited down to comments to conserve space.
The first process we run on this text is a TextIn processor that converts Standard Format files to PTEXT format. The result is as follows:
The next step runs this PTEXT file (timefly1.sgm) through a morphological analyzer. The changes to the file are the addition of another <process> to the <pedigree>, the addition of <categories> and <lexicon>, and the addition of word structure analyses (<ws>) to the wordforms (<wf>). The <text> portion is unchanged. The output is as follows:
The final step runs this PTEXT file (timefly2.sgm) through a disambiguation process. The only changes to the file are the addition of the DISAMBIG process to the <pedigree>, and the changes within the <s> element in the <text> portion of the file. The input file implicitly permits four readings: N N, N V, V N, and V V. This process narrows the possibilities to N V versus V N and uses the ana attribute on <w> to make these two readings explicit. The output is as follows:
This section supplies the complete DTD (document type definition) for the PTEXT interchange format. After each element definition there is a comment that explains the purpose of that element and its attributes. At a minimum, these should be read to get a sense of the full range of phenomena that are supported by the PTEXT formalism.
To understand the definition more deeply it is necessary to understand some things about the SGML definition language. Here, for instance, is the definition of <lex>:
<!ELEMENT lex - - (form, adapt?, fs?, gloss*) >
<!ATTLIST lex
    id    ID     #REQUIRED
    lang  IDREF  #IMPLIED
    type  IDREF  #IMPLIED
    cat   IDREF  #IMPLIED >
The !ELEMENT declaration has three parts: the name, tag omission controls, and the content model. The "- -" in this example means that neither the start tag nor the end tag may be omitted. When the declaration has "- o" it means that the end tag may be omitted. The content model is a regular expression that declares what elements (and in what pattern of occurrence) are allowed between the start tag and the end tag. The regular expressions are formed with the following operators:
? optional (zero or one)
* zero or more
+ one or more
Two special keywords may occur as the content model. EMPTY means that the element has no content; it is represented only by a start tag. #PCDATA means that the content is character data with no embedded elements.
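Because a content model is a regular expression over element names, its effect can be illustrated by translating it into an ordinary Python regular expression and matching it against the sequence of child elements. The sketch below is an assumption-laden illustration rather than how an SGML parser works internally: the function names are invented here, and only names, parentheses, the "," and "|" connectors, and the three occurrence indicators are handled (not "&", EMPTY, or #PCDATA).

```python
import re


def model_to_regex(model):
    """Translate a simple SGML content model into a compiled Python regex.

    Each element name becomes a token followed by one space, so that a
    sequence of child names can be matched as a single string.
    """
    parts = []
    for token in re.findall(r"[A-Za-z][\w.-]*|[(),|?*+]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in ")|?*+":
            parts.append(token)
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """True if the sequence of child element names satisfies the model."""
    text = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(text) is not None


LEX_MODEL = "(form, adapt?, fs?, gloss*)"
print(children_match(LEX_MODEL, ["form", "gloss", "gloss"]))  # prints True
print(children_match(LEX_MODEL, ["gloss", "form"]))           # prints False
```

Under the <lex> model, a <form> is obligatory and must come first, which is why the second call fails.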
The attribute list declaration (!ATTLIST) names the element for which the attributes are being defined and then declares each attribute in three parts: an attribute name, the attribute type, and the default value. The types ID and IDREF are explained above in section 4. Other possible value types are: IDREFS (permits multiple IDREFs), CDATA (an arbitrary character string), and a parenthesized list of fixed values. #REQUIRED means that there is no default value for the attribute; the markup must supply a value. #IMPLIED means that the value is optional in the markup; the application will infer the value if it is missing (typically as nil).
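The way an application might treat #REQUIRED and #IMPLIED when it reads a start tag can be sketched as follows. The table LEX_ATTLIST transcribes the <lex> declaration shown above; the function name, the sentinel objects, and the choice of None to stand for "nil" are assumptions made for this illustration.

```python
REQUIRED = object()  # stands in for the #REQUIRED keyword
IMPLIED = object()   # stands in for the #IMPLIED keyword

# Attribute name -> (value type, default), following the <lex> ATTLIST above.
LEX_ATTLIST = {
    "id": ("ID", REQUIRED),
    "lang": ("IDREF", IMPLIED),
    "type": ("IDREF", IMPLIED),
    "cat": ("IDREF", IMPLIED),
}


def resolve_attributes(attlist, given):
    """Combine the attributes found in a start tag with declared defaults."""
    unknown = sorted(set(given) - set(attlist))
    if unknown:
        raise ValueError(f"undeclared attributes: {unknown}")
    resolved = {}
    for name, (_value_type, default) in attlist.items():
        if name in given:
            resolved[name] = given[name]
        elif default is REQUIRED:
            raise ValueError(f"required attribute {name!r} is missing")
        elif default is IMPLIED:
            resolved[name] = None  # the application infers nil
        else:
            resolved[name] = default  # a literal default value
    return resolved


print(resolve_attributes(LEX_ATTLIST, {"id": "lx0001", "type": "root", "cat": "N"}))
# prints {'id': 'lx0001', 'lang': None, 'type': 'root', 'cat': 'N'}
```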
A final SGML keyword that occurs in this DTD is !ENTITY. This associates a name with a value. When %name; occurs in the DTD, the value associated with that name in an !ENTITY declaration is substituted in place.
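The substitution can be pictured with a small Python sketch. It is a simplification under stated assumptions: it handles only parameter entities declared with a double-quoted literal value, it makes a single substitution pass (so references nested inside entity values are not expanded further), and the entity name common.atts is hypothetical and not taken from the PTEXT DTD.

```python
import re

ENTITY_DECL = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"%([\w.-]+);")


def expand_parameter_entities(dtd_text):
    """Replace each %name; reference with the value declared for name."""
    entities = {}
    for name, value in ENTITY_DECL.findall(dtd_text):
        # If a name is declared twice, the first declaration is the one used.
        entities.setdefault(name, value)

    def substitute(match):
        # Leave references to undeclared names untouched.
        return entities.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(substitute, dtd_text)


# Hypothetical DTD fragment; "common.atts" is an invented entity name.
sample = (
    '<!ENTITY % common.atts "id ID #REQUIRED">\n'
    "<!ATTLIST lex %common.atts; >"
)
print(expand_parameter_entities(sample))
```

After expansion, the second line of the sample reads `<!ATTLIST lex id ID #REQUIRED >`.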
The markup scheme defined by the DTD is a new one, with one exception: the markup for feature structures (the <fs> tag and everything it may contain) is taken from the Text Encoding Initiative's guidelines. Here follows the complete DTD:
The series of examples in the preceding section illustrates the use of PTEXT as an interchange format. The three programs by which the text has been successively processed could have been written by three different programmers in three different programming languages. Furthermore, each could have worked without knowledge of the other two programs, yet the programs are guaranteed to work together because all are written to a common information interchange format.
The PTEXT framework does not require that all exchange of information between processes take place via SGML-encoded strings; it requires only that the parsed text output files (since they have the potential of being archived) be mapped into the interchange format. For instance, a single program might integrate a number of processes. Its input function would read the SGML-encoded PTEXT file and map the information into a data structure that was compatible with (a subset of) the PTEXT model. The processes inside the program would then operate on this data structure and pass it as the output of one process to the input of the next. On completion of the final process, the information would be written back out to a file in the PTEXT interchange format.
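A program that integrates several processes in this way might be organized along the following lines. This is a minimal sketch, not a description of any existing CARLA software: the PText class models only a small subset of the PTEXT structure, reading and writing the SGML file is left out, the process names are hypothetical, the toy lexicon simply lists N and V for each word of the sample text, and the disambiguation rule (keep readings with exactly one V) is an assumption chosen to reproduce the two readings kept in the example of section 5.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class PText:
    """Hypothetical in-memory stand-in for a subset of a PTEXT document."""
    pedigree: list = field(default_factory=list)   # processes run so far
    wordforms: dict = field(default_factory=dict)  # wordform -> categories
    text: list = field(default_factory=list)       # wordforms in text order
    readings: list = field(default_factory=list)   # category sequences kept


def text_in(raw):
    """Build a PText from a Standard Format line such as '\\p Time flies.'"""
    words = raw.replace("\\p", "").replace(".", " ").split()
    doc = PText(text=[w.lower() for w in words])
    doc.wordforms = {w: [] for w in doc.text}
    doc.pedigree.append("TextIn")
    return doc


def analyze(doc):
    """Attach every category that a toy lexicon allows for each wordform."""
    toy_lexicon = {"time": ["N", "V"], "flies": ["N", "V"]}
    for wordform in doc.wordforms:
        doc.wordforms[wordform] = list(toy_lexicon.get(wordform, []))
    doc.pedigree.append("Analyzer")
    return doc


def disambiguate(doc):
    """Keep only readings with exactly one V (an assumed rule)."""
    options = [doc.wordforms[w] for w in doc.text]
    doc.readings = [r for r in product(*options) if r.count("V") == 1]
    doc.pedigree.append("Disambig")
    return doc


def run_pipeline(doc, processes):
    """Pass the same data structure from each process to the next."""
    for process in processes:
        doc = process(doc)
    return doc


result = run_pipeline(text_in("\\p Time flies."), [analyze, disambiguate])
print(result.pedigree)  # prints ['TextIn', 'Analyzer', 'Disambig']
print(result.readings)  # prints [('N', 'V'), ('V', 'N')]
```

Each process records itself in the pedigree, so the history is ready to be written out with the rest of the data when the final result is mapped back into the interchange format.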
Note that the PTEXT formalism is more than just an interchange format; it also provides a conceptual model for parsed texts. In particular, the ontological requirements listed in section 2 specify aspects of the conceptual model of parsed texts that natural language processing applications should support (for instance, both morphological analysis and syntactic analysis, tree-structured analyses, feature structures, and so on). This is not to say that any application that employs the PTEXT interchange format must use all these conceptual features; it is perfectly acceptable for an application to use only a subset of the conceptual model as long as it maps what it uses onto the proper interchange markup. A conforming application should also pass any input material that it does not use through to the output unchanged. Development of applications that use the PTEXT format is already underway.
1. More information (including the program itself in many cases) is available on the Web about each of these programs:
2. The CARLA tools that are now in use within SIL use the ANA (for "Analysis") format. Here is an example of an ANA file.
3. SGML was adopted by the International Organization for Standardization in 1986 as international standard ISO 8879. Two excellent books for getting started with SGML are ABCD SGML: A User's Guide to Structured Information, by Liora Alschuler (International Thomson Computer Press, 1995) and Practical SGML, by Eric van Herwijnen (Kluwer Academic Publishers, 1990). Another outstanding resource is Robin Cover's SGML Web site at http://www.sil.org/sgml/.
4. See Chapter 16, "Feature Structures," of Guidelines for the encoding and interchange of machine-readable texts, edited by C. M. Sperberg-McQueen and Lou Burnard, Chicago and Oxford: Text Encoding Initiative (1994). See also "A rationale for the TEI recommendations for feature structure markup," by D. Terence Langendoen and Gary F. Simons, Computers and the Humanities, 29:191-209 (1995).
5. The first application to appear was IT2PTEXT. It converts interlinear text files in ITX and Shoebox format into PTEXT format. It is distributed as part of the LinguaLinks product, where PTEXT is used as an interchange format for the interlinear text analysis component.
Date created: 19-Dec-1997