SIL Electronic Working Papers 1998-006, December 1998
Copyright © 1998 Gary F. Simons and Summer Institute of Linguistics, Inc.
All rights reserved.


A paper presented at Markup Technologies '98, Chicago, 19-20 Nov 1998

Using architectural processing to derive small, problem-specific XML applications from large, widely-used SGML applications

Gary F. Simons


Contents:

Abstract
1. Introduction
2. Three problems
2.1. The problem of transition to XML
2.2. The problem of customization
2.3. The problem of "fatware"
3. Architectures to the rescue
4. Solving the three problems
4.1. Addressing the transition to XML
4.2. Addressing "fatware"
4.3. Addressing customization
4.4. A complete example
5. Validating data against both DTDs
6. An architectural approach to TEI conformance
7. Conclusion
References

Abstract

The large SGML DTDs in widespread use (e.g. HTML, DocBook, ISO 12083, CALS, EAD, TEI) offer the advantage of standardization, but for a particular project they often carry the disadvantage of being too large or too general. A given project might be better served by a DTD that is no bigger than is needed to solve the specific problem at hand, and that is even customized to meet special requirements of the problem domain. Furthermore, the project might prefer for the data it produces to meet the different syntactic constraints of XML conformity. This paper demonstrates how architectural processing can be used to develop a problem-specific XML DTD for a particular project without losing the advantage of conforming to a widely-used SGML DTD. As an example, the paper discusses the markup for a dictionary of the Sikaiana language (Solomon Islands) and develops a small XML application for the purpose derived from the TEI (Text Encoding Initiative) DTD. The TEI Guidelines offer a mechanism for building TEI-conformant applications; the paper concludes by proposing an alternative approach to TEI conformance based on architectures.


1. Introduction

The work described in this paper began as an effort to perform a particular markup task. Back in 1983, while doing linguistic field work in the Solomon Islands, I helped anthropologist William Donner (then a graduate student at the University of Pennsylvania) to produce a bilingual dictionary of the Sikaiana language [Don87]. For that purpose, we devised a one-of-a-kind markup system. Now, fifteen years later, we would like to put this data in a form that can be shared on the Web; conversion into a standardized form of markup is needed. The leading standard for the markup of dictionaries is the SGML-based TEI (Text Encoding Initiative) DTD [SMB94]. But using this DTD presents three main problems for this project, because what we really want is to deliver an XML application, to use a customized markup scheme that better fits the problem domain, and to work with a small DTD that is no bigger than the application needs.

This is, in fact, a general problem. The large SGML DTDs in widespread use (e.g. HTML, DocBook, ISO 12083, CALS, EAD, TEI) offer the advantage of standardization, but for a particular project they often carry the disadvantage of being too large or too general. A given project might be better served by a DTD that is no bigger than is needed to solve the specific problem at hand, and that is even customized to meet special requirements of the problem domain. Furthermore, the project might prefer to constrain the data it produces to stay within the bounds of XML conformity.

This paper demonstrates how architectural processing can be used to develop a problem-specific XML DTD for a particular project without losing the advantage of conforming to a widely used SGML DTD. Section 2 begins by elaborating on the three problems mentioned above. The paper uses SGML architectures to develop a solution to these problems. Section 3 gives an overview of architectures. Then section 4 describes exactly how they are used to solve the three problems. Section 5 shows how an SGML parser that includes an architectural processor can be used to simultaneously validate data against both a problem-specific XML DTD and a widely used SGML DTD. Finally, section 6 discusses the issue of TEI conformance. The TEI Guidelines offer an elaborate mechanism for building new applications that are both customized and conforming. The paper concludes by proposing an alternative approach to TEI conformance based on architectures.

2. Three problems

Using one of the large SGML DTDs in widespread use may pose some problems: one might really want to deliver an XML application, one might really want a customized markup scheme that better fits the problem domain, or one might want a small DTD that is no bigger than is really needed for the exact application. These three problems are discussed in more detail in the following three subsections.

2.1. The problem of transition to XML

With the growing popularity of XML and its potential for the publication of structured information on the Web, content developers want to use XML applications. However, our most widely used standards for structuring information are SGML applications like HTML, DocBook, ISO 12083, CALS, EAD, and TEI. A transition to XML poses two kinds of problems.

First, there are problems of XML well-formedness. A data file that is valid with respect to a given SGML application (including both the SGML declaration and the DTD) is likely not to be well-formed XML. For instance, SGML applications typically allow many of the end tags to be omitted, while XML prohibits this. Most SGML applications use '>' to close empty tags and processing instructions, while XML uses '/>' and '?>' respectively. [Cla97] gives a detailed listing of differences between SGML and XML.
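The difference is easy to observe with any XML parser. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree. The two sample fragments and the helper name is_well_formed are illustrative assumptions and are not taken from the Sikaiana data files. The first fragment omits end tags and closes an empty element with '>', as SGML applications commonly allow; the second is the XML equivalent.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# SGML style: end tags omitted, empty element closed with '>'
sgml_style = "<entry><form>pili<def>to run aground<ptr target='vaka'></entry>"

# XML style: every element closed, empty element closed with '/>'
xml_style = (
    "<entry><form>pili</form><def>to run aground</def>"
    "<ptr target='vaka'/></entry>"
)

print(is_well_formed(sgml_style))
print(is_well_formed(xml_style))
```

An SGML parser with a suitable SGML declaration and DTD could accept the first fragment; an XML parser rejects it before any DTD is consulted.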

Second, there are problems of XML validity. Many features of SGML DTDs are not allowed in XML DTDs. For instance, XML DTDs do not allow the tag minimization characters in element declarations, nor do they support inclusion exceptions or exclusion exceptions in content models. Furthermore, they allow PCDATA in content models only under very restricted conditions. For the sample problem of marking up the Sikaiana dictionary, these differences between SGML DTDs and XML DTDs mean that the TEI DTD cannot be used to deliver a fully XML solution. While the TEI DTD could be used with an SGML parser (with the right SGML declaration) to validate a well-formed XML file, it could not be used with an XML parser.

2.2. The problem of customization

In a particular markup project, it may be desirable, or even necessary, to build a customized DTD. While the large widely-used DTD may handle the essential features needed for the job, it could be that different names may make more sense for certain elements or attributes, or that new elements or attributes need to be added, or that it is more convenient to encode certain combinations of elements with fixed attribute values as new element types.

For instance, in the Sikaiana dictionary it is common for definitions to include embedded words in the Sikaiana language. Consider this entry:

hakamaatele, v. for the chief (aliki) to pray by calling out the names of spirits (aitu).

The TEI DTD prescribes that the definition in this entry be marked up as follows:

<def>for the chief (<foreign lang="SIK">aliki</foreign>)
to pray by calling out the names of spirits
(<foreign lang="SIK">aitu</foreign>)</def>

However, the tag <foreign lang="SIK"> is so common in this application that one would like to abbreviate it as <sik>. That is,

<def>for the chief (<sik>aliki</sik>) to pray by calling
out the names of spirits (<sik>aitu</sik>)</def>

As another example of where a customization is needed, consider the following entry:

pili, v. to run aground, of a boat or canoe; te vaka ni pili i te popolani, 'the boat ran ashore on the reef'; Idiom: toku vaka ni pili, 'I have made a mess of things (lit., my ship has wrecked)'.

This entry provides two example sentences, the second of which is explicitly marked as an idiom. The following is how the first example is marked up according to the TEI DTD (where <eg> is "example," <q> is "quoted," and <tr> is "translation"):

<eg><q>te vaka ni pili i te popolani,</q>
    <tr>the boat ran ashore on the reef</tr></eg>

But there is no really satisfactory way to mark up the idiom. The TEI has no tag for an idiom, nor does the <eg> element have a type attribute. By adding a type attribute to <eg> and <tr>, we could distinguish a normal example from an idiom and a normal translation from a literal one. For instance,

<eg type="idiom"><q>toku vaka ni pili,</q>
    <tr>I have made a mess of things</tr>
    <tr type="lit">my ship has wrecked</tr></eg>

Even better would be to add two new elements <idiom> and <lit> so that literal translations could be constrained to occur only with idioms. For instance,

<idiom><q>toku vaka ni pili,</q>
    <tr>I have made a mess of things</tr>
    <lit>my ship has wrecked</lit></idiom>

2.3. The problem of "fatware"

Five years ago, a cover story in Byte [PTUM93] decried the problem of "fatware"--software that just keeps getting bigger and bigger with each release without returning commensurate benefit to the user. Niklaus Wirth, in his plea for lean software [Wir95], sums up the situation thus: "Software's girth has surpassed its functionality."

I wonder if we aren't seeing a similar phenomenon with some of our favorite DTDs. Whether they have grown by accretion or were huge by original design, many widely-used DTDs are so large that a typical markup project needs only a fraction of the functionality in the DTD. In the world of software, the average user is much more likely to be successful in using a single-purpose tool that is focused on the task at hand than in trying to figure out how to apply a multipurpose tool that has more features than are needed for the task [Sim98]. The same must be true in the world of markup: a DTD that is focused on the task at hand must be easier for people to use than a large one that is full of features that will not be applied.

For the Sikaiana dictionary project, the TEI DTD proved to be huge in comparison to the subset of elements and attributes that were actually used. Having a DTD that is limited to just the elements and attributes that are used in a project simplifies many tasks like building project-specific software, specifying stylesheets, shipping the DTD with the data, and documenting markup practice.

Even more significant for the Sikaiana dictionary project than reducing the fat of unused elements and attributes was the matter of reducing the fat of overly permissive content models for the elements that actually were used. In the first case we want to reduce the size of the DTD; in the second case we want to reduce the size of the document space it generates. The TEI's model for dictionary markup is a descriptive one; the DTD aims to provide the user a means of tagging anything that could be encountered in previously published dictionaries. But in tagging the Sikaiana dictionary, our purpose was to be prescriptive; we wanted to specify constraints on how we would structure individual entries and then ensure that all the entries consistently followed that structure.

This point is easy to illustrate. Consider the following entry from the dictionary (where <gramGrp> is "grammatical information group," <pos> is "part of speech," and <re> is "related entry"):

atamai, 1. vs. to be intelligent, skillful, clever, knowledgeable. Hano pe a koe e atamai, aliki ei koe, 'if you are intelligent, you will become chief'. CAUS: hakaatamai 'to instruct, to make intelligent.' 2. n, location. the right side, as opposed to the left: te lima atamai, 'the right hand'; te vahi atamai, 'the right side'. ANT: vvale.

The TEI markup for the first sense is as follows:

<sense n="1">
  <gramGrp><pos>vs</pos></gramGrp>
  <def>to be intelligent, skillful, clever, knowledgeable.</def>
  <eg><q>Hano pe a koe e atamai, aliki ei koe,</q>
      <tr>if you are intelligent, you will become chief</tr></eg>
  <re type="causative">
    <form>hakaatamai</form>
    <def>to instruct, to make intelligent</def></re>
</sense>

The content model for <sense> as it is used in the Sikaiana dictionary is as follows:

( gramGrp, def, (eg | idiom | note)*, usg?, (xr | re)* )

That is, a sense contains an obligatory grammatical information group and definition, followed by optional examples and idioms which may have notes interspersed, an optional usage comment, and optional semantic cross-references and related entries for derivative forms. Contrast this with the content model for <sense> in the TEI DTD:

( sense | %m.dictionaryTopLevel | %m.phrase | #PCDATA )*

where,

<!ENTITY % m.dictionaryTopLevel "def | eg | etym | 
           form | gramGrp | note | re | trans | usg | xr"  > 

The entity reference %m.phrase expands to more than fifty phrase-level elements that can occur in paragraph-level elements throughout the TEI DTD. The result of this content model is that the TEI definition of <sense> allows recursion of senses, inclusion of more than fifty phrase elements, and free-standing PCDATA, all of which we do not want to allow in the Sikaiana dictionary. For the dictionary-specific elements that we do want to use, the TEI DTD has no required elements and puts no constraints on the order of the elements. This situation is far from satisfactory for any particular project that wants to enforce a consistent pattern in the structure of its entries.
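The contrast between the two content models can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not a DTD validator: it treats a sense as a sequence of child element names, encodes the Sikaiana model as a regular expression, and reduces the TEI model to "any of these children, in any order" (ignoring the phrase-level elements and PCDATA). The function and variable names are hypothetical.

```python
import re

# Sikaiana model: ( gramGrp, def, (eg | idiom | note)*, usg?, (xr | re)* )
# expressed as a regular expression over space-separated child names.
SIKAIANA_SENSE = re.compile(
    r"gramGrp def( (?:eg|idiom|note))*( usg)?( (?:xr|re))*"
)

# Simplified TEI model: sense plus the members of m.dictionaryTopLevel,
# repeatable in any order (m.phrase and #PCDATA are left out of the sketch).
TEI_SENSE_CHILDREN = {
    "sense", "def", "eg", "etym", "form", "gramGrp",
    "note", "re", "trans", "usg", "xr",
}


def valid_sikaiana_sense(children):
    """True if the child sequence follows the prescriptive project model."""
    return SIKAIANA_SENSE.fullmatch(" ".join(children)) is not None


def valid_tei_sense(children):
    """True if every child is permitted by the simplified TEI model."""
    return all(name in TEI_SENSE_CHILDREN for name in children)


first_sense = ["gramGrp", "def", "eg", "re"]  # the first sense of atamai
scrambled = ["re", "eg", "def"]               # no gramGrp, arbitrary order

for label, children in [("first_sense", first_sense), ("scrambled", scrambled)]:
    print(label, valid_sikaiana_sense(children), valid_tei_sense(children))
```

Both sequences pass the simplified TEI check, while only the first passes the project model, which is the kind of consistency constraint the project wants to enforce.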

3. Architectures to the rescue

These problems can be addressed by using architectural processing. The HyTime standard [ISO92][DD94] first introduced the concept of architectural forms as a way to associate standardized semantics with elements in user-defined DTDs. Since then the concept has been generalized and formally adopted into SGML as part of the SGML Extended Facilities in the 1997 revision of the HyTime standard [ISO97]. An excellent tutorial introduction to SGML architectures can be found in [Kim97]. An in-depth explanation of a particular application of architectures can be found in [Sim97]. See [Cov98] for an up-to-date listing of other resources relating to SGML architectures and their application.

An SGML architecture is an SGML document type that is used as a basis for deriving new document types. (For instance, [Meg98] includes three chapters on how to design new DTDs by deriving them from architectures.) In the same way that a class may be based on a superclass in object-oriented programming, a document type may be based on an architecture. Each of the elements in an architectural DTD is called an architectural form. An architectural form attribute is used on an element of the user document to specify the architectural form on which it is based. For instance, if one were using HTML as an architecture and html as the architectural form attribute, the tag <para html="P"> in a user document would say that this <para> element is derived from (or, inherits the semantics of) HTML's <P> element.

An architecture is defined by a DTD. It is often called a meta-DTD to emphasize its higher-level function, but its syntax is just like that of a normal DTD. We can exploit this fact in solving the problem at hand by using the existing widely-used SGML DTD as an architecture. We then write a problem-specific XML DTD to embody the constraints of the project and use an architectural form attribute to map the elements of the XML DTD onto the elements of the SGML architecture. Corkern [Cor97] proposes the same solution for the corporate setting in which many groups within the company must use the same standardized DTD; each group can comply by having its own "authoring DTD" that is architecturally derived from the "corporate DTD". Fortunately, the architectural processing mechanism has built-in provision for automatically mapping user elements onto architectural elements that have the same name. Thus, when the problem-specific DTD is essentially a subset of a widely-used DTD that is being used as an architecture, very little setup is needed to achieve the right mappings. See section 4.4 for an example.

4. Solving the three problems

The basic strategy is to build a problem-specific XML DTD that declares a widely-used SGML DTD as its base architecture. The following subsections give details of how the three problems of section 2 are addressed by this strategy.

4.1. Addressing the transition to XML

The problem of transition to XML is addressed in two ways. The requirement that the data can be validated by an XML parser is met by having the problem-specific DTD be a valid XML DTD. The requirement to assure XML well-formedness and validity while also assuring validity with respect to the SGML architecture is met by altering the SGML declaration used with the architectural DTD so that it can accept XML syntax in the document instance. In this way, an SGML parser with an architecture engine can parse an XML document and simultaneously validate it against the customized XML DTD and the base SGML DTD.

For the Sikaiana dictionary project, the following two changes needed to be made to the SGML declaration used with the TEI:

4.2. Addressing "fatware"

The problem of fatware is addressed by creating a DTD for the project that omits declarations for all the elements and attributes of the base DTD that are not used. This also requires that content models be revised to no longer reference omitted elements. At the same time, content models should be tightened to embody any additional constraints the project wants to enforce. For instance, elements that are optional in the base DTD could be required in the project DTD, or elements that may occur in free order in the base DTD could be constrained to a particular order in the project DTD. The discussion of <sense> in section 2.3 provides an example.

If a good sample document already exists, an easy way to proceed with creating the customized DTD is to use a DTD generator like FRED [Sha95]. FRED is a free service on the Web that analyzes a submitted SGML document and returns the DTD that is deduced for it. In the case of the Sikaiana dictionary project, the DTD returned by FRED was very close to what we wanted and was easy to modify. By contrast, the TEI DTD was so much bigger and more permissive than what we wanted that starting from scratch would have been easier than trying to edit it.

4.3. Addressing customization

The problem of customization is addressed by modifying the problem-specific XML DTD as needed. Elements that have the same name as the corresponding element in the architectural DTD will automatically map to the right architectural form. Any other elements must be explicitly mapped by using an architectural form attribute. For instance, consider the example from section 2.2 of mapping from the custom tag <sik> to the architectural equivalent <foreign lang="SIK">. Given that tei is the name declared for the architectural form attribute, the customization is achieved by the following definitions in the XML DTD:

<!ELEMENT sik    (#PCDATA) >
<!ATTLIST sik
     tei  NMTOKEN #FIXED "foreign"
     lang CDATA   #FIXED "SIK"   >

The other example from section 2.2 was the discrimination of normal example sentences from idioms; the latter differ in that they require different display formatting and may include a literal translation. This part of the XML DTD would be coded as follows:

<!ELEMENT idiom  (q, tr, lit?) >
<!ATTLIST idiom
     tei  NMTOKEN #FIXED "eg"  >
<!ELEMENT q      (#PCDATA)     >
<!ELEMENT tr     (#PCDATA)     >
<!ELEMENT lit    (#PCDATA)     >
<!ATTLIST lit
     tei  NMTOKEN #FIXED "tr"  >

These declarations say that the custom elements <idiom> and <lit> are really just specializations of the TEI elements <eg> and <tr>. The declarations for <q> and <tr> require no ATTLIST declaration to associate them with an architectural form since the architectural processing mechanism automatically associates an element with an architectural form of the same name.
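The effect of these mappings can be imitated in a few lines of code. The following Python sketch is a simplification and not an architecture engine: the FIXED table is a hand-written stand-in for the #FIXED attribute values that the DTD would supply, the function name to_architectural is hypothetical, and features such as attribute renaming are not modelled. Elements absent from the table map to the architectural form of the same name.

```python
import xml.etree.ElementTree as ET

# Stand-in for the #FIXED attributes declared in the XML DTD:
# element name -> (architectural form, fixed attributes to supply)
FIXED = {
    "sik": ("foreign", {"lang": "SIK"}),
    "idiom": ("eg", {}),
    "lit": ("tr", {}),
}


def to_architectural(elem):
    """Return a copy of elem with each element renamed to its architectural form."""
    # Elements without an explicit mapping keep their own name (automatic mapping).
    form, fixed_attrs = FIXED.get(elem.tag, (elem.tag, {}))
    out = ET.Element(form, {**elem.attrib, **fixed_attrs})
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(to_architectural(child))
    return out


source = (
    "<def>for the chief (<sik>aliki</sik>) to pray by calling "
    "out the names of spirits (<sik>aitu</sik>)</def>"
)
result = ET.tostring(to_architectural(ET.fromstring(source)), encoding="unicode")
print(result)
```

Applied to the definition from section 2.2, the sketch yields the markup with <foreign lang="SIK"> that the TEI DTD prescribes; applied to an <idiom> element, it yields an <eg> containing two <tr> elements.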

4.4. A complete example

The complete problem-specific DTD for the Sikaiana dictionary project is as follows:

<!-- sikaiana.dtd                                -->
<!-- XML DTD for the Sikaiana dictionary project -->

<!ELEMENT SikDict       (teiHeader, text) >
<!ATTLIST SikDict  tei  NMTOKEN #FIXED "TEI.2" >
<!ELEMENT text          (front, body)     >
<!--  Declarations for teiHeader and
      front are omitted to save space   -->

<!ELEMENT body     (entry+) >
<!ELEMENT entry    ( form+, ( xr | note+ | 
                     (etym?, sense+, (xr | re)*) )) >
<!ATTLIST entry    id     ID     #REQUIRED  
                   n      CDATA  #IMPLIED >

<!ELEMENT form     (#PCDATA) >
<!ATTLIST form     type   (headword|alternate) "headword" >

<!-- Etymology -->
<!ELEMENT etym     (#PCDATA | lang | mentioned | gloss )*  >
<!ELEMENT lang     (#PCDATA) ><!-- language name, abbrev -->
<!ELEMENT mentioned  (#PCDATA) ><!-- a source form       -->
<!ELEMENT gloss    (#PCDATA) ><!-- gloss of source form  -->

<!-- Senses of meaning -->
<!ELEMENT sense    ( gramGrp, def, (eg | idiom | note)*,
                     usg?, (xr | re)* ) > 
<!ATTLIST sense    n      CDATA  #IMPLIED  >

<!-- Grammatical information -->
<!ELEMENT gramGrp  ( pos, gram? ) >
<!ELEMENT pos      (#PCDATA) ><!-- part of speech        -->
<!ELEMENT gram     (#PCDATA) ><!-- further grammar note  -->
<!ATTLIST gram     type   CDATA  #FIXED  "note"  >

<!-- Definitions -->
<!ELEMENT def      ( #PCDATA | sik )* >
<!ELEMENT sik      (#PCDATA) ><!-- Sikaiana word(s)      -->
<!ATTLIST sik      tei    NMTOKEN  #FIXED "foreign"
                   lang   CDATA    #FIXED "SIK"     >
<!ELEMENT usg      (#PCDATA) ><!-- a usage note          -->

<!-- Examples -->
<!ELEMENT eg       ( q, tr, usg? ) >
<!ELEMENT idiom    ( q, tr, lit? ) >
<!ATTLIST idiom    tei    NMTOKEN  #FIXED "eg" >
<!ELEMENT q        (#PCDATA) ><!-- the example text      -->
<!ELEMENT tr       (#PCDATA) ><!-- free translation      -->
<!ELEMENT lit      (#PCDATA) ><!-- literal translation   -->
<!ATTLIST lit      tei    NMTOKEN  #FIXED "tr" >
<!ELEMENT note     (#PCDATA) >

<!-- Cross-references -->
<!ELEMENT xr       (ptr+) ><!-- a semantic cross-ref.    -->
<!ATTLIST xr       type   
                   ( seeAlso | antonym | generic
                   | contrast | causative | transitive
                   | whole | synonym | other | stative )
                   #REQUIRED >
<!ELEMENT ptr      EMPTY  ><!-- a pointer to an entry    -->
<!ATTLIST ptr      target CDATA  #REQUIRED >
<!-- The target is CDATA so that files for individual
     letters of the alphabet can be validated without
     being swamped by missing IDs.  In TEI, target is
     IDREF, so that full validation of cross-references
     occurs when parsed with the -Atei option.           -->
<!-- Related entry; i.e., a grammatical derivative       -->
<!ELEMENT re       ( form, gramGrp?, def? ) >
<!ATTLIST re       type
                   ( singular | other | causative
                   | passive | plural | repeatedAction
                   | stative | oneTimeAction | transitive)
                   #REQUIRED >

In comparing this DTD to the full TEI DTD, we see a situation that is like the difference between a single-purpose software tool and a general-purpose tool. In software, a key technology for building task-centered applications is to use a scripting language to build many single-purpose tools around a single many-purpose component [Sim98]. Analogously in markup, a key technology for building task-centered applications is to use architectural processing to map many single-purpose DTDs onto a single many-purpose DTD.

5. Validating data against both DTDs

In order to employ this technique of building a problem-specific XML DTD that is derived from a widely-used SGML DTD, one must use an SGML parser that incorporates a full architectural processing engine. The SP parser by James Clark [Cla98] is an example of such a parser.

First, we want to validate our project documents against just the problem-specific DTD. Use the -w xml command line option to run the SP parser in XML mode. In this mode, the parser issues warnings about anything in the DTD that is not valid in XML. For instance,

nsgmls -w xml -c xml.soc myData.xml

where xml.soc is an SGML Open catalog containing:

SGMLDECL xml.dcl
DOCTYPE  SikDict sikaiana.dtd

That is, the standard SGML declaration for XML (supplied with the SP parser) is used and sikaiana.dtd (see section 4.4) is the DTD that is used when the document element is <SikDict>.

Second, we want to use architectural processing to validate our project documents against the TEI DTD as well. The secret to setting up the parser to use architectural processing is to insert a declaration of the base architecture into the DTD. For this purpose, we create a new version of the project DTD named sik_tei.dtd:

 
<!-- sik_tei.dtd
     Sikaiana Dictionary Project
     DTD for mapping to TEI with SP parser-->

<!-- First, declare that this application is
     based on the TEI architecture -->

<?IS10744 ArcBase tei ?>

<!ENTITY % teiDTD SYSTEM "mypizza.dtd" >
<!NOTATION tei SYSTEM>
<!ATTLIST #NOTATION tei
     ArcDocF  NAME #FIXED TEI.2
     ArcFormA NAME #FIXED tei
     ArcDTD  CDATA #FIXED "%teiDTD" >

<!-- Now declare the elements of the Sikaiana
     dictionary application -->

<!ENTITY % sikaianaDTD SYSTEM "sikaiana.dtd">
%sikaianaDTD;

A base architecture named tei is declared by means of the processing instruction. Following this is the architectural support declaration. It consists of a notation declaration and an attribute definition list that sets the options that control the architecture engine. In this case, ArcDocF specifies the generic identifier for the document element of the architectural document, ArcFormA identifies the architectural form attribute, and ArcDTD specifies the parameter entity that refers to the file containing the architectural DTD. The file mypizza.dtd is a customized DTD downloaded from the TEI "Pizza Chef" [OUCS98]; it is a 98K file containing just the core tag set and the base tag set for dictionaries. Finally, the problem-specific DTD from section 4.4 is included without change; thus we still have a DTD for the document type SikDict.

Further, we create an alternative SGML Open catalog named tei.soc to use when we want to validate against both the problem-specific DTD and the TEI DTD. It uses the modified version of the TEI SGML declaration created in section 4.1 and the version of the problem-specific DTD that sets up architectural processing. That is,

SGMLDECL tei_xml.dcl
DOCTYPE  SikDict sik_tei.dtd

Running the SP tools with the -A command line parameter invokes architectural processing for the named architecture. Thus,

nsgmls -A tei -c tei.soc myData.xml

causes the document to be validated against both the problem-specific DTD and the architectural DTD. Note that the sgmlnorm member of the SP family can go even a step further to translate a project document into the equivalent architectural document.
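When many data files must be checked, the two invocations can be scripted. The following Python sketch builds the two command lines shown above. The helper names nsgmls_command and run_validation are hypothetical, and treating a zero exit status as "no errors reported" is an assumption about the installed parser that should be checked before relying on it.

```python
import subprocess


def nsgmls_command(catalog, document, architecture=None, xml_warnings=False):
    """Build an nsgmls command line using only the options shown in this section."""
    command = ["nsgmls"]
    if xml_warnings:
        command += ["-w", "xml"]          # warn about constructs not valid in XML
    if architecture is not None:
        command += ["-A", architecture]   # validate against the named architecture
    command += ["-c", catalog, document]
    return command


def run_validation(command):
    """Run the parser and return (ok, messages).

    Assumption: a zero exit status means that no errors were reported.
    """
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    return completed.returncode == 0, completed.stderr


project_check = nsgmls_command("xml.soc", "myData.xml", xml_warnings=True)
tei_check = nsgmls_command("tei.soc", "myData.xml", architecture="tei")
print(" ".join(project_check))
print(" ".join(tei_check))
```

Calling run_validation on each command for every file in the project gives a simple batch check against both the problem-specific DTD and the architectural DTD.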

6. An architectural approach to TEI conformance

The TEI Guidelines devote one chapter to the issue of conformance and another to mechanisms for modifying the DTD in a conforming manner [SMB94]. As the guidelines explain, the target uses of the DTD demanded that extension be possible:

The document type declaration provided by the TEI is intended to cover as wide a variety of document types and processing needs as proved feasible. It is impossible, however, for any finite list of text elements to cover every need of textual research and processing. As a result, extension of the TEI DTD has no effect on strict TEI conformance, as long as certain restrictions are observed; these have the effect of ensuring that later users of a file can easily see what changes have been made to the DTDs and what the new tags are intended to mean. [Section 28.5.3]

An extended TEI DTD is TEI conformant if it meets two basic requirements: (1) all extensions are documented in a prescribed way, and (2) all modifications are made in the DTD subset of the document (that is, the actual TEI DTD files may not be modified). To support DTD modification via the DTD subset, the TEI DTD was implemented using an ingenious system of entities:

In short, virtually any change (including wholesale redefinition) is conformant, as long as it is done using the prescribed mechanisms. Such a liberal view of conformance is probably troubling to most. The guidelines partially address this in section 29.1 by defining two classes of modifications: clean modifications versus unclean modifications. The implication is that the former are preferred over the latter:

clean modification
A modification is clean if the set of documents parsed by the original DTD may be properly contained in the set of documents parsed by a modified DTD, or vice versa.
unclean modification
A modification is unclean if the set of documents parsed by the original DTD overlaps the set of documents parsed by the modified DTD with neither being properly contained in the other.

Note that a modification that renames an element without creating a conflict with an existing element name is considered clean (section 29.1.2) since the set of documents matching the modified DTD is isomorphic to the set of documents matching the original DTD.
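The two definitions are set-theoretic and can be illustrated with toy data. In the Python sketch below, each "document" is reduced to a string of child element names and each DTD to the finite set of such strings it accepts; real DTDs generally accept unbounded sets, so this is only an illustration of the definitions. The function name and the sample sets are assumptions. Identical sets and disjoint sets are reported as "not covered" because the definitions quoted above do not address them.

```python
def classify_modification(original_docs, modified_docs):
    """Classify a DTD modification by comparing the sets of documents accepted."""
    # Clean: one set is properly contained in the other.
    if original_docs < modified_docs or modified_docs < original_docs:
        return "clean"
    # Unclean: the sets overlap, but neither contains the other.
    if original_docs & modified_docs and original_docs != modified_docs:
        return "unclean"
    return "not covered"


original = frozenset({"def", "def eg"})
extended = original | {"def eg lit"}    # accepts everything the original does, and more
restricted = frozenset({"def"})         # accepts only part of what the original does
mixed = frozenset({"def", "def lit"})   # drops one document and adds another

for name, docs in [("extended", extended), ("restricted", restricted), ("mixed", mixed)]:
    print(name, classify_modification(original, docs))
```

Pure extension and pure restriction both come out as clean; a modification that both removes and adds documents comes out as unclean.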

The TEI DTD was developed before the notion of SGML architectures was generalized. Had architectures existed, the TEI could have avoided devising its elaborate system of extension by adopting an architectural approach to conformance. Such an approach might work something like the following.

The TEI notion of original DTD corresponds to the architectural DTD and the TEI notion of modified DTD corresponds to the derived problem-specific DTD. A problem-specific DTD would be TEI conformant if it declared the TEI DTD to be its base architecture. Such a definition is comparable in its liberality to the TEI's definition. What is more significant is the distinction between clean and unclean conformance and the contribution the architectural approach can make to that question.

In the TEI approach to conformance, the distinction between clean and unclean has a formal definition in terms of the two sets of documents matched by the two DTDs: a modification is clean when one set properly contains the other and unclean when the sets merely overlap. In the TEI approach, however, the SGML parser cannot validate a modification as being clean or not; this is simply a matter for the DTD designer to reason about. The architectural approach can change this. For both documents and DTDs, we could define two kinds of conformance in terms of parser behavior as follows:

clean conformance
A document conforms cleanly to its base architecture if its corresponding architectural document is valid with respect to the architectural DTD. A problem-specific DTD conforms cleanly to its base architecture if every document that is valid for that DTD also conforms cleanly to the base architecture.
unclean conformance
A document conforms uncleanly to its base architecture if its corresponding architectural document is not valid with respect to the architectural DTD. A problem-specific DTD conforms uncleanly to its base architecture if there is at least one document that is valid for that DTD but which does not conform cleanly to the base architecture.
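Because these definitions are stated in terms of parser behavior, they translate directly into a decision procedure. The Python sketch below assumes that two validation results are already available for each document (one against the problem-specific DTD, one against the architectural DTD); the function names and the "invalid" label for documents that fail their own DTD are illustrative assumptions. Since a finite sample cannot establish clean conformance for a DTD in general, the DTD-level function only reports what the sample shows.

```python
def document_conformance(valid_for_own_dtd, valid_for_architecture):
    """Classify one document from its two validation results."""
    if not valid_for_own_dtd:
        return "invalid"  # not a valid document of the problem-specific DTD
    return "clean" if valid_for_architecture else "unclean"


def observed_dtd_conformance(results):
    """Summarize a DTD's conformance over a sample of (own, architecture) results.

    One unclean document is enough to show that the DTD conforms uncleanly;
    the absence of such a document in the sample proves nothing in general.
    """
    kinds = [document_conformance(own, arch) for own, arch in results]
    if "unclean" in kinds:
        return "unclean"
    return "clean for all documents tested"


sample = [(True, True), (True, True), (True, False)]
print([document_conformance(own, arch) for own, arch in sample])
print(observed_dtd_conformance(sample))
```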

This definition of clean conformance has essentially the same coverage as the TEI definition. The TEI definition has three basic cases which correspond as follows in the architectural approach:

This architectural approach to defining clean conformance has a major advantage over the TEI approach, namely, the SGML parser can formally test clean conformance for any user document. When a document is validated simultaneously against its own DTD and its architectural DTD, it conforms cleanly if no errors are reported for either DTD. When a document is valid against its own DTD but generates errors with respect to the architectural DTD, it conforms uncleanly. When this happens there are two cases:

This approach does have one major weakness: the SGML parser can only verify that a particular document instance conforms to the architecture; it cannot verify that the problem-specific DTD conforms in the general case to the architectural DTD. That is, there is no way to ensure in advance that a particular problem-specific DTD only accepts documents that are also architecturally valid. For a case like the Sikaiana dictionary project, in which there is a closed set of data files, and we can easily validate them all against both DTDs, this limitation does not pose a problem. On the other hand, in a case like an industrial setting, where a run-time validation error could bring production to a screeching halt, this limitation could be a serious one.

7. Conclusion

The prototypical use of architectural processing has been to annotate one DTD with respect to the forms (or semantic elements) of another. This paper has demonstrated the application of architectural processing for a different purpose, namely, to indirectly validate a small DTD developed for a particular project against a large widely-used DTD that it is meant to be based on. By using this technique, a DTD developer can enjoy the benefits of a customized XML DTD without losing the benefits of the intellectual effort that went into developing the widely-used SGML DTD. By the same token, a project can have the advantages of delivering a customized XML application without losing the advantages of conforming to one of the widely-used SGML applications.

References

[Cla97] Clark, J. (1997) "Comparison of SGML and XML," World Wide Web Consortium NOTE-sgml-xml-971215. <http://www.w3.org/TR/NOTE-sgml-xml>.

[Cla98] Clark, J. (1998) SP: An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language, version 1.3. <http://jclark.com/sp/>. See especially "Architectural form processing," <http://jclark.com/sp/archform.htm>.

[Cor97] Corkern, C. (1997) "From architectures to authoring DTDs," SGML/XML '97 Conference Proceedings, pages 263-268. Alexandria, VA: Graphic Communications Association.

[Cov98] Cover, R. (1998) "Architectural Forms and SGML/XML Architectures," in The SGML/XML Web Page. <http://www.oasis-open.org/cover/topics.html#archForms>.

[DD94] DeRose, S. and Durand, D. (1994) Making Hypermedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. See especially pages 79-90.

[Don87] Donner, W. (1987) Sikaiana Vocabulary: Na male ma na talatala o Sikaiana. Honiara, Solomon Islands: published by the author through a grant from the South Pacific Cultures Fund of the Australian government. 267 pp.

[ISO92] International Organization for Standardization. (1992) ISO/IEC 10744. Hypermedia/Time-based Structuring Language: HyTime.

[ISO97] International Organization for Standardization. (1997) "Architectural Form Definition Requirements (AFDR)," Annex A.3 of ISO/IEC N1920, Information Processing--Hypermedia/Time-based Structuring Language (HyTime), Second edition 1997-08-01. <http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html>.

[Kim97] Kimber, W. E. (1997) "A tutorial introduction to SGML architectures," an ISOGEN International Corporation workpaper. <http://www.isogen.com/papers/archintro.html>.

[Meg98] Megginson, D. (1998) Structuring XML Documents. Charles F. Goldfarb Series on Open Information Management. Upper Saddle River, NJ: Prentice Hall.

[OUCS98] Oxford University Computing Services, Humanities Computing Unit. (1998) "The Pizza Chef: a TEI tag set selector," an interactive service on the Web. <http://www.oucs.ox.ac.uk/humanities/TEI/pizza.htm>.

[PTUM93] Perratore, E., T. Thompson, J. Udell, and R. Malloy. (1993) "Fighting fatware," Byte (April 1993), pp. 98-108.

[Sha95] Shafer, K. (1995) "Creating DTDs via the GB-Engine and Fred," a paper presented at SGML '95. <http://www.oclc.org/fred/docs/sgml95.html>. The software is available at <http://www.oclc.org/fred/>.

[Sim97] Simons, G. (1997) "Using architectural forms to map SGML data into an object-oriented database," SGML/XML '97 Conference Proceedings, pages 449-459. Alexandria, VA: Graphic Communications Association. A fuller workpaper is available at <http://www.sil.org/cellar/import/>.

[Sim98] Simons, G. (1998) "In search of task-centered software: building single-purpose tools from multipurpose components," SIL Electronic Working Paper 1998-004. Dallas: Summer Institute of Linguistics. <http://www.sil.org/silewp/1998/004/>.

[SMB94] Sperberg-McQueen, C. M. and L. Burnard (eds.). (1994) Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative. <http://www-tei.uic.edu/orgs/tei/p3/elect.html>. See especially chapter 12, "Print dictionaries," chapter 28, "Conformance," and chapter 29, "Modifying the TEI DTD."

[Wir95] Wirth, N. (1995) "A plea for lean software," IEEE Computer (February 1995), pp. 64-68.


Date created: 29-Dec-1998
URL: http://www.sil.org/silewp/1998/006/silewp1998-006.html
Questions/Comments: SILEWP@sil.org

