Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics
Release date: 12 November 1997
Last revised: 15 December 1997
This working paper documents a process for importing SGML data into the CELLAR database. The process, which requires no change to the SGML data and no special-purpose programming on the CELLAR side, is based on a relatively new SGML feature named architectural forms. The user writes a meta-DTD that maps the elements in the SGML data onto architectural forms that express the corresponding objects and attributes in CELLAR. Then an SGML parser uses this to create an "architectural document" that an existing CELLAR parser reads to build the corresponding structure of objects in the CELLAR database.
- The SGML model versus the object model
- An overview of the process for importing SGML data into the CELLAR database
- A step-by-step guide to running the process
- The architecture for mapping SGML data into CELLAR objects
- Solutions to common mapping problems
- Complete examples
- Implementation within CELLAR
I am deeply indebted to my colleague Robin Cover who has helped in many ways over the course of this project. He has gone the extra mile in helping me to find resources and in offering useful feedback and encouragement.
Much of the promise of SGML lies in the fact that descriptively marked up data can be used by multiple applications. Given the fact that an SGML DTD has much in common with the conceptual model that results from an object-oriented analysis of a problem domain, it is logical to conclude that SGML data should be particularly amenable to being imported into software that uses an object-oriented data model. This is not a trivial task, however, since there are some fundamental differences between the SGML model of data and the object model.
This working paper explores that general problem as it develops a solution to a more specific problem, namely, how to import existing SGML data into an existing object-oriented database schema without changing either the SGML data or the database schema. The target system is an object-oriented database system named CELLAR (for Computing Environment for Linguistic, Literary, and Anthropological Research [RST93] [Sim97a]). The solution uses architectural processing to map the SGML data onto architectural forms that the CELLAR system can use to construct the corresponding structure of objects.
Section 2 of the paper discusses the basic differences between the SGML model of data and the object model, and illustrates why the mapping from SGML elements to objects is not a trivial one. Section 3 gives an overview of the solution developed in this paper by explaining how architectural processing works, while section 4 gives detailed step-by-step instructions on how to import SGML data into the CELLAR database. Section 5 presents the complete architecture for mapping from SGML elements to CELLAR objects. Section 6 documents the architecture by giving solutions to common mapping problems, while section 7 does so with a set of complete examples. Finally, section 8 explains the implementation on the CELLAR side and section 9 offers concluding remarks.
2. The SGML model versus the object model
The problems inherent in importing SGML data into an object database stem from the differences between the SGML model of data and the object model of data. The fundamental problem is that some elements in the SGML data correspond to objects, others correspond to attributes, and still others correspond to both. Architectural forms offer a means of encoding the semantics of these relationships. The problem of mapping from the SGML model to the object model, and the way architectural forms can be used to bridge the gap, are developed on a separate page.
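The three cases (object, attribute, or both) can be illustrated with a toy model. The following Python sketch is not the CELLAR architecture; the role table, the element names (entry, hw, sense, def), and the class and attribute names are all hypothetical, chosen only to show why each element needs its role stated before a structure of objects can be built.

```python
from typing import Any, Dict

# Hypothetical role table for a tiny dictionary entry (names are not from CELLAR).
#   "cls" only      -> the element corresponds to an object
#   "attr" only     -> the element corresponds to an attribute of its owner
#   "attr" + "cls"  -> the element corresponds to both: an attribute of its
#                      owner whose value is a new object
ROLES: Dict[str, Dict[str, str]] = {
    "entry": {"cls": "Entry"},
    "hw": {"attr": "headword"},
    "sense": {"attr": "senses", "cls": "Sense"},
    "def": {"attr": "definition"},
}


def build_object(node: Any, roles: Dict[str, Dict[str, str]] = ROLES) -> Dict[str, Any]:
    """Build a dict standing in for an object from a (gi, children) tuple."""
    gi, children = node
    obj: Dict[str, Any] = {"class": roles[gi]["cls"]}
    for child in children:
        if isinstance(child, str):
            continue  # this sketch ignores text directly inside an object element
        child_gi, grandchildren = child
        role = roles[child_gi]
        if "cls" in role:
            # "both": store a new object in the named attribute of the owner
            obj.setdefault(role["attr"], []).append(build_object(child, roles))
        else:
            # "attribute": the element's text becomes the attribute value
            obj[role["attr"]] = "".join(t for t in grandchildren if isinstance(t, str))
    return obj


entry = ("entry", [("hw", ["cat"]), ("sense", [("def", ["a small feline"])])])
result = build_object(entry)
```

Without the role table, nothing in the markup itself says that hw is a property of the entry while sense introduces a new object; that missing information is what the architectural forms are meant to supply.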
3. An overview of the process for importing SGML data into the CELLAR database
The method for importing SGML data into the CELLAR database is based on architectural forms. The HyTime standard [ISO92] first introduced the concept of architectural forms as a way to associate standardized semantics with elements in user-defined DTDs [DD94]. Now that this notion has been generalized in the SGML Extended Facilities (defined in Annex A of the revised HyTime standard [ISO97]), we can use it to good advantage in solving the problem at hand. Architectural forms provide a mechanism we can use to express the semantics of how SGML elements map onto the object model. See [Cov97] for pointers to other applications of architectural forms.
To understand the process for importing SGML data into CELLAR, one must first understand how architectural processing works. Normally, an SGML parser reads an input document with its DTD and either validates the document or produces a normalized output representation of it. In the case of architectural processing, the SGML parser reads an input document with its DTD (called the client document and the client DTD) and produces an output document that conforms to a different DTD (called the architectural document and architectural DTD). The following diagram gives a graphical overview of the process:
With the nsgmls parser from the SP package [Cla97], architectural processing is invoked by giving the -A command line option, followed by the name of the architecture to use. The name must be declared in an ARCBASE processing instruction in the client document; following this is the architectural support declaration, which tells the parser how to process the architecture. (The notation is grossly simplified in the diagram.) This declaration specifies the DOCTYPE of the architectural document (arcDocF), the architectural DTD (arcDTD), and the architectural form attribute (arcFormA). The latter names the attribute in the client document whose value contains the element type for the corresponding element in the architectural document. In the diagram, arch is specified as the architectural form attribute. Thus <client arch=target> means that the corresponding element in the architectural document is <target>. When the parser is performing architectural processing, it not only translates the elements of the client document into the corresponding elements of the architecture; it also validates the architectural document being produced against the architectural DTD to ensure that the output is a valid document in that architecture. A DTD that defines an architecture is also known as a meta-DTD.
The process diagrammed above requires that the client document be already annotated with the values of the architectural form attribute and other architectural attributes. This, however, violates our basic requirement that the process for importing SGML data should not require that we change the SGML data file. This problem can be solved by performing architectural processing twice, first to add the architectural attributes to the client document and then to create the architectural document.
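The role of the architectural form attribute can be sketched in a few lines of Python. This is a deliberately simplified model, not what nsgmls does internally: it assumes that an element lacking the form attribute is dropped while its content is kept, it copies all other attributes unchanged, and it performs no validation against an architectural DTD. The Element class and the to_architectural function are names invented for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    gi: str  # generic identifier, i.e. the element type name
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def to_architectural(elem: Element, form_attr: str = "arch") -> List[Union[Element, str]]:
    """Derive the architectural counterpart of a client element (toy model).

    An element carrying the form attribute is renamed to the attribute's
    value. An element without it contributes only its content (an
    assumption of this sketch).
    """
    content: List[Union[Element, str]] = []
    for child in elem.children:
        if isinstance(child, str):
            content.append(child)
        else:
            content.extend(to_architectural(child, form_attr))
    form = elem.attrs.get(form_attr)
    if form is None:
        return content
    other_attrs = {k: v for k, v in elem.attrs.items() if k != form_attr}
    return [Element(form, other_attrs, content)]


# <client arch=target id=x1>text <b>bold</b></client>
client = Element("client", {"arch": "target", "id": "x1"}, ["text ", Element("b", {}, ["bold"])])
derived = to_architectural(client)
```

Here derived holds a single target element: the client element has been renamed according to its arch attribute, and the unmapped b element has disappeared while its text remains.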
The nsgmls parser can do this in one pass by specifying two -A options on the command line. The following diagram illustrates what happens in a case like this:
In the first step, the architecture named mapping is invoked. It uses a mapping DTD to supply the architectural form attribute and any other architectural attributes for the elements of the client document. The result is an architecturally annotated version of the client document; this document is virtual in that it is never written to a file. (It can be written by running just nsgmls -Amapping.) The architecture named cellar is then invoked with the virtual document as input. The result is the corresponding document that conforms to the DTD for the CELLAR architecture. The next section gives step-by-step instructions on how to set up and run the process for any given SGML data file.
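For readers who script the import, the command line described above can be assembled programmatically. The following Python sketch uses only the -A option and the architecture names given in this paper; the helper names nsgmls_command and run_import are invented for illustration, and the sketch assumes that nsgmls is installed and on the search path.

```python
import subprocess
from typing import List, Sequence


def nsgmls_command(architectures: Sequence[str], files: Sequence[str]) -> List[str]:
    """Build an nsgmls command line with one -A option per architecture."""
    command = ["nsgmls"]
    command += [f"-A{name}" for name in architectures]
    command += list(files)
    return command


def run_import(files: Sequence[str], output_path: str,
               architectures: Sequence[str] = ("mapping", "cellar")) -> None:
    """Run nsgmls and save its output, as with '>output' on the command line."""
    with open(output_path, "w") as out:
        subprocess.run(nsgmls_command(architectures, files), stdout=out, check=True)


# Both architectures: the annotated client document stays virtual.
full = nsgmls_command(["mapping", "cellar"], ["dec1.tei", "clement.sgm"])
# Mapping architecture only: the annotated client document itself is written.
first_step = nsgmls_command(["mapping"], ["dec1.tei", "clement.sgm"])
```

Calling run_import(["dec1.tei", "clement.sgm"], "clement.clr") would correspond to the command line given for the critical text example in section 7.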
4. A step-by-step guide to running the process
The step-by-step instructions are provided as a separate page so that they can be used as stand-alone documentation by those who are running the process.
5. The architecture for mapping SGML data into CELLAR objects
The architecture for mapping SGML data into CELLAR objects is embodied in the file cellar.dtd. The latest version of that file is always the authoritative definition of the architecture. The DTD contains brief comments for every element and attribute in the architecture; thus it also functions as the basic reference documentation on the architecture.
At this point, there is no other reference documentation on an element-by-element and attribute-by-attribute basis. The primary documentation is more tutorial and applied, organized as solutions to common mapping problems. Where the comments in the DTD or the solved problems do not provide enough information, the user is referred to the complete examples. Searching through the mapping DTDs in all the examples should yield examples of any particular architectural attribute being used.
6. Solutions to common mapping problems
A guide to common mapping problems and their solutions is given in a separate page. Use this page as a reference when you are trying to build a mapping DTD.
7. Complete examples
The following complete, working examples are provided:
Download the nsgmls parser in order to run them yourself.
TEI Lite is a subset of the Text Encoding Initiative's (TEI) full DTD. It is suitable for marking up many documents. See http://www.uic.edu/orgs/tei/lite/. In the example, two TEI Lite documents have been mapped into the class Article in the CELLAR database. (This class has been used to implement many of the reference resources in LinguaLinks [SIL97].) The mapping DTD does not handle the entire TEI Lite DTD; rather, it only maps the elements that occur in the two sample documents.
|Sample documents||ceth9605.sgm -- a trip report|
|||edw68.sgm -- a TEI work paper|
|Command lines||nsgmls -Amapping -Acellar dec1.tei ceth9605.sgm|
|||nsgmls -Amapping -Acellar dec1.tei edw68.sgm >output|
This example uses two papers from the electronic proceedings of the Graphic Communications Association's SGML '96 conference. Again, these are mapped into the Article class used in LinguaLinks [SIL97].
|Command lines||nsgmls -Amapping -Acellar xml1.dec fuchs.sgm|
|||nsgmls -Amapping -Acellar xml1.dec kimber.sgm >output|
The SGML data file in this example is a critical edition in TEI markup of a passage from the Second Epistle of Clement. A fuller treatment of this sample text along with examples of what can be done with it in the CELLAR environment is given in [Sim97a]. See Chapter 19: Critical Apparatus of the TEI Guidelines [TEI94] for an explanation of the markup. (This text uses the "Parallel Segmentation Method;" see section 19.2.3.) This is the example used in the step-by-step instructions where a diagram of the conceptual model is given.
|Command line||nsgmls -Amapping -Acellar dec1.tei clement.sgm >clement.clr|
|Domain model||CriticalText.dom -- This dump file must first be loaded into CELLAR to define the classes for the Critical Text domain.|
8. Implementation within CELLAR
This approach to importing SGML data into the CELLAR database has been implemented within the CELLAR system as a data import parser. The input to the CELLAR parser is the ESIS output file of the nsgmls parser. At the heart of the implementation is a recursive function of about 135 lines (excluding comments) that processes one element at a time from the ESIS stream. This function relies on another 125 lines of code in smaller supporting functions. The source code for this parser is listed in full and explained in an accompanying page.
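The general shape of such a recursive function can be suggested with a short Python sketch. It is not the CELLAR source code, which is given on the accompanying page; it only builds a nested dictionary rather than database objects. It relies on the line-oriented output format of nsgmls, in which attribute lines (A) precede the start of their element, ( and ) lines open and close an element, and - lines carry character data. Only the \n escape in data is decoded here, and all other line types are ignored. The function names are invented for the sketch.

```python
from typing import Any, Dict, Iterator


def read_attribute(rest: str, pending: Dict[str, str]) -> None:
    """Record one 'Aname type value' line; IMPLIED attributes have no value."""
    name, kind, *value = rest.split(" ", 2)
    if kind != "IMPLIED":
        pending[name] = value[0] if value else ""


def parse_element(lines: Iterator[str], gi: str, attrs: Dict[str, str]) -> Dict[str, Any]:
    """Consume lines up to the matching ')' line and return one element."""
    node: Dict[str, Any] = {"gi": gi, "attrs": attrs, "children": []}
    pending: Dict[str, str] = {}  # attributes waiting for the next start line
    for line in lines:
        code, rest = line[:1], line[1:]
        if code == "A":
            read_attribute(rest, pending)
        elif code == "(":
            # Recurse: the child consumes lines from the same iterator.
            node["children"].append(parse_element(lines, rest, pending))
            pending = {}
        elif code == "-":
            node["children"].append(rest.replace("\\n", "\n"))
        elif code == ")":
            return node
    raise ValueError(f"element {gi} was never closed")


def parse_esis(text: str) -> Dict[str, Any]:
    """Parse ESIS text and return the document element."""
    lines = iter(text.splitlines())
    pending: Dict[str, str] = {}
    for line in lines:
        if line.startswith("A"):
            read_attribute(line[1:], pending)
        elif line.startswith("("):
            return parse_element(lines, line[1:], pending)
    raise ValueError("no element found in ESIS input")


sample = "AARCH TOKEN OBJECT\n(DOC\nAID IMPLIED\n(P\n-Hello\\nworld\n)P\n)DOC\nC\n"
tree = parse_esis(sample)
```

Because every call shares one line iterator, each recursive call picks up exactly where its parent left off, which is what allows a single function to process one element at a time from the stream.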
The full implementation is included in the CELLAR knowledge base that is shipped with the version 2.0 release of LinguaLinks. Download SGML97Patches.dom for patches to the 2.0 release. To date these updates include:
- Support for setting the encoding of Text, not just String
To load the patches, open a CELLAR object inspector on the RootFolder. Give the Dump / Load After Selection command and specify SGML97Patches.dom as the file to load. This is a DomainModel that loads the updates as relatedObjects. Once it has loaded, the patches are installed and you may delete the SGML97Patches DomainModel from the RootFolder.
9. Conclusion
The results to date have been promising. The goal of developing a general solution to the problem of importing SGML data into an existing object database schema has been achieved. Because the method permits superfluous markup to be ignored and unmappable elements to be discarded altogether, it is always possible to achieve a translation from an SGML file into a structure of objects in the database. The usefulness of the result depends on the degree of congruence between the conceptual model of the markup for the source data in SGML and that of the schema for the target object database.
[Bor85] Borgida, A. (1985) Features of languages for the development of information systems at the conceptual level. IEEE Software 2(1): 63-72.
[Cat97] Cattell, R.G.G., et al. (1997) The Object Database Standard 2.0. San Francisco: Morgan Kaufmann.
[Cla97] Clark, J. (1997) SP: An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language, version 1.2. <http://jclark.com/sp/>. See especially "Architectural form processing," <http://jclark.com/sp/archform.htm>.
[DD94] DeRose, S. and Durand, D. (1994) Making Hypermedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. See especially pages 79-90.
[ISO92] International Organization for Standardization. (1992) ISO/IEC 10744. Hypermedia/Time-based Structuring Language: HyTime.
[ISO97] International Organization for Standardization. (1997) Architectural Form Definition Requirements (AFDR), Annex A.3 of ISO/IEC N1920, Information Processing--Hypermedia/Time-based Structuring Language (HyTime), Second edition 1997-08-01. <http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html>.
[RST93] Rettig, M., Simons, G., and Thomson, J. (1993) Extended Objects. Communications of the ACM 36(8):19-24.
[Sim97a] Simons, G. (1997) Conceptual modeling versus visual modeling: a technological key to building consensus. Computers and the Humanities 30(4):303-319. See the longer working paper version at <http://www.sil.org/cellar/ach94/ach94.html>.
[Sim97b] Simons, G. (1997) Using architectural forms to map SGML data into an object-oriented database, in Proceedings of SGML/XML '97, Washington, D. C., 8-11 December 1997. See <http://www.gca.org/conf/sgml97/> for conference information.
[Sim97c] Simons, G. (1997) Using architectural forms to map TEI data into an object-oriented database, in TEI Tenth Anniversary Users' Conference: Conference Abstracts, Providence, R.I., 14-16 November 1997. See <http://www.stg.brown.edu/webs/tei10/> for conference information.
[ST97] Simons, G., and Thomson, J. (in press) Multilingual data processing in the CELLAR environment. To appear in John Nerbonne (ed.), Linguistic Databases. Stanford, CA: Center for the Study of Language and Information. (The original working paper is available at <http://www.sil.org/cellar/mlingdp/mlingdp.html>.)
[TEI94] Sperberg-McQueen, C. M. and Burnard, L. (1994) Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative. <http://www-tei.uic.edu/orgs/tei/p3/elect.html>.
Document date: 12-Nov-1997