PC-KIMMO: A Two-level Processor for Morphological Analysis
PC-KIMMO is a new implementation for microcomputers of a program dubbed KIMMO after its inventor Kimmo Koskenniemi (see Koskenniemi 1983). It is of interest to computational linguists, descriptive linguists, and those developing natural language processing systems. The program is designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form.
Work on PC-KIMMO began in 1985, following the specifications of the LISP implementation of Koskenniemi's model described in Karttunen 1983. The coding has been done in Microsoft C by David Smith and Stephen McConnel under the direction of Gary Simons and under the auspices of the Summer Institute of Linguistics. The aim was to develop a version of the two-level processor that would run on an IBM PC compatible computer and that would include an environment for testing and debugging a linguistic description. The PC-KIMMO program is actually a shell program that serves as an interactive user interface to the primitive PC-KIMMO functions. These functions are available as a C-language source code library that can be included in a program written by the user.
A PC-KIMMO description of a language consists of two files provided by the user:
- a rules file, which specifies the alphabet and the phonological (or spelling) rules, and
- a lexicon file, which lists lexical items (words and morphemes) and their glosses, and encodes morphotactic constraints.
The theoretical model of phonology embodied in PC-KIMMO is called two-level phonology. In the two-level approach, phonological alternations are treated as direct correspondences between the underlying (or lexical) representation of words and their realization on the surface level. For example, to account for the rules of English spelling, the surface form spies must be related to its lexical form `spy+s as follows (where ` indicates stress, + indicates a morpheme boundary, and 0 indicates a null element):
    Lexical Representation:   ` s p y + 0 s
    Surface Representation:   0 s p i 0 e s
Rules must be written to account for the special correspondences `:0, y:i, +:0, and 0:e. For example, the two-level rule for the y:i correspondence looks like this (somewhat simplified):
y:i => @:C___+:0
Notice that the environment of the rule is also specified as a string of two-level correspondences. Because two-level rules have access to both underlying and surface environments, interactions among rules can be handled without using sequential rule ordering. All of the rules in a two-level description are applied simultaneously, thus avoiding the creation of intermediate levels of derivation (an artifact of sequentially applied rules).
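The lexical/surface alignment for spies described above can be sketched in a few lines of Python. This is only an illustration of the idea of a two-level correspondence, not PC-KIMMO code; the function names `align`, `special_pairs`, and `strip_nulls` are made up for this example.

```python
# Toy illustration of a two-level correspondence (not PC-KIMMO code).
# "0" stands for the null element, as in the text.
LEXICAL = "`spy+0s"
SURFACE = "0spi0es"

def align(lexical, surface):
    """Return the list of (lexical, surface) character pairs."""
    if len(lexical) != len(surface):
        raise ValueError("the two levels must have the same length")
    return list(zip(lexical, surface))

def special_pairs(pairs):
    """Keep only the correspondences whose two sides differ."""
    return [f"{lex}:{surf}" for lex, surf in pairs if lex != surf]

def strip_nulls(form):
    """Drop null elements to get the form as it is actually written."""
    return form.replace("0", "")

pairs = align(LEXICAL, SURFACE)
print(special_pairs(pairs))                         # ['`:0', 'y:i', '+:0', '0:e']
print(strip_nulls(LEXICAL), strip_nulls(SURFACE))   # `spy+s spies
```

The four pairs printed are exactly the special correspondences that the rules must license; every other pair maps a character to itself.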
The two functional components of PC-KIMMO are the generator and the recognizer. The generator accepts as input a lexical form, applies the phonological rules, and returns the corresponding surface form. It does not use the lexicon. The recognizer accepts as input a surface form, applies the phonological rules, consults the lexicon, and returns the corresponding lexical form with its gloss. Figure 1 shows the main components of the PC-KIMMO system.
Figure 1: Main components of PC-KIMMO

        +-----------+        +-----------+
        |   RULES   |        |  LEXICON  |
        +----+------+        +------+----+
             |-------+      +-------|
                     |      |
                     v      v
Surface Form:   +------------------+     Lexical Form:
spies   ------->|    Recognizer    |---->  `spy+s
                +----+-------------+     [N(spy)+PLURAL]
                     |
                     v
                +------------------+
spies   <-------|    Generator     |<----- `spy+s
                +------------------+
The rules and lexicon are implemented computationally using finite state machines. For example, the two-level rule shown above for the y:i correspondence must be translated into the following finite state table for PC-KIMMO to use:
      |@ y + @
      |C i 0 @
    --+-------
    1:|2 0 1 1
    2:|2 3 2 1
    3.|0 0 1 0
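To see how such a table is used, here is a minimal Python sketch that steps through it one correspondence pair at a time. It is not PC-KIMMO's implementation. It assumes that state 0 means failure, that states marked with a colon are final and the state marked with a period is not, and that a pair is matched against the most specific column first. The membership of the consonant subset C is also an assumption, since the text does not define it.

```python
# State table for the simplified y:i rule, copied from the text.
# Columns are lexical:surface pairs; "@" matches any character.
COLUMNS = [("@", "C"), ("y", "i"), ("+", "0"), ("@", "@")]
TABLE = {
    1: [2, 0, 1, 1],
    2: [2, 3, 2, 1],
    3: [0, 0, 1, 0],
}
FINAL_STATES = {1, 2}   # rows written "1:" and "2:"; row "3." is nonfinal

# Assumption: the text does not list the members of the subset C.
CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def column_for(lex, surf):
    """Pick a column: an exact pair first, then @:C, then the catch-all @:@."""
    for index, column in enumerate(COLUMNS):
        if column == (lex, surf):
            return index
    if surf in CONSONANTS:
        return 0
    return 3

def accepts(pairs):
    """Run the table over a sequence of (lexical, surface) pairs."""
    state = 1
    for lex, surf in pairs:
        state = TABLE[state][column_for(lex, surf)]
        if state == 0:          # the rule is violated
            return False
    return state in FINAL_STATES

print(accepts(zip("`spy+0s", "0spi0es")))   # True:  y:i after a consonant, before +:0
print(accepts(zip("`spy", "0spi")))         # False: y:i is not followed by +:0
print(accepts(zip("yes", "ies")))           # False: y:i is not preceded by a consonant
```

State 2 records that a consonant has just been seen, and state 3 that y:i has been seen and +:0 must come next. That is how the table encodes the environment @:C___+:0.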
(Note: as of May 1991, there is a beta test version of a rule compiler available, called KGEN. See below for more information.)
Around the components of PC-KIMMO shown in figure 1 is an interactive shell program that serves as a user interface. When the PC-KIMMO shell is run, a command-line prompt appears on the screen. The user types in commands which PC-KIMMO executes. The shell is designed to provide an environment for developing, testing, and debugging two-level descriptions. Among the features available in the user shell are:
- on-line help;
- commands for loading the rules and lexicon files;
- ability to generate and recognize forms entered interactively from the keyboard;
- a mechanism for reading input forms from a test list on a disk file and comparing the output of the processor to the correct results supplied in the test list;
- provision for logging user sessions to disk files;
- a facility to trace execution of the processor in order to debug the rules and lexicon;
- other debugging facilities including the ability to turn off selected rules, show the internal representation of rules, and show the contents of selected parts of the lexicon; and
- a batch processing mode that allows the shell to read and execute commands from a disk file.
Because the PC-KIMMO user shell is intended to facilitate development of a description, its data-processing capabilities are limited. However, PC-KIMMO can also be put to practical use by those engaged in natural language processing. The primitive PC-KIMMO functions (including load rules, load lexicon, generate, recognize) are available as a source code library that can be included in another program. This means that the users can develop and debug a two-level description using the PC-KIMMO shell and then link PC-KIMMO's functions into their own programs.
Up until now, implementations of Koskenniemi's two-level model have been available only on large computers housed at academic or industrial research centers. As an implementation of the two-level model, PC-KIMMO is important because it makes the two-level processor available to individuals using personal computers. Computational linguists can use PC-KIMMO to investigate for themselves the properties of the two-level processor. Theoretical linguists can explore the implications of two-level phonology, while descriptive linguists can use PC-KIMMO as a field tool for developing and testing their phonological and morphological descriptions. Teachers of courses on computational linguistics can use PC-KIMMO to demonstrate the two-level approach to morphological parsing. Finally, because the source code for PC-KIMMO's generator and recognizer functions is made available, those developing natural language processing applications (such as a syntactic parser) can use PC-KIMMO as a morphological front end to their own programs.
(Note: as of May 1991, a program called KTEXT is available. It uses the PC-KIMMO parser and processes text, producing a morphological parse of each word in the text. See below for more information.)
PC-KIMMO will run on the following systems:
There are two versions of the PC-KIMMO release, one for IBM PC compatibles and one for the Macintosh. Each contains the executable PC-KIMMO program, examples of language descriptions, and the source code library for the primitive PC-KIMMO functions. The PC-KIMMO executable program and the source code library are copyrighted but are made freely available to the general public under the condition that they not be resold or used for commercial purposes.
It should be noted that the Macintosh version retains the DOS/UNIX command-line interface rather than using the graphical user interface one expects from Macintosh programs.
For those who wish to compile PC-KIMMO for their UNIX system, we will supply the full sources. But you must also obtain either the IBM PC version or the Macintosh version in order to get all the sample files.
The PC-KIMMO release contains the executable PC-KIMMO program, the function library, and examples of PC-KIMMO descriptions for various languages, including English, Finnish, Japanese, Hebrew, Kasem, Tagalog, and Turkish. These are not comprehensive linguistic descriptions; they cover only a selected set of data.
Versions of PC-KIMMO and KTEXT are now available for PCs with a 386 CPU (or higher). The main advantage of these 386 versions is that they use all available extended/expanded memory and also use virtual memory. This means that you can load a lexicon that is larger than 640K (such as Englex, described below). These 386 versions support VCPI-compliant memory managers such as MS-DOS 5.0's EMM386 and Quarterdeck's QEMM386. They do not support Microsoft Windows.
KGEN, a rule compiler for PC-KIMMO, was written by Nathan Miles of Ohio State University.
KGEN takes a two-level rule like this:
y:i => @:C___+:0
and translates it into a finite state table like this:
        @ y + @
        C i 0 @
     1: 2 0 1 1
     2: 2 3 2 1
     3. 0 0 1 0
KGEN accepts as input a file of two-level rules and produces as output a file of state tables that is identical in format to PC-KIMMO's rules file. Anything that KGEN does not correctly handle can be easily fixed by hand in its output file. KGEN runs under MS-DOS, UNIX, and Macintosh.
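A program that consumes such a table has to read it back into a data structure. The following Python sketch parses the bare table shown above. It is a hypothetical helper, not part of KGEN or PC-KIMMO, and it covers only the rows shown here, not whatever header lines a complete rules file contains. It assumes that a state number followed by a colon is a final state and one followed by a period is nonfinal.

```python
TABLE_TEXT = """\
    @ y + @
    C i 0 @
 1: 2 0 1 1
 2: 2 3 2 1
 3. 0 0 1 0
"""

def parse_table(text):
    """Parse a bare state table into (columns, transitions, final_states)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    lexical, surface = rows[0], rows[1]
    columns = list(zip(lexical, surface))       # e.g. ("y", "i") for y:i
    transitions, final_states = {}, set()
    for label, *targets in rows[2:]:
        state = int(label[:-1])                 # "2:" -> 2, "3." -> 3
        if label.endswith(":"):                 # colon marks a final state
            final_states.add(state)
        if len(targets) != len(columns):
            raise ValueError(f"state {state}: wrong number of columns")
        transitions[state] = [int(t) for t in targets]
    return columns, transitions, final_states

columns, transitions, final_states = parse_table(TABLE_TEXT)
print(columns)          # [('@', 'C'), ('y', 'i'), ('+', '0'), ('@', '@')]
print(transitions[2])   # [2, 3, 2, 1]
print(final_states)     # {1, 2}
```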
KTEXT is a new text-processing application that uses the PC-KIMMO parser. It accepts as input a text in orthographic form, tokenizes it into words, strips off and saves punctuation, capitalization, white space, and formatting codes, parses each word, and outputs the result to a quasi-database file with a record for each word. Its output data structures are suitable for further processing by other programs, such as a text interlinearizer, a syntactic parser, or a machine translation system. KTEXT is available for MS-DOS, Macintosh, and UNIX.
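The preprocessing steps just described (tokenize, set aside punctuation and white space, record capitalization) can be sketched as follows. This is not KTEXT's code, and the record fields `before`, `capital`, and `word` are invented for the example; KTEXT's actual output format is not shown in this document. Formatting codes are not handled, and only initial capitalization is remembered, so an all-capitals word would not be restored exactly.

```python
import re

# Non-word material (punctuation, white space) followed by one word.
WORD = re.compile(r"(\W*)(\w+)")

def tokenize(text):
    """Split text into one record per word, saving what was stripped off."""
    records, end = [], 0
    for match in WORD.finditer(text):
        before, word = match.groups()
        records.append({
            "before": before,                # punctuation and white space
            "capital": word[:1].isupper(),   # was the first letter upper case?
            "word": word.lower(),            # form that would go to the parser
        })
        end = match.end()
    return records, text[end:]               # text[end:] follows the last word

def restore(records, trailer):
    """Rebuild the original text from the records (initial capitals only)."""
    parts = []
    for rec in records:
        word = rec["word"]
        if rec["capital"]:
            word = word[:1].upper() + word[1:]
        parts.append(rec["before"] + word)
    return "".join(parts) + trailer

records, trailer = tokenize('The spies, "Fly!"')
print([rec["word"] for rec in records])   # ['the', 'spies', 'fly']
print(restore(records, trailer))          # The spies, "Fly!"
```

Keeping the stripped material alongside each word is what lets a later program, such as an interlinearizer, reassemble the text around the parsed words.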
Englex is a morphological parsing lexicon of English intended for use with PC-KIMMO and/or KTEXT. Its 20,000 entries consist of affixes, roots, and indivisible stems. Both inflectional and derivational morphology are analyzed. Englex will run under UNIX, Macintosh, or MS-DOS (the files are plain ASCII and are identical for all three versions). Because of memory requirements, to run Englex under MS-DOS you will need a 386 CPU and the new 386 versions of PC-KIMMO and KTEXT. These 386 versions will use all available extended/expanded memory and virtual memory. They support VCPI-compliant memory managers such as DOS 5.0's EMM386 and Quarterdeck's QEMM. They do not support (or need) Windows.
For those who would like to try out PC-KIMMO for free, it is available for download at the PC-Parse download archives.
KGEN is available for downloading:
Englex is available for downloading.
Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16. Dallas, TX: Summer Institute of Linguistics. ISBN 0-88312-639-7, 273 pages, paperbound.
____. 1993. Glossing text with the PC-KIMMO morphological parser. Computers and the Humanities 26:475-484.
Karttunen, Lauri. 1983. KIMMO: a general morphological processor. Texas Linguistic Forum 22:163-186.
Koskenniemi, Kimmo. 1983. Two-level morphology: a general computational model for word-form recognition and production. Publication No. 11. University of Helsinki: Department of General Linguistics.
Miles, Nathan L. 1991. Automatic generation of two-level FSM tables. M.A. thesis, Ohio State University. [Description of the KGEN rule compiler.]
Sproat, Richard. 1991. Review of "PC-KIMMO: a two-level processor for morphological analysis" by Evan L. Antworth. Computational Linguistics 17.2:229-231.
Document date: 27-Mar-1996