4.7 File formats

[ Guide contents | Chapter contents | Next section: 4.8 Trace Formats | Previous section: 4.6 Alphabetic List of Commands ]

4.7.1 Rules file

4.7.2 Lexicon files

4.7.3 Grammar file

4.7.4 Generation comparison file

4.7.5 Recognition comparison file

4.7.6 Pairs comparison file

4.7.6A Synthesis comparison file

4.7.7 Generation file

4.7.8 Recognition file

4.7.8A Synthesis file

4.7.9 Summary of default file names and extensions

Figure 4.1 Structure of the rules file

Figure 4.2 A sample rules file

Figure 4.3 Structure of the main lexicon file

Figure 4.4 A sample main lexicon file

Figure 4.5 Structure of a lexical entry

Figure 4.6 A sample lexical entry

Figure 4.7 Structure of the grammar file

Figure 4.8A A lexical rule example

Figure 4.8B Feature structure before application of lexical rule

Figure 4.8C Feature structure after application of lexical rule

Figure 4.9 A sample grammar file

Figure 4.10 A sample generation comparison file

Figure 4.11 A sample recognition comparison file

Figure 4.12 A sample pairs comparison file

Figure 4.12A A sample synthesis comparison file

Figure 4.13 A sample generation file

Figure 4.14 A sample recognition file

Figure 4.14A A sample synthesis file

Figure 4.15 Default file names and extensions


This section describes the formats for the files that are used as input to PC-KIMMO. In any of the files, comments can be added to any line by preceding the comment with the comment character. This character is normally a semicolon (;), but can be changed with the COMMENT keyword in the rules file. Anything following a comment character (until the end of the line) is considered part of the comment and is ignored by PC-KIMMO.
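
As a rough illustration of the comment convention, the following Python sketch strips a trailing comment from one line. The function name strip_comment and its comment_char parameter are hypothetical, and the sketch assumes the comment character is never quoted or escaped; it is not PC-KIMMO's own code.

```python
def strip_comment(line: str, comment_char: str = ";") -> str:
    """Return the line with any trailing comment and trailing space removed."""
    # Everything from the first comment character to the end of the line
    # is treated as a comment and dropped.
    head, _, _ = line.partition(comment_char)
    return head.rstrip()


print(strip_comment("NULL 0   ; the null symbol"))  # prints "NULL 0"
```

Passing a different comment_char mimics the effect of a COMMENT declaration in the rules file.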

In the descriptions below, reference to the use of a space character implies any whitespace character (that is, any character treated like a space character). The following control characters when used in a file are whitespace characters: ^I (ASCII 9, tab), ^J (ASCII 10, line feed), ^K (ASCII 11, vertical tab), ^L (ASCII 12, form feed), and ^M (ASCII 13, carriage return).

The control character ^Z (ASCII 26) cannot be used because MS-DOS interprets it as marking the end of a file. Also the control character ^@ (ASCII 0, null) cannot be used.

Examples of each of the following file types are found on the release diskette as part of the English description.

4.7.1 Rules file

The general structure of the rules file is a list of keyword declarations. Figure 4.1 shows the conventional structure of the rules file. Note that the notation {x | y} means either x or y (but not both). The following specifications apply to the rules file.

Figure 4.1 Structure of the rules file

COMMENT <character>
ALPHABET <symbol list> 
NULL <character> 
ANY <character>
BOUNDARY <character>
SUBSET <subset name> <symbol list>
. (more subsets)
.
. 
RULE <rule name> <number of states> <number of columns>
 <lexical symbol list>
 <surface symbol list> 
<state number>{: | .} <state number list>
  . (more states)
  . 
  . 
. (more rules)
.
. 
END

Figure 4.2 shows a sample rules file.

Figure 4.2 A sample rules file

ALPHABET
  b c d f g h j k l m n p q r s t v w x y z +    ; + is morpheme boundary
  a e i o u
NULL 0
ANY  @
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets

RULE "Consonant defaults"  1 23
   b c d f g h j k l m n p q r s t v w x y z + @
   b c d f g h j k l m n p q r s t v w x y z 0 @
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

RULE "Vowel defaults"  1 6
   a e i o u @
   a e i o u @
1: 1 1 1 1 1 1

RULE "Voicing s:z <=> V___V" 4 4
   V s s @
   V z @ @
1: 2 0 1 1
2: 2 4 3 1
3: 0 0 1 1
4. 2 0 0 0

; more rules

END
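
To make the state table notation more concrete, here is a minimal Python sketch that runs the "Voicing s:z <=> V___V" table from figure 4.2 over an aligned lexical/surface pair. It rests on assumptions this section does not spell out: that state 0 means failure, that a colon marks a final state and a period a nonfinal one, and that a pair selects the most specific matching column. The scoring used to rank columns (literal over subset over ANY) is a simplification, and all names are hypothetical.

```python
# Sketch only: runs the "Voicing s:z <=> V___V" table from figure 4.2.
SUBSETS = {"V": set("aeiou")}
ANY = "@"
# Column headers as (lexical, surface) pairs, left to right.
COLUMNS = [("V", "V"), ("s", "z"), ("s", "@"), ("@", "@")]
# state -> (is_final, next state per column); 0 means the rule fails.
TABLE = {
    1: (True, [2, 0, 1, 1]),
    2: (True, [2, 4, 3, 1]),
    3: (True, [0, 0, 1, 1]),
    4: (False, [2, 0, 0, 0]),  # written "4." in the file: assumed nonfinal
}


def side_score(symbol, char):
    """Score one side of a column header against a character.

    Assumed ranking: literal (2) > subset (1) > ANY (0); None means no match.
    """
    if symbol == ANY:
        return 0
    if symbol in SUBSETS:
        return 1 if char in SUBSETS[symbol] else None
    return 2 if symbol == char else None


def pick_column(lex, surf):
    """Return the index of the most specific column matching lex:surf."""
    best, best_score = None, -1
    for index, (lex_sym, surf_sym) in enumerate(COLUMNS):
        a, b = side_score(lex_sym, lex), side_score(surf_sym, surf)
        if a is None or b is None:
            continue
        if a + b > best_score:
            best, best_score = index, a + b
    return best


def accepts(lexical, surface):
    """Check an aligned lexical/surface pair of equal length against the table."""
    if len(lexical) != len(surface):
        raise ValueError("forms must be aligned symbol by symbol")
    state = 1
    for lex, surf in zip(lexical, surface):
        column = pick_column(lex, surf)
        if column is None:
            return False
        state = TABLE[state][1][column]
        if state == 0:
            return False
    return TABLE[state][0]


print(accepts("kaso", "kazo"))  # True: s is realized as z between vowels
print(accepts("kaso", "kaso"))  # False: s stays s between vowels
```

The made-up forms kaso and kazo are only test strings. A fuller sketch would also handle the NULL symbol so that lexical and surface forms of different lengths can be aligned.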

4.7.2 Lexicon files

A lexicon consists of one main lexicon file plus one or more files of lexical entries. The general structure of the main lexicon file is a list of keyword declarations. The set of valid keywords is ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END. Figure 4.3 shows the conventional structure of the lexicon file. The following specifications apply to the main lexicon file.

Figure 4.3 Structure of the main lexicon file

ALTERNATION <alternation name> <sublexicon name list>
. (more ALTERNATIONs)
.
.
FEATURES <feature abbreviation list>

FIELDCODE <lexical item code> U
FIELDCODE <sublexicon code>  L
FIELDCODE <alternation code>  A
FIELDCODE <features code>  F
FIELDCODE <gloss code>  G

INCLUDE <filespec>
. (more INCLUDEd files)
.
.
END

Figure 4.4 shows a sample main lexicon file.

Figure 4.4 A sample main lexicon file

ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX

FEATURES sg pl reg irreg

FIELDCODE  lf   U   ;lexical item
FIELDCODE  lx   L   ;sublexicon
FIELDCODE  alt  A   ;alternation
FIELDCODE  fea  F   ;features
FIELDCODE  gl   G   ;gloss

INCLUDE affix.lex    ;file of affixes
INCLUDE noun.lex     ;file of nouns
INCLUDE verb.lex     ;file of verbs
INCLUDE adjectiv.lex ;file of adjectives
INCLUDE adverb.lex   ;file of adverbs

END
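
The declarations in a main lexicon file can be collected with a few lines of Python. The sketch below is illustrative only: parse_main_lexicon is a hypothetical name, and it assumes that each declaration fits on a single line and that keywords are written in upper case as in figure 4.4.

```python
def parse_main_lexicon(text, comment_char=";"):
    """Collect the declarations of a main lexicon file into a dictionary."""
    result = {"alternations": {}, "features": [], "fieldcodes": {}, "includes": []}
    for raw in text.splitlines():
        # Drop the comment, then split on any whitespace.
        words = raw.partition(comment_char)[0].split()
        if not words:
            continue
        keyword, args = words[0], words[1:]
        if keyword == "ALTERNATION":
            # First argument is the alternation name, the rest are sublexicons.
            result["alternations"][args[0]] = args[1:]
        elif keyword == "FEATURES":
            result["features"].extend(args)
        elif keyword == "FIELDCODE":
            # e.g. "FIELDCODE lf U" maps the marker \lf to internal code U.
            result["fieldcodes"][args[0]] = args[1]
        elif keyword == "INCLUDE":
            result["includes"].append(args[0])
        elif keyword == "END":
            break
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return result
```

Applied to the text of figure 4.4, the result would list three alternations, four feature abbreviations, five field codes, and five included files.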

Figure 4.5 shows the structure of a lexical entry. Lexical entries are encoded in "field-oriented standard format." Standard format is an information interchange convention developed by the Summer Institute of Linguistics. It tags the kinds of information in ASCII text files by means of markers which begin with backslash. Field-oriented standard format (FOSF) is a refinement of standard format geared toward representing data which has a database-like record and field structure. The following points provide an informal description of the syntax of FOSF files.

Figure 4.5 Structure of a lexical entry

\<lexical item code> <lexical item>
\<sublexicon code> <sublexicon name>
\<alternation code> {<alternation name> | <BOUNDARY symbol>}
\<features code> <features list>
\<gloss code> <gloss string>

The following specifications apply to how FOSF is implemented in PC-KIMMO.

A file of lexical entries is loaded by using an INCLUDE declaration in the main lexicon file (see above). An INCLUDEd file of lexical entries cannot contain any declarations (such as a FIELDCODE or an INCLUDE declaration), only lexical entries and comment lines.

The following specifications apply to lexical entries.

Figure 4.6 shows a sample lexical entry.

Figure 4.6 A sample lexical entry

\lf  `knives
\lx  N
\alt Infl
\fea pl irreg
\gl  N(`knife)+PL
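
A lexical entry in this format is easy to read mechanically. The following Python sketch turns one entry into a dictionary keyed by the internal field codes (U, L, A, F, G). The names FIELDCODES and parse_entry are hypothetical, the marker-to-code mapping is taken from figure 4.4, and silently ignoring unknown markers is an assumption rather than documented behavior.

```python
# Marker-to-code mapping as declared by the FIELDCODE lines in figure 4.4.
FIELDCODES = {"lf": "U", "lx": "L", "alt": "A", "fea": "F", "gl": "G"}


def parse_entry(text, fieldcodes=FIELDCODES):
    """Turn one lexical entry into a dict keyed by internal field code."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a field
        parts = line[1:].split(None, 1)
        if not parts:
            continue  # a lone backslash carries no marker
        marker = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if marker in fieldcodes:
            entry[fieldcodes[marker]] = value
    return entry


sample = "\\lf  `knives\n\\lx  N\n\\alt Infl\n\\fea pl irreg\n\\gl  N(`knife)+PL"
print(parse_entry(sample))
```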

4.7.3 Grammar file

The grammar file consists of feature templates, context-free rules, and feature constraints. Figure 4.7 shows the conventional structure of the grammar file.

Figure 4.7 Structure of the grammar file

LET <abbreviation | category> be <feature definition>
. (more feature templates)
.
.
DEFINE <lexical rule name> as <mappings>
. (more lexical rules)
.
.
PARAMETER <parameter name> is <parameter value>
. (more parameter settings)
.
.
RULE <rule>
 <feature constraint>
 . (more constraints)
 .
 .
(more rules)
.
.
.
END
The following specifications apply generally to the grammar file.

Rules

The following specifications apply to rules.

A grammar rule has these parts, in the order listed:

  1. the keyword Rule

  2. an optional rule identifier enclosed in braces ({})

  3. the nonterminal symbol to be expanded

  4. an arrow (->) or equal sign (=)

  5. zero or more terminal or nonterminal symbols, possibly marked for alternation or optionality

  6. an optional colon (:)

  7. zero or more feature constraints, possibly marked for alternation

  8. an optional period (.)

The optional rule identifier (item 2) consists of one or more words enclosed in braces. Its current utility is only as a special form of comment describing the intent of the rule. (Eventually it may be used as a tag for interactively adding and removing rules.) The only limits on the rule identifier are that it not contain the comment character and that it all appears on the same line in the grammar file.

The terminal and nonterminal symbols in the rule have the following characteristics:

The symbols on the right hand side of a context-free rule may be marked or grouped in various ways:

Feature structures

The grammar formalism uses a basic element called a feature structure. A feature structure consists of a feature name and a value. The notation used for feature structures looks like this:
 [number: singular]
where number is the feature name and singular is the value, separated by a colon. Feature names and values are single words consisting of alphanumeric characters or other characters except (){}[]<>=:$! (these are used for special purposes in the grammar file). Upper and lower case letters used in feature names and values are considered different. For example, "NUMBER" is not the same as "Number" or "number."

A structure containing more than one feature uses square brackets around the entire structure:

 [number: singular
   case:  nominative]
Extra spaces and line breaks are optional.

Feature structures can have either simple values, such as the example above, or complex values, such as this:

 [agreement: [number: singular
             case:  nominative]]
where the value of the agreement feature is another feature structure. Feature structures can be infinitely nested in this manner.

Features can share values. This is not the same thing as two features having identical values. In the first example below, the features a and b have identical values; but in the second example, they share the same value:

  [a: [p:q]
   b: [p:q]]

  [a: $1[p:q]
   b: $1]
Shared values are indicated by coindexing them with the prefix $1, $2, and so on.
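
The difference between identical and shared values can be illustrated with ordinary Python dictionaries, where two equal objects play the role of identical values and one object referenced twice plays the role of a coindexed value. This is only an analogy for the notation; the variable names are invented and the extra feature r is hypothetical.

```python
# Identical values: two separate objects that happen to look the same.
identical = {"a": {"p": "q"}, "b": {"p": "q"}}

# Shared value: one object reachable through both features (like $1).
shared_value = {"p": "q"}
shared = {"a": shared_value, "b": shared_value}

# Adding information under a affects b only when the value is shared.
identical["a"]["r"] = "s"
shared["a"]["r"] = "s"

print(identical["b"])  # {'p': 'q'}
print(shared["b"])     # {'p': 'q', 'r': 's'}
```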

Portions of a feature structure can be referred to using the "path" notation. A path is a sequence of feature names (minimally one) enclosed in angled brackets (<>). For example, consider this feature structure:

 [agreement: [number: singular
             case: nominative]]
These are feature paths based on this structure:
 <number>
 <case>
 <agreement number>
 <agreement case>
Paths are used in feature templates and feature constraints, described below. All lexical items used by the grammar are assigned three features: cat, lex, and gloss. These should be treated as reserved names and not used for other purposes.

For example, here is a lexical entry for the word fox:
 \lf `fox
 \lx N
 \alt Stem
 \gl N(fox)
When this entry is used by the grammar, it is represented as this feature structure:
 [cat: N
  lex: `fox
  gloss: N(fox)]
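
The mapping from a lexical entry to its initial feature structure, and the lookup of a value by path, might be sketched in Python as follows. The function names are hypothetical, feature structures are modeled as nested dictionaries, and the sketch assumes (as the fox example suggests) that cat comes from the sublexicon field, lex from the lexical item field, and gloss from the gloss field.

```python
def entry_to_features(entry):
    """Build the cat/lex/gloss structure from a parsed lexical entry."""
    return {"cat": entry["lx"], "lex": entry["lf"], "gloss": entry["gl"]}


def follow_path(structure, path):
    """Follow a path such as '<agreement number>'; return None if undefined."""
    node = structure
    for name in path.strip("<>").split():
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


fox = entry_to_features({"lf": "`fox", "lx": "N", "alt": "Stem", "gl": "N(fox)"})
agreement = {"agreement": {"number": "singular", "case": "nominative"}}

print(fox)
print(follow_path(agreement, "<agreement number>"))  # singular
```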

Feature constraints

A rule is followed by zero or more feature constraints, which refer to symbols used in the rule. The following specifications apply to feature constraints.

A feature constraint has these parts, in the order listed:

  1. a feature path that begins with one of the symbols from the context-free rule

  2. an equal sign

  3. either another path or a value

A feature constraint that refers only to symbols on the right hand side of the rule constrains their co-occurrence. In the following rule and constraint, the value of the Stem's head pos feature must unify with the value of the INFL's from_pos feature:

        Word -> Stem INFL
                <Stem head pos> = <INFL from_pos>
If a feature constraint refers to a symbol on the right hand side of the rule, and has an atomic value on its right hand side, then the designated feature must not have a different value. In the following rule and constraint, the head case feature for the PRONOUN node of the parse tree must either be originally undefined or equal to NOM:
        Word -> PRONOUN
                <PRONOUN head case> = NOM
(If the head case feature of the PRONOUN node was originally undefined, then, after unification succeeds, it will be equal to NOM.)
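
The behavior described for atomic values (succeed if the feature is undefined or already equal, fail otherwise) is what a simple unification procedure gives. The Python sketch below models feature structures as nested dictionaries; unify and UnificationFailure are hypothetical names, the value ACC in the usage note is invented for contrast, and the sketch ignores shared values and disjunction.

```python
class UnificationFailure(Exception):
    """Raised when two feature values cannot be unified."""


def unify(left, right):
    """Return the unification of two feature values without modifying them."""
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for name, value in right.items():
            # A feature missing on the left is simply filled in;
            # one present on both sides must unify recursively.
            result[name] = unify(result[name], value) if name in result else value
        return result
    if left == right:
        return left
    raise UnificationFailure(f"{left!r} conflicts with {right!r}")


# <PRONOUN head case> = NOM, written as a structure to unify with the node.
constraint = {"head": {"case": "NOM"}}

print(unify({"head": {"pos": "PRO"}}, constraint))
```

A node whose head case is already NOM unifies unchanged, while a node with a different value such as ACC raises UnificationFailure.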

A feature constraint that refers to the symbol on the left hand side of the rule passes information up the parse tree. In the following rule and constraint, the value of the head feature is passed from the INFL node up to the Word node:

        Word -> Stem INFL
                <Word head> = <INFL head>
PC-KIMMO allows disjunctive feature constraints with its phrase structure rules. Consider these two rules:
Stem_1 -> PREFIX Stem_2
        <PREFIX from_pos> = <Stem_2 head pos>
        <PREFIX change_pos> = +
        <Stem_1 head> = <PREFIX head>

Stem_1 -> PREFIX Stem_2
        <PREFIX from_pos> = <Stem_2 head pos>
        <PREFIX change_pos> = -
        <Stem_1 head> = <Stem_2 head>
These rules have the same context-free rule part. They can therefore be collapsed into this single rule, which has a disjunction in its feature constraints:
Stem_1 -> PREFIX Stem_2
        <PREFIX from_pos> = <Stem_2 head pos>
       {
        <PREFIX change_pos> = +
        <Stem_1 head> = <PREFIX head>
        /
        <PREFIX change_pos> = -
        <Stem_1 head> = <Stem_2 head>
        }
Disjunctive feature constraints may be nested up to eight levels deep.

Feature templates

The following specifications apply to feature templates.

A feature template has these parts, in the order listed:

  1. the keyword Let

  2. the template name

  3. the keyword be

  4. a feature definition

  5. an optional period (.)
If the template name is a terminal category (a terminal symbol in one of the context-free rules), the template defines the default features for that category. Otherwise the template name serves as an abbreviation for the associated feature structure. Templates may occur anywhere in the file (interspersed among the rules), but a template must occur before any rule or other template that uses the abbreviation it defines.

Template names are single words consisting of alphanumeric characters or other characters except (){}[]<>=:$! (these are used for special purposes in the grammar file). The character \ should not be used as the first character of a template name because that is how fields are marked in the lexicon file. Upper and lower case letters used in template names are considered different. For example, "PLURAL" is not the same as "Plural" or "plural."

The abbreviations defined by templates are usually used in the feature field of entries in the lexicon file. For example, the lexical entry for the irregular plural form feet may have the abbreviation pl in its features field. The grammar file would define this abbreviation with a template like this:

     Let pl be [number: PL]
The path notation may also be used:
     Let pl be <number> = PL
More complicated feature structures may be defined in templates. For example,
     Let 3sg be [tense:  PRES
                 agr:    3SG
                 finite: +
                 vform:  S]
which is equivalent to:
    Let 3sg be [<tense>  = PRES
                <agr>    = 3SG
                <finite> = +
                <vform>  = S]
In the following example, the abbreviation irreg is defined using another abbreviation:
    Let irreg be <reg> = -
                 pl
The abbreviation pl must be defined previously in the grammar file or an error will result. A subsequent template could also use the abbreviation irreg in its definition. In this way, an inheritance hierarchy of features may be constructed.

Feature templates permit disjunctive definitions. For example, the lexical entry for the word deer may specify the feature abbreviation sg/pl. The grammar file would define this as a disjunction of feature structures reflecting the fact that the word can be either singular or plural:

    Let sg/pl be {[number:SG]
                  [number:PL]}
This has the effect of creating two entries for deer, one with singular number and another with plural. Note that there is no limit to the number of disjunct structures listed between the braces. Also, there is no slash (/) between the elements of the disjunction as there is between the elements of a disjunction in the rules. A shorter version of the above template using the path notation looks like this:
    Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they have previously been defined:
    Let sg be <number> = SG
    Let pl be <number> = PL
    Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and pl; without square brackets they would be interpreted as simple values instead.

Feature templates can assign default atomic feature values, indicated by prefixing an exclamation point (!). A default value can be overridden by an explicit feature assignment. This template says that all members of category N have singular number as a default value:

    Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they are explicitly marked as plural. For example, regular nouns such as book do not need any feature in their lexical entries to signal that they are singular; but an irregular noun such as feet would have a feature abbreviation such as pl in its lexical entry. This would be defined in the grammar as [number: PL], and would override the default value for the feature number specified by the template above. If the N template above used SG instead of !SG, then the word feet would fail to parse, since its number feature would have an internal conflict between SG and PL.
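
The interaction between default and explicit values can be sketched in Python. The function apply_template is hypothetical and handles only flat, atomic features; it is meant to show the override logic described above, not how PC-KIMMO implements it.

```python
def apply_template(entry_features, template):
    """Combine explicit entry features with a category template.

    Template values written with a leading "!" are defaults: they apply only
    when the entry has no value for that feature. Other values must agree.
    """
    result = dict(entry_features)
    for name, value in template.items():
        if value.startswith("!"):
            # Default: keep any explicit value, otherwise use the default.
            result.setdefault(name, value[1:])
        elif result.setdefault(name, value) != value:
            raise ValueError(f"conflict on {name}: {result[name]} vs {value}")
    return result


print(apply_template({}, {"number": "!SG"}))                # book: {'number': 'SG'}
print(apply_template({"number": "PL"}, {"number": "!SG"}))  # feet: {'number': 'PL'}
```

With the template value written as SG instead of !SG, the second call raises ValueError, which mirrors the parse failure described for feet.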

Parameter settings

Parameter settings are used to override various default settings assumed in the grammar file. Parameter settings are optional. In the absence of a parameter setting, a default value is used. A parameter setting has these parts, in the order listed:
  1. the keyword Parameter

  2. an optional colon (:)

  3. one or more keywords identifying the parameter

  4. the keyword is

  5. the parameter value

  6. an optional period (.)
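
The six parts listed above can be matched with a single regular expression. The Python sketch below is illustrative: PARAMETER_RE and parse_parameter are hypothetical names, and treating the keywords as case-insensitive is an assumption based on the mixed spellings (Let, LET, RULE) that appear in this section.

```python
import re

# keyword, optional colon, parameter name, "is", value, optional period
PARAMETER_RE = re.compile(
    r"^\s*parameter\s*:?\s*(?P<name>.+?)\s+is\s+(?P<value>.+?)\s*\.?\s*$",
    re.IGNORECASE,
)


def parse_parameter(line):
    """Return (name, value) for a parameter setting, or None if no match."""
    match = PARAMETER_RE.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("value")


print(parse_parameter("PARAMETER Start symbol is Word"))
# ('Start symbol', 'Word')
```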

PC-KIMMO recognizes the following parameters:

Lexical rules

Lexical rules are used to modify the feature structures of lexical entries. As noted in Shieber 1985, something more powerful than just abbreviations for common feature elements is sometimes needed to represent systematic relationships among the elements of a lexicon. This need is met by lexical rules, which express transformations rather than mere abbreviations.

Lexical rules are similar to feature templates, but are more powerful. While feature templates assign a feature structure to lexical items by means of unification, lexical rules map one feature structure to another, thus transforming it. The name of a lexical rule is included in the features field of lexical entries, similar to feature abbreviations.

A lexical rule has these parts, in the order listed:

  1. the keyword Define

  2. the name of the lexical rule

  3. the keyword as

  4. the rule definition

  5. an optional period (.)
The rule definition consists of one or more mappings. Each mapping has three parts: an output feature path, an assignment operator, and the value assigned, either an input feature path or an atomic value. Every output path begins with the feature name out and every input path begins with the feature name in. The assignment operator is either an equal sign (=) or an equal sign followed by a "greater than" sign (=>). (These two operators are equivalent in PC-KIMMO, since the implementation treats each lexical rule as an ordered list of assignments rather than using unification for the mappings that have an equal sign operator.) Consider the information shown in figure 4.8A.

Figure 4.8A A lexical rule example

;lexical item
\lf `mouse
\fea irreg POS_Gloss
\gl `mouse

;feature template
LET irreg be <reg> = -

;lexical rule
DEFINE POS_Gloss as 
            <out cat>   = <in cat>
            <out head>  = <in head>
            <out lex>   = <in lex>
            <out gloss> = <in head pos>.
The feature field (\fea) of the lexical entry contains two labels: irreg is a feature abbreviation and is defined by a feature template (the LET statement), while POS_Gloss is the name of a lexical rule which is defined by the DEFINE statement.

Figure 4.8B Feature structure before application of lexical rule

[ cat:   ROOT
  head:  [ agr: [ 3sg:- ]
           number:PL
           pos:   N
           proper:-
           verbal:- ]
  reg:   -
  lex:   `mice
  gloss: `mouse ]

Figure 4.8C Feature structure after application of lexical rule

[ cat:   ROOT
  head:  [ agr: [ 3sg:- ]
           number:PL
           pos:   N
           proper:-
           verbal:- ]
  lex:   `mice
  gloss: N ]
When the lexicon entry is loaded, it is initially assigned the feature structure shown in figure 4.8B, which is the unification of the information given in the various fields of the lexicon entry, including the feature abbreviation pl. After the complete feature structure has been built, the lexical rule named POS_Gloss is applied, producing the feature structure shown in figure 4.8C. Note that the change in the value of the gloss feature from "`mouse" to "N" is done by direct mapping, not unification.

There are two important points about using lexical rules. First, the feature structure of a lexical item that has undergone a lexical rule is entirely determined by the mappings in the lexical rule. In the lexical rule in figure 4.8A, the first three mappings (for cat, head, and lex), though they seem redundant, are needed to carry over these feature values from the input feature structure to the output feature structure. Notice that the feature reg which is present in the input feature structure in figure 4.8B is absent from the output feature structure in figure 4.8C; this is due to the fact that the lexical rule which applied to the feature structure did not include a mapping for the reg feature.

Second, lexical rules apply sequentially in the order in which they are given in the grammar file.
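
The first point can be made concrete with a short Python sketch that applies a list of mappings to the structure in figure 4.8B. It assumes the four mappings the text describes (carry over cat, head, and lex, and set gloss from head pos); paths are written as tuples with the in and out prefixes left implicit, and the names get_path, apply_lexical_rule, and POS_GLOSS are hypothetical.

```python
def get_path(structure, path):
    """Return the value found by following a tuple of feature names."""
    node = structure
    for name in path:
        node = node[name]
    return node


def apply_lexical_rule(features, mappings):
    """Build a new structure containing only what the mappings assign."""
    output = {}
    for out_path, source in mappings:
        # A tuple source is an input path; anything else is an atomic value.
        value = get_path(features, source) if isinstance(source, tuple) else source
        node = output
        for name in out_path[:-1]:
            node = node.setdefault(name, {})
        node[out_path[-1]] = value
    return output


# (output path, input path) pairs, applied in order.
POS_GLOSS = [
    (("cat",), ("cat",)),
    (("head",), ("head",)),
    (("lex",), ("lex",)),
    (("gloss",), ("head", "pos")),
]

before = {
    "cat": "ROOT",
    "head": {"agr": {"3sg": "-"}, "number": "PL", "pos": "N",
             "proper": "-", "verbal": "-"},
    "reg": "-",
    "lex": "`mice",
    "gloss": "`mouse",
}
after = apply_lexical_rule(before, POS_GLOSS)
print(after["gloss"], "reg" in after)  # N False
```

Because the output starts empty, the reg feature disappears simply because no mapping mentions it, which matches the difference between figures 4.8B and 4.8C.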

Figure 4.9 shows a sample grammar file.

Figure 4.9 A sample grammar file

    ;FEATURE TEMPLATES (optional)

    ;Feature definitions
    Let pl be   <head number> = PL
    LET v/n be  <from_pos> = V
                <head pos> = N
                <head number> = !SG
    LET v\aj be <from_pos> = AJ
                <head pos> = V
    
    ;Category definitions
    Let N be  <cat> = ROOT
              <head pos> = N
              <head number> = !SG
    Let V be  <cat> = ROOT
              <head pos> = V
    Let AJ be <cat> = ROOT
              <head pos> = AJ

    ;PARAMETER SETTINGS (optional)

    PARAMETER Start symbol is Word

    ;RULES

    RULE
    Word = Stem INFL
        <Stem head pos> = <INFL from_pos>
        <Word head> = <INFL head>
    
    RULE
    Stem_1 = PREFIX Stem_2
        <PREFIX from_pos> = <Stem_2 head pos>
        <Stem_1 head> = <PREFIX head>
    
    RULE
    Stem_1 = Stem_2 SUFFIX
        <Stem_2 head pos> = <SUFFIX from_pos>
        <Stem_1 head> = <SUFFIX head>
    
    RULE
    Stem = ROOT
        <Stem head> = <ROOT head>

4.7.4 Generation comparison file

The generation comparison file serves as input to the compare generate command (see section 4.5.12). It consists of groupings of a lexical form followed by one or more surface forms that are expected to be generated from the lexical form. The following specifications apply to the generation comparison file.

Figure 4.10 shows a sample generation comparison file.

Figure 4.10 A sample generation comparison file

`trace+ed
 traced

`trace+able
 traceable

re-+`trace
 re-trace
 retrace
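
A file laid out like figure 4.10 can be read into groups with a few lines of Python. The sketch assumes that groups are separated by blank lines and that the first line of each group is the input form; parse_comparison is a hypothetical name, and the exact rules PC-KIMMO applies may differ.

```python
def parse_comparison(text, comment_char=";"):
    """Return a list of (form, expected_results) groups."""
    groups, current = [], []
    # The extra empty string flushes the last group at end of file.
    for raw in text.splitlines() + [""]:
        if raw.strip():
            line = raw.partition(comment_char)[0].strip()
            if line:  # comment-only lines do not end a group
                current.append(line)
        elif current:
            groups.append((current[0], current[1:]))
            current = []
    return groups


sample = "`trace+ed\n traced\n\n`trace+able\n traceable\n\nre-+`trace\n re-trace\n retrace\n"
for form, expected in parse_comparison(sample):
    print(form, "->", expected)
```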

4.7.5 Recognition comparison file

The recognition comparison file serves as input to the compare recognize command (see section 4.5.12). It consists of groupings of a surface form followed by one or more lexical forms that are expected to be recognized from the surface form. The following specifications apply to the recognition comparison file.

Figure 4.11 shows a sample recognition comparison file.

Figure 4.11 A sample recognition comparison file

traced
 `trace+ed     [ V(trace)+PAST ]
 `trace+ed     [ V(trace)+PAST.PRTC ]

traceable
 `trace+able     [ V(trace)+ADJR ]

retrace
 re-+`trace     [ REP+V(trace).INF ]
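
Each expected result in figure 4.11 pairs a lexical form with a gloss in square brackets. A small Python sketch for separating the two follows; RESULT_RE and split_result are hypothetical names, and the sketch assumes the lexical form contains no spaces and that the gloss, when present, is the last item on the line.

```python
import re

# lexical form, then an optional gloss enclosed in square brackets
RESULT_RE = re.compile(r"^\s*(?P<form>\S+)(?:\s+\[\s*(?P<gloss>.*?)\s*\])?\s*$")


def split_result(line):
    """Return (lexical form, gloss); gloss is None when no brackets appear."""
    match = RESULT_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    return match.group("form"), match.group("gloss")


print(split_result(" `trace+ed     [ V(trace)+PAST ]"))
# ('`trace+ed', 'V(trace)+PAST')
```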

4.7.6 Pairs comparison file

The pairs comparison file serves as input to the compare pairs command (see section 4.5.12). It consists of pairs of lexical and surface forms; that is, a lexical form followed by exactly one surface form. It is expected that the surface form will be generated from the lexical form and that the lexical form will be recognized from the surface form. Glosses do not have to be included with lexical forms, since the generator does not use the lexicon; however, including a gloss with the lexical form does no harm, since it is simply ignored. When recognizing a surface form, the lexicon is used to identify the constituent morphemes and verify that they occur in the correct order, but the gloss part of a lexical entry is not used. The following specifications apply to the pairs comparison file.

Figure 4.12 shows a sample pairs comparison file.

Figure 4.12 A sample pairs comparison file

`trace+ed
traced

`trace+able
traceable

re-+`trace
re-trace

re-+`trace
retrace

4.7.6A Synthesis comparison file

The synthesis comparison file serves as input to the compare synthesize command (see section 4.5.12). It consists of groupings of a morphological form followed by one or more surface forms that are expected to be synthesized from the morphological form. The following specifications apply to the synthesis comparison file.

Figure 4.12A shows a sample synthesis comparison file.

Figure 4.12A A sample synthesis comparison file

`trace +ED
traced

`trace +EN
traced

`trace +AJR25a
traceable

ORD5+ `trace
retrace

4.7.7 Generation file

The generation file consists of a list of lexical forms. It serves as input to the file generate command (see section 4.5.13), which returns a file (or screen display) whose format is identical to the generation comparison file. The following specifications apply to the generation file.

Figure 4.13 shows a sample generation file.

Figure 4.13 A sample generation file

`cat
`cat+s
`cat+'s
`cat+s+'s
`fox
`fox+s
`fox+'s
`fox+s+'s

4.7.8 Recognition file

The recognition file consists of a list of surface forms. It serves as input to the file recognize command (see section 4.5.14), which returns a file (or screen display) whose format is identical to the recognition comparison file. The following specifications apply to the recognition file.

Figure 4.14 shows a sample recognition file.

Figure 4.14 A sample recognition file

cat
cats
cat's
cats'
fox
foxes
fox's
foxes'

4.7.8A Synthesis file

The synthesis file consists of a list of morphological forms. A morphological form is a sequence of morpheme glosses separated by spaces. A synthesis file serves as input to the file synthesize command (see section 4.5.13), which returns a file (or screen display) whose format is identical to the synthesis comparison file. The following specifications apply to the synthesis file.

Figure 4.14A shows a sample synthesis file.

Figure 4.14A A sample synthesis file

`cat
`cat +PL
`cat +GEN
`cat +PL +GEN
`fox
`fox +PL
`fox +GEN
`fox +PL +GEN

4.7.9 Summary of default file names and extensions

Figure 4.15 summarizes the default file names and extensions assumed by PC-KIMMO. Two entries are given for the different kinds of files. The first is the name PC-KIMMO will assume if no file name at all is given to a command that expects that kind of file. The second entry (with the *) shows what extension PC-KIMMO will add if a file name without an extension is given.

Figure 4.15 Default file names and extensions

Rules file:                    RULES.RUL
                                   *.RUL
Lexicon file:                LEXICON.LEX
                                   *.LEX
Grammar file:                GRAMMAR.GRM
                                   *.GRM
Generation comparison file:     DATA.GEN
                                   *.GEN
Recognition comparison file:    DATA.REC
                                   *.REC
Pairs comparison file:          DATA.PAI
                                   *.PAI
Synthesis comparison file:      DATA.SYN
                                   *.SYN
Take file:                   PCKIMMO.TAK
                                   *.TAK
Log file:                    PCKIMMO.LOG
                                   *.LOG
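
The two defaulting rules summarized in figure 4.15 can be expressed as a small lookup. The Python sketch below is illustrative only; DEFAULTS and resolve_filename are hypothetical names, and the sketch does not model how PC-KIMMO treats directories or letter case.

```python
import os

# kind of file -> (default base name, default extension), from figure 4.15
DEFAULTS = {
    "rules": ("RULES", ".RUL"),
    "lexicon": ("LEXICON", ".LEX"),
    "grammar": ("GRAMMAR", ".GRM"),
    "generation comparison": ("DATA", ".GEN"),
    "recognition comparison": ("DATA", ".REC"),
    "pairs comparison": ("DATA", ".PAI"),
    "synthesis comparison": ("DATA", ".SYN"),
    "take": ("PCKIMMO", ".TAK"),
    "log": ("PCKIMMO", ".LOG"),
}


def resolve_filename(kind, name=None):
    """Apply the default name or extension for the given kind of file."""
    base, extension = DEFAULTS[kind]
    if not name:
        return base + extension  # no name given at all
    _, given_extension = os.path.splitext(name)
    # Add the default extension only when the name has none.
    return name if given_extension else name + extension


print(resolve_filename("rules"))               # RULES.RUL
print(resolve_filename("lexicon", "english"))  # english.LEX
```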
