OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Usage

htwolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]

Parameter name Meaning
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
test Toggle test-mode. If this parameter is present, the rules won't be compiled, but tested instead.
no-report Don't warn about conflicts between rules. If omitted, all rule-conflicts will give a warning.
resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.

Outline

Terms and concepts:

input string
the string to be transformed by a FST (in Xerox terminology upper string; in SFST terminology analysis string, sometimes the deep string)
output string
the string into which the FST transforms the input string (in Xerox terminology lower string; in SFST terminology surface string)
set of characters
a set of characters (in SFST terminology range but the word "range" would imply the inclusion of all members between the two extremes)
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b

Getting the Program

There is now a binary of the program available, as well as the source-code, on the Source code page on the hfst webpage (on the webpage of the Department of General Linguistics in Helsinki University).

The program may be downloaded from the CVS-repository on Corpus. The path of the CVS-repository is /c/appl/ling/koskenni/cvsrepo/. Currently hfst-twolc is in the directory htwolc, which contains the files

-rw-r--r--  1 silfverb kikosken  6923 16. heinä  17:40 commandline.h
drwxr-xr-x  2 silfverb kikosken  1024 18. heinä  18:31 CVS
-rw-r--r--  1 silfverb kikosken 11239 16. heinä  17:40 htwolc.yy
-rw-r--r--  1 silfverb kikosken   849 13. kesä   12:43 Makefile
-rw-r--r--  1 silfverb kikosken  5888 11. heinä  14:58 muutokset
-rw-r--r--  1 silfverb kikosken 25506 16. heinä  17:40 operations.C
-rw-r--r--  1 silfverb kikosken 13145 16. heinä  17:40 operations.h
-rw-r--r--  1 silfverb kikosken    45  6. touko  23:42 README
-rw-r--r--  1 silfverb kikosken   443  4. heinä  17:34 test_file
-rw-r--r--  1 silfverb kikosken  6955 10. heinä  18:41 tokenizer.ll

Dependencies

You should have hfst installed.

Installing the Program

If you're working on corpus, make should be sufficient. You may need to modify the variable

HFSTPATH=../hfst/
depending on, where you've got hfst installed.

Syntax

A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.

Alphabet

! The alphabet should contain all pairs used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;

Sets
Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
Vowel = a e i o u y å ä ö ;

Definitions

ClosedSyllable = :Vowel+ [ ~:Vowel ]+ ;

Rules

! input/output -pairs for testing the rule-set:

!input:  k a N p a n 
!output: k a m m a n

!input:  k a N T a n 
!output: k a n n a n

!input:  k a m p i 
!output: k a m p i

"N:m before input-character p"
! A common morpho-phonetic phenomenon
N:m <=> _ p: ;

"Degradation of p to m after input-character N"
p:m <=> N: _ ;

The rules in the example grammar are from Karttunen 1992. Many of the examples in this manual are taken either from Karttunen 1992 or Karttunen and Koskenniemi 1987.

Regular Expression Syntax

Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

  • a:? and a: match any pair in the alphabet having input-character a.
  • ?:a and :a match any pair in the alphabet having output-character a.
  • ? matches any pair in the alphabet.
  • ?:? same as ?.
  • 0 matches the empty string.

By concatenating pairs, one can build longer regular expressions matching pairs of strings. If the alphabet is declared

Alphabet
a N:n N:m e
then the regular expression a N: e will match the pairs of strings a N:n e and a N:m e.

Regular expressions can be grouped together using the parenthesis-constructions [ ... ] ans ( ... ). If R is a regular expression, then [ R ] matches exactly the same pairs of string as R does. The construction ( R ), on the other hand, matches the empty string, as well.

Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.

There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be [ a N:n N:m o] and let R denote a regular expression. The unary operators are:

  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, with the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the mathcing strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* with the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular-expression R contains all strings not matched by R.
  • The Kleene-star *. The language R* matches any string, that is a concatenation of any number of string from R. Note that the empty string, which is the concatenation of zero strings also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator resembles the *, but it only matches strings, which are concatenation of a positive number of strings from R. Consequently R+ matches the empty string, iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.

In addition to the unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, when using the disjunction-operation |, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.

Let R and S be regular expressions. The binary operators are:

  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators would bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.

Operator Precedence

The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.

Operators ordered by precedence from strongest to weakest:

  1. Unary operators: ^INTEGER, $, $., \, ~, *, +
  2. Concatenation
  3. Binary operators: |, & -

E.g. ~a^3 b | c d* is interpreted as

[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]

The Alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs consisting of a input-character and a output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.

Every pair of character referred to in some of the rules, has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs, that aren't declared in the alphabet.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

The Sets

The second part of the grammar specifies named character-ranges like

Vowel  = a e i o u y å ä ö ;

Sets may be used in rules as a short-hand for collections of character-pairs. Perhaps one might want write a rule, which states, that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by a rule

t:&#1257; <= :Vowel _ :Vowel ;

The Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.

The Rules

Ordinary Two-Level Rules

Two-level rules consist of a center, a rule-operator and contexts. The center is a pair of characters, and a context consists of two regular expressions separated by an underscore. Schematically

a:b OP L1 _ R1 ;
       L2 _ R1 ;
         ...
       Ln _ Rn ;

A rule has to have at least one context and it may have as many as are needed.

The rules are constraints, regulating the distribution of the center-pairs according to the rule-operator and contexts given. Four different kinds of rules-operators may be used in hfst-twolc

<=, =>, <=> and \<=
The final context, which is compiled into the transducer representing the two-level rule is the union of the contexts given.

Right-arrow rules constrain the distribution of a symbol-pair by specifying, that it may only occur in a specific context (or some specific contexts). Let the set V be the set of vowels in some language. An example of a right-arrow rule is

I:j => :V _ :V ;
It states, that the input-character I can be realised as j only in a contex, where it is surrounded by output vowels, i.e. that the occurrence of the pair I:j is limited to positions between surface-vowels.

The context :V _ :V in the example is automatically extended to a so called total context, by hfst-twolc. This means that, when the rule is compiled, the context will become ?* :V _ :V ?*. This applies to all kinds of rule-operators.

Left-arrow rules constrain the set of output-characters corresponding to an input-character in some context. An example of a left-arrow rule is

N:m <= _ p: ;
It states, that an input-character N has to be realized as the output-character m if it is followed by some pair with input-character p.

Left-arrow rules differ from right-arrow rules, because they are asymmetric with regard to the input- and output-level of pair-strings. The right-arrow example above, doesn't limit the input-character of a pair preceding p:, it only limits the output-character, if the input-character is N. Such an asymmetry is not present in left-arrow rules, which limit a particular pair into a particular kind of context.

Left/right -arrow rules, give a necessary and sufficient conditions for the realization of an input-character as some output-character. An example of a left/right -arrow rule is

k:' <=> :a :a _ :a ;
which states, that k:' is realized as ' exactly in contexts where two output a phones precede and one follows (this describes a convention of Finnish orthography stemming from consonant degradation). Any left/right arrow rule is equivalent to the joined effect of the corresponding left- and right-arrow rules. Hence the example is equivalent to the pair of rules
k:' <= :a :a _ :a ;
and
k:' => :a :a _ :a ;

Prohibition rules disallow the realization of an input-character as some output-character in some contexts. Let again V denote the set of vowels. An example of a prohibition rule is

I:i \<= :V _ :V ;
which states, that the input-character I may not be realized as i between output-vowels.

Like right-arrow rules, prohibition rules are symmetric with respect to the input- and output-level of pair-strings. In fact it is often possible to state a particular constraint both as a prohibition rule concerning some pair and a left-arrow rule concerning an other. If the input-character I may only be realized as i or j, then the rules

I:i \<= :V _ :V ;
and
I:j => :V _ :V ;
state the exactly same constraint. Still, if the number of realizations is greater, it may be much easier to state the constraint using one of the operators than the other.

Generalized Context-Restrictions

Special Rule-Constructs

Error-Messages and Warnings

Warnings

The following is an example of a wring given by hfst-twolc

WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
  ^ HERE
The program attempts to report the number of the line, which gives the warning and also point to the place, which gives the warning. Note, that the place and line given may not be accurate. When they're not, the problem is often on the previous line.

The number [1] means, that this is a warning of type 1. There are 6 types of warnings. These are

  1. E.g. the grammar
    Alphabet
            a b c ;
    
    Rules
    
    "R1"
    a:b <= c _ ;
    gives a type 1 warning
    WARNING! LINE 7:
    [1] The pair a:b wasn't declared in the alphabet!
    a:b <= c _ ;
    The pair a:b was used, but not declared in the alphabet. Since only pairs, that have been declared in the alphabet match ?, failing to declare a:b would probably prohibit all pair-strings containing a:b, if the grammar has more than one rule.
  2. (PARTIALLY UNIMPLEMENTED) E.g. the grammar
    Alphabet
            a b c a:b ;
    
    Sets
    
          A = a ;
          A = b ;
    
    Rules
    
    "R1"
    a:b <= c _ ;
    gives a type 2 warning
    WARNING! LINE 9:
    [2] You are redefining the set A
    
    ^ HERE
    
    Here the set A has been defined twice. This is not fully functional yet! It will warn, if the user attempts to define a set with the same name as an alphabet-symbol. The later definition will be used, when compiling rules.

The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.

  1. (PARTIALLY UNIMPLEMENTED) The grammar
    Alphabet
            a b c a:b ;
    
    Sets
    
          A = a ;
    
    Definitions
    
          A = a* ;
    
    Rules
    
    "R1"
    a:b <= c _ ;
    gives the type 3 warning
    WARNING! LINE 12:
    [3] You are redefining the expression or set A
    
    ^ HERE
    The symbol A is declared once as a set and another time as a definition (a warning is also issued, if there are two declarations of the same definition). This is not fully functional yet! It will warn, if the user attempts to define a set with the same name as an alphabet-symbol. The later definition will be used, when compiling rules.

The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.

  1. Warnings of type 4, won't be given for now, since the compilation of rules has changed significantly, and this warning hasn't been reimplemented yet. Type 4 warnings warn about defining the same rule twice.

  1. The grammar

Resolution of Conflicts between the Rules

A pair-string is accepted by a two-level grammar, iff it is accepted by each of the rues in the grammar. Hence there may be strings, that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases, where it has shown to be beneficial for the overall quality of the grammar to make some automatical modifications to the rules. These so called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.

A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. Normal rule-interaction constrains the surface-realizations of some input-form, but do not loose all of them. In contrast to this rule-conflicts often filter away some input-forms completely. There are many kinds of conflicts, but for the time-being only right-arrow conflicts and left-arrow-conflicts are automatically resolved by hfst-twolc.

Unless hfst-twolc is run with the commandline-parameter --no-report, it will report all rule-conflicts, it observes and if it is run with the parameter --resolve, it will resolve the conflicts.

The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen, Koskenniemi and Kaplan 1987.

Right-Arrow Conflicts

Right-arrow conflicts occur between right-arrow rules (or left-right-arrow rules) with identical centers. Consider the rules

"Rule 1"
a:b => c _ ;

"Rule 2"
a:b => d _ ;

Since Rule 1 requires, that all pairs a:b have to be preceeded by c and Rule 2, that they have to be preceeded by d, their intersection disallows all occurrences of a:b. This may be considered to be an accident.

When hfst-twolc encounters rules, that are in right-arrow-conflict, it reports and resolves the conflict There is a => conflict between the rules Rule1 and Rule2 with respect to the center a:b. Resolving the conflict by joining contexts. by collapsing the rules into a single rule a:b => c _ ; d _ ;

Left-Arrow Conflicts

Left-arrow conflicts occur between left-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules

"Rule 3"
a:b <= c _ ;

"Rule 4"
a <= X _ ;

Rule 3 requires, that an input a be realised as a b following c. The problem is that Rule 4 requires, that it be realised as a following any pair in X:X, among others c. Hence the total effect of the rules is to disallow the occurrence of a pair with input-character a before the pair c.

In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general X _. This might not be the case though. The contexts might be such, that neither is a sub-context of the other. This makes left-arrow-conflicts more complicated than right-arrow-conflicts.

The approach taken in hfst-twolc is to warn about all left-arrow conflicts, but only fix those left-arrow conflicts, where one of the rules is a special case of the other. The conflict is fixed by modifying the more general rule so, that it only applies in contexts, where the more specific rule doesn't apply. In the example above, the resolution-process doesn't effect Rule 3, but changes Rule 4, so that it becomes equivalent with the rule

 
a <= d _ ;

Left/Right -Arrow Conflicts

Besides left- and right-arrow conflicts, there are other kinds of unfortunate interactions between rules. Currently hfst-twolc neither reports, nor fixes such interactions, which makes it important for the grammar-writer to be aware of the possibility of them. Left/right -arrow conflicts involve operators of different types and come in two flavors.

Rules with Identical Centers

Consider the rules

a:b => c _ ;
and
a:b <= d _ ;
The first rule requires, that the a:b pair is immediately preceded by the pair c. The second rule requires, that a be realised as b always when it is preceded by d. Together the rules prohibit the occurrence of an input-character a before the input-character d.

Rules with Different Centers.

Consider the rules

a:b => c _ ;
and
a <= c _ ;
These rules together prohibit the occurrence of the pair a:b anywhere, since a has to be realized as a after c, but this is the only position, where a could be realised as b.

List of Reserved Words

Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       \<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^       #
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.

A Test-Tool for Grammars

Unimplemented Features

This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.

  • Diacritics.

Partial implementations

Since this is an alpha-version of hfst-twolc, there are many features, that have limited functionality.

The where ... ( matched | freely | mixed ) construction is implemented, but is partial in many respects. There is no support for using set-names in the =where=-part of the rule. Hence one can't write a rule

t:d <= Vx _ Vx; where Vx in Vowels;
where Vowels is the set a e i o u y ä ö to say that t becomes d between like vowels. Instead one has to write
t:d <= Vx _ Vx; where Vx in (a e i o u y ä ö);

There is no support for either the freely or mixed options. E.g.

X:Y => a _ ; where X in (s t) Y in (u v);
means the same as
X:Y => a _ ; where X in (s t) Y in (u v) matched;
i.e. is equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ; 
Though there is no support for freely, the option can easily be simulated by writing the rule
X:Y => a _ ; where X in (s t) and Y in (u v);
This makes the rule equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ;
s:v => a _ ;
t:u => a _ ; 

Rules defined with variables, may easily com into conflict with eachother. For now this is treated as any other rule-conflict. Consider the rule

x:y => A _ A ; where A in (s t);
The subcases
x:y => s _ A ;
x:y => t _ A ;
are in a right-arrow conflict with each-other. This is easily solved by conflict-resolution. The case of left-arrow rules is less fortunate. They may easily come into unresolvable conflict with each-other, when the center involves variables.

Conflict-resolution may be very slow.

The OpenFst-implementation may be very slow.

Differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.

References

  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006

-- MiikkaSilfverberg - 13 May 2008