OMorFi: htwolc -- An Open-source Two-level Grammar Compiler


htwolc [ --lexicon FILE ] --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ]

Parameter name function
lexicon the lexicon file. If omitted, the lxicon is read from STDIN.
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
test Toggle test-mode. If this parameter is present, the rues won't be compiled, but tested instad.


Terms and concepts:

input string
the string to be transformed by a FST (in Xerox terminology upper string; in SFST terminology analysis string, sometimes the deep string)
output string
the string into which the FST transforms the input string (in Xerox terminology lower string; in SFST terminology surface string)
set of characters
a set of characters (in SFST terminology range but the word "range" would imply the inclusion of all members between the two extremes)
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b


A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of a line.


! The alphabet should contain all pairs used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z    N:n N:m ;

Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
Vowel = a e i o u y    ;


ClosedSyllable = :Vowel+ [ ~:Vowel ]+ ;


! input/output -pairs for testing the rule-set:

!input:  k a N p a n 
!output: k a m m a n

!input:  k a N T a n 
!output: k a n n a n

!input:  k a m p i 
!output: k a m p i

"N:m before input-character p"
! A common morpho-phonetic phenomenon
N:m <=> _ p: ;

"Degradation of p to m after input-character N"
p:m <=> N: _ ;

Regular expression syntax

Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

  • a:? and a: match any pair in the alphabet having input-character a.
  • ?:a and :a match any pair in the alphabet having output-character a.
  • ? matches any pair in the alphabet.
  • ?:? same as ?.
  • 0 matches the empty string.

Concatenating pairs, one can build longer regular expressions matcing pairs of strings. Is the alphabet is declared

a N:n N:m e
then the regular expression a N: e will match the pairs of strings a N:n e and a N:m e.

Regular expressions can be grouped together using the parenthesis-constructions [ ... ] ans ( ... ). If R is a regular expression, then [ R ] matches exactly the same pairs of string as R does. The construction ( R ), on the other hand, matches the empty string, as well.

Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.

There are seven unary regular-expression operators in htwolc for the time being. Let the Alphabet be [ a N:n N:m o] and let R denote a regular expression. The unary operators are:

  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, with the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the mathcing strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* with the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular-expression R contains all strings not matched by R.
  • The Kleene-star *. The language R* matches any string, that is a concatenation of any number of string from R. Note that the empty string, which is the concatenation of zero strings also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator resembles the *, but it only matches strings, which are concatenation of a positive number of strings from R. Consequently R+ matches the empty string, iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.

In addition to the unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, when using the disjunction-operation |, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.

Let R and S be regular expressions. The binary operators are:

  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S mathces any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators would bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.


The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.

Operators ordered by precedence from strongest to weakest:

  1. Unary operators: ^INTEGER, $, $., \, ~, *, +
  2. Concatenation
  3. Binary operators: |, & -

E.g. ~a^3 b | c d* is interpreted as

[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]

The alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs consisting of a input-character and a output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.

Every pair of character referred to in some of the rules, has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs, that aren't declared in the alphabet.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

The sets

The second part of the grammar specifies named character-ranges like

Vowel  = a e i o u y    ;

Sets may be used in rules as a short-hand for collections of character-pairs. Perhaps one might want write a rule, which states, that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by a rule

t:&#1257; <= :Vowel _ :Vowel ;


The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.

The rules

Error-messages and warnings

List of reserved words

Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       \<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.

Types of rules

Ordinary twol-rules

Generalized context-restrictions

Special rule-constructs

Resolution of conflicts between the rules

A pair-string is accepted by a two-level grammar, iff it is accepted by each of the rues in the grammar. Hence there may be strings, that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases, where it has shown to be beneficial for the overall quality of the grammar to make some automatical modifications to the rules. These are the so called right- and left-arrow conflicts and are handled in hfst-twolc by the mechanism of conflict-resolution.

A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in Yli-Jyr and Koskenniemi 2006. There are many kinds of conflicts, but for the time-being only right-arrow conflicts and left-arrow-conflicts are automatically resolved by hfst-twolc.

Unless hfst-twolc is run with the commandline-parameter --no-report, it will report all rule-conflicts, it observes and if it is run with the parameter --resolve, it will resolve the conflicts.

The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen, Koskenniemi and Kaplan 1987.

Right-Arrow Conflicts

Right-arrow conflicts occur between right-arrow rules (or left-right-arrow rules) with identical centers. Consider the rules

"Rule 1"
a:b => c _ ;

"Rule 2"
a:b => d _ ;

Since Rule 1 requires, that all pairs a:b have to be preceeded by c and Rule 2, that they have to be preceeded by d, their intersection disallows all occurrences of a:b. This may be considered to be an accident.

When hfst-twolc encounters rules, that are in right-arrow-conflict, it reports and resolves the conflict There is a => conflict between the rules Rule1 and Rule2 with respect to the center a:b. Resolving the conflict by joining contexts. by collapsing the rules into a single rule a:b => c _ ; d _ ;

Left-Arrow Conflicts

Left-arrow conflicts occur between right-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules

"Rule 3"
a:b <= c _ ;

"Rule 4"
a <= X _ ;

Rule 3 requires, that an input a be realised as a b following c. The problem is that Rule 4 requires, that it be realised as a following any pair in X:X, among others c. Hence the total effect of the rules is to disallow the occurrence of a pair with input-character a before the pair c.

In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general X _. This might not be the case though. The contexts might be such, that neither is a sub-context of the other. This makes left-arrow-conflicts more complicated than right-arrow-conflicts.

Left-Right-Arrow Conflicts

Rules with identical centers

Rules with different centers.

A test-tool for grammars

Getting the program


Unimplemented features

This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed.

  • Diacritics.
  • One should be able to freely insert newlines in regular expressions.
  • In rules with multiple contexts, different contexts should be separated by a semicolon ;, not by an obligatory newline.
  • Rules containing variables. E.g.
    e:&#7869; <= Nx _ Nx ; where Nx in Nasal ;
    Here Nasal is a set (in the Xerox meaning).
  • where ... ( matched | freely | mixed ) construction.
  • A default symbol for word-boundaries (not clear, whether this will be implemented or not).
  • Resolution of conflicts between the rules.

Differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.


  • A. Yli-Jyr, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006

-- MiikkaSilfverberg - 13 May 2008