OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

Usage

htwolc [ --lexicon FILE ] --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ]

Parameter name function
lexicon the lexicon file. If omitted, the lxicon is read from STDIN.
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
test Toggle test-mode. If this parameter is present, the rues won't be compiled, but tested instad.

Outline

Terms and concepts:

input string
the string to be transformed by a FST (in Xerox terminology upper string; in SFST terminology analysis string, sometimes the deep string)
output string
the string into which the FST transforms the input string (in Xerox terminology lower string; in SFST terminology surface string)
set of characters
a set of characters (in SFST terminology range but the word "range" would imply the inclusion of all members between the two extremes)
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b

Syntax

A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of a line.

Alphabet

! The alphabet should contain all pairs used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z    N:n N:m ;

Sets
Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
Vowel = a e i o u y    ;

Definitions

ClosedSyllable = :Vowel+ [ ~:Vowel ]+ ;

Rules

! input/output -pairs for testing the rule-set:

!input:  k a N p a n 
!output: k a m m a n

!input:  k a N T a n 
!output: k a n n a n

!input:  k a m p i 
!output: k a m p i

"N:m before input-character p"
! A common morpho-phonetic phenomenon
N:m <=> _ p: ;

"Degradation of p to m after input-character N"
p:m <=> N: _ ;

Regular expression syntax

Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

  • a:? and a: match any pair in the alphabet having input-character a.
  • ?:a and :a match any pair in the alphabet having output-character a.
  • ? matches any pair in the alphabet.
  • ?:? same as ?.
  • 0 matches the empty string.

Concatenating pairs, one can build longer regular expressions matcing pairs of strings. Is the alphabet is declared

Alphabet
a N:n N:m e
then the regular expression a N: e will match the pairs of strings a N:n e and a N:m e.

Regular expressions can be grouped together using the parenthesis-constructions [ ... ] ans ( ... ). If R is a regular expression, then [ R ] matches exactly the same pairs of string as R does. The construction ( R ), on the other hand, matches the empty string, as well.

Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.

There are seven unary regular-expression operators in htwolc for the time being. Let the Alphabet be [ a N:n N:m o] and let R denote a regular expression. The unary operators are:

  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, with the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the mathcing strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* with the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular-expression R contains all strings not matched by R.
  • The Kleene-star *. The language R* matches any string, that is a concatenation of any number of string from R. Note that the empty string, which is the concatenation of zero strings also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator resembles the *, but it only matches strings, which are concatenation of a positive number of strings from R. Consequently R+ matches the empty string, iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.

In addition to the unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, when using the disjunction-operation |, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.

Let R and S be regular expressions. The binary operators are:

  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S mathces any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators would bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.

Precedence

The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.

Operators ordered by precedence from strongest to weakest:

  1. Unary operators: ^INTEGER, $, $., \, ~, *, +
  2. Concatenation
  3. Binary operators: |, & -

E.g. ~a^3 b | c d* is interpreted as

[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]

The alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs consisting of a input-character and a output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.

Every pair of character referred to in some of the rules, has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs, that aren't declared in the alphabet.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

The sets

The second part of the grammar specifies named character-ranges like

Vowel  = a e i o u y    ;

Sets may be used in rules as a short-hand for collections of character-pairs. Perhaps one might want write a rule, which states, that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by a rule

t:&#1257; <= :Vowel _ :Vowel ;

Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.

The rules

Error-messages and warnings

List of reserved words

Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       \<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.

Types of rules

Ordinary twol-rules

Generalized context-restrictions

Special rule-constructs

Resolution of conflicts between the rules

A test-tool for grammars

Getting the program

Installing

Unimplemented features

This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed.

  • Diacritics.
  • One should be able to freely insert newlines in regular expressions.
  • In rules with multiple contexts, different contexts should be separated by a semicolon ;, not by an obligatory newline.
  • Rules containing variables. E.g.
    e:&#7869; <= Nx _ Nx ; where Nx in Nasal ;
    Here Nasal is a set (in the Xerox meaning).
  • where ... ( matched | freely | mixed ) construction.
  • A default symbol for word-boundaries (not clear, whether this will be implemented or not).
  • Resolution of conflicts between the rules.

Differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.

References


-- MiikkaSilfverberg - 13 May 2008