hfst-twolc − A Two-Level Grammar Compiler

Purpose

Compile a two-level grammar in Xerox Twolc formalism into a weighted or unweighted HFST transducer.

Usage

USAGE: hfst-twolc [ OPTIONS ] [ GRAMMARFILE ]

Note: currently hfst-twolc is hfst-twolc.bat on Windows.

Parameters

Parameter name Meaning
-i, --input the rule file.
-o, --output If omitted, the resulting transducer is written to STDOUT.
-s, --silent Don't print any diagnostics messages.
-q, --quiet Don't print any diagnostics messages.
-R, --resolve Attempt to resolve left-arrow conflicts between rules. If omitted, left arrow conflicts aren't resolved.
-D, --dont-resolve-right Don't resolve right arrow conflicts. If omitted, right arrow conflicts are resolved.
-w, --weighted Compile the rules into weighted transducers with zero weights.
-f, --format FORMAT Store result in format FORMAT.
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.
-u, --usage Display usage.

FORMAT may be one of openfst-log, openfst-tropical, foma or sfst. By default format is openfst-tropical.

Outline

Terms and concepts:

input string
the string to be transformed by a FST (in Xerox terminology upper string; in SFST terminology analysis string, sometimes the deep string)
output string
the string into which the FST transforms the input string (in Xerox terminology lower string; in SFST terminology surface string)
set of characters
a set of characters (in SFST terminology range but the word "range" would imply the inclusion of all members between the two extremes)
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b

Syntax

A twol-grammar consists of five parts: Alphabet, Diacritics, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line. There is a fifth optional part Rule-variables, which declares variables used in the rules.

Alphabet

! The alphabet should contain all symbols which are used in the grammar.
! Symbols consist of strings of utf-8 characters. Reserved words and white-space
! need to be quoted using %.
a b c d e f g h i j k l m n o p q r s t u v w x y z    N:m N:n ;

Sets
Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
Vowel = a e i o u y    ;

Definitions

ClosedSyllable = :Vowel+ [ ~:Vowel ]+ ;

Rules

"N:m before input-character p"
! A common morpho-phonetic phenomenon
N:m <=> _ p: ;

"Degradation of p to m after input-character N"
p:m <=> N: _ ;

The rules in the example grammar are from Karttunen 1992. Many of the examples in this manual are taken either from Karttunen 1992 or Karttunen and Koskenniemi 1987.

Regular Expression Syntax

Any pair of symbols defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

  • a:? and a: match any pair in the grammar having input-character a.
  • ?:a and :a match any pair in the grammar having output-character a.
  • ? matches any pair in the alphabet.
  • ?:? same as ?. You may also use : surrounded by white-space.
  • a:0 and 0:a correspond to deletion and insertion of a.
  • 0 matches the empty string (this is probably useless...).

Warning, important When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : ) not (:). Omitting white-space might break the scanning of the grammar (this might be fixed in the future).

By concatenating pairs, one can build longer regular expressions matching strings of pairs. If the alphabet is declared

Alphabet
a e N:m N:n
then the regular expression a N: e will match a N:m e and a N:n e.

Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, always matches the empty string, as well. Please note, that (...) implies optionality. Expressions (...) In e.g. perl regular expression syntax correspond to [...] in twolc syntax.

Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.

There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be a N:n N:m o and let R denote a regular expression. The unary operators are:

  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, using the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the mathcing strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* using the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular-expression R contains all strings not matched by R.
  • The Kleene-star *. The language R* matches any string, which is the concatenation of any number of string from R. Note that the empty string, which is the concatenation of zero strings also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator resembles *, but only matches strings, which are concatenation of a positive number of strings from R. Consequently R+ matches the empty string, iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.

In addition to unary operators there are four binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.

Let R and S be regular expressions. The binary operators are:

  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.
  • Ignore operator /. The language R / S matches any string which is a string in R which may have some strings in S inserted. E.g. a+/b matches e.g. aba, aa and ab.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators were to bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.

Operator Precedence

The operators in HfstTwolC have different precedence. A rule of thumb for precedence: unary operators have the strongest bind, then concatenation and finally binary operators. The constructions [ ... ] and ( ... ) override other precedences.

Operators ordered by precedence from strongest to weakest:

  1. Unary operators: ^INTEGER, $, $., \, ~, *, +
  2. Concatenation
  3. Binary operators: |, & -

E.g. ~a^3 b | c d* is interpreted as

[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]

When in doubt about, which operator binds the strongest, use brackets $[ ... ]$

The Alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

An example of an alphabet is

Alphabet

! The alphabet should contain all symbols used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z    N:n N:m ;

The null symbol (epsilon) 0

Two-level rules express deletions using a null symbol, e.g. by e:0. A zero denotes the null symbol. Epenthesis (or insertion) is denoted, likewise, by e.g. 0:o. In two-level rules such pairs with a null symbol behave much like any other pairs.

Sometimes one needs to refer to the digit zero instead of the null symbol. The digit zero should be quoted, i.e. written as %0 in the alphabet declaration and in rules etc.

(HfstTwolC represents the null symbol (epsilon) internally as @0@, e.g. when using HfstFst2Txt, the null symbol will be displayed as @0@ and the digit zero is displayed as a plain 0. Note that the null symbol here is not the quite the same epsilon which is used in replace rules in a cascade. In two-level rules, it is a place-holder which mapped into an epsilon after the virtual intersecting of the rules.)

Implicit word-boundary .#.

Word boundaries .#. are understood to occur at the beginning and at the end of the lexical level of each string. Thus, the rules may refer to the beginning and the end of each word by writing .#. as the first or the last item in a context. These implicit boundaries are not written in the entries of HfstLexC lexicons. Note thet .#. doesn't refer to any specific symbol in HfstLexc lexicons. It signifies the absolute beginning and end of a string. You can declare your own word boundaries in HfstLexc and HfstTwolc (e.g. #), but these are just symbols.

(The implicit word boundary is represented as @#@ in rule transducers and the quoted symbol hash sign %# is represented as # in rule transducers as displayed by HfstFst2Txt. When HfstComposeIntersect combines a lexicon transducer and set of two-level rules, it inserts a @#@ at the very beginning and end of the lexicon before doing the combined operation of intersecting and composing. Thus, you do not write it explicitly it in your lexicon.)

Diacritics

The morphophonological description of a language may contain symbols, which

  • act as triggers for certain phonological rules,
  • are irrelevant for all other rules and
  • should not be present in the surface representation of word-forms.

E.g. markers for syllable-boundaries or stress markers and all kinds of markers appended to word-forms by the lexicon may be such symbols.

It's easiest to declare such symbols diacritics in hfst-twolc. This is done by mentioning them in the section Diacritics, which may look like

Diacritics

      ! The symbol . marks a syllable-boundary.
      . ;

Diacritics have the following properties

  • They always correspond to 0 on the output-side.
  • All diacritics, that aren't explicitly mentioned in a rule are invisible to that rule.

E.g. given the diacritics-declaration above and the set Vowel given in the next section, the rule

I:j <=> Vowel _ Vowel ;
allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.

Warning, important You shouldn't declare flag diaritics used in HfstLexC lexicons as diacritics. These are handled using a different mechanism and needn't be mentioned in the HfstTwolC grammar.

Rule-variables

This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables had to be declared, but this isn't mandatory in HfstTwolC 2.0.

Rules may contain variables. Any variable used, can be declared in the Rule-variables section.

An example

Rule-variables

        Cx Cy Cz Vx Vy ;

Sets

The second part of the grammar specifies named character-sets like

Vowel  = a e i o u y    ;

Sets may be used in rules as a short-hand for collections of character-pairs.

Perhaps one might want to write a rule, which states, that the phoneme t is realised as its voiceless fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by the rule

t:&#1257; <= :Vowel _ :Vowel ;

The construction :Vowel will match any pair, used in some rule, where the output symbol is a vowel. Please, note that also pairs which result from repolacinf variables with their values in rules add to set constructions. Consider the following vowel-harmony rule regarding the archiphonemes %^A, %^O and %^U

VMP:Vx <=> BackVowel :* _ ; where VMP in (%^A %^O %^U) Vx in (a o u) matched; 
The pairs %^A:a, %^O:o and %^U:u will match :Vowel regardless of whether the the pairs have been declared in the alphabet.

It is possible to define a set having the same name as an alphabet character. There is no guarantee what will happen, if this is done.

Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. All sets may be used in definitions and all definitions, which have been made before a particular definition, may be used as a part of that definition.

It is possible to define a named regular expression having the same name as a set or alphabet character. There is no guarantee what will happen, if this is done.

Rules

Two-level rules consist of a center, a rule-operator and contexts.

The center-language (C) is a

  • character-pair (e.g. a:b),
  • a more general pair-construct of a single character (e.g. a: or :a),
  • a set construct like a:S, where S is a symbol set,
  • or a disjunction of such centers (e.g. a:b | b: | c:d | a:S). (Note that the list may not be enclosed in brackets which were allowed when using the Xerox twolc.)

A context consists of two regular expressions (Li and Ri) separated by an underscore. Schematically

C OP L1 _ R1 ;
     L2 _ R1 ;
       ...
     Ln _ Rn ;
A rule has to have at least one context and it may have as many as are needed.

A rule with variables, is a rule, where some of the characters in character-pairs are variables, not actual alphabetical characters. A rule with variables has to have an additional so called where-part, which shows how the variables in the rule should be instantiated.

Ordinary Two-Level Rules

Two-level rules are constraints, regulating the distribution of the pairs in their center-language according to the rule-operator and contexts given. Four different kinds of rules-operators may be used in hfst-twolc

<=, =>, <=> and /<=
The final context, which is compiled into the transducer representing the two-level rule is the union of the contexts given.

Right-arrow rules constrain the distribution of a symbol-pair by specifying, that it may only occur in a specific context (or some specific contexts). Let the set V be the set of vowels in some language. An example of a right-arrow rule is

I:j => :V _ :V ;
It states, that the input-character I can be realised as j only in a contex, where it is surrounded by output vowels. The rule doesn't constrain the distribution of any other pairs I:X, nor does it constrain the distribution of pairs X:j, where X is something other than I. It simply states, that if the pair I:j occurs, it has to occur between two output vowels.

The context :V _ :V in the example is automatically extended to a so called total context, by hfst-twolc. This means that, when the rule is compiled, the context will become ?* :V _ :V ?*. This applies to all kinds of rule-operators.

Left-arrow rules constrain the set of output-characters corresponding to an input-character in some context. An example of a left-arrow rule is

N:m <= _ p: ;
It states, that an input-character N has to be realized as the output-character m if it is followed by some pair with input-character p. The rule doesn't constrain the realizations of the input-character N in any other context, than the one specified, so it never disallows any occurrences of the pair N:m. It does disallow all other pairs N:X in the context _ :p, though.

Left-arrow rules differ from right-arrow rules, because they are asymmetric with regard to the input- and output-level of pair-strings. The left-arrow example above, doesn't limit the input-character of a pair preceding p:, it only limits the output-character, if the input-character is N. Such an asymmetry is not present in left-arrow rules, which limit a particular pair into a particular kind of context.

Left/right -arrow rules, give a necessary and sufficient conditions for the realization of an input-character as some output-character. An example of a left/right -arrow rule is

K:' <=> :Vowel :a _ :a ClosedOffSet;
which states, that the morpho-phoneme K is realized as ' exactly in contexts where a vowel and an output a precede and one output a and a closed syllable-offset follows (this describes a convention of Finnish orthography stemming from consonant gradation). Any left/right arrow rule is equivalent to the joined effect of the corresponding left- and right-arrow rules. Hence the example is equivalent to the pair of rules
K:' <= :Vowel :a _ :a ClosedOffSet;
and
K:' => :Vowel :a _ :a ClosedOffSet;
Actually the alternation K:' isn't constrained to a context, where two a:s precede. It happens between any two like vowels. To describe this nicely, without using five very similar rules, one needs rule-variables, which will be presented shortly.

Prohibition rules disallow the realization of an input-character as some output-character in some contexts. Let again V denote the set of vowels. An example of a prohibition rule is

I:i /<= :V _ :V ;
which states, that the input-character I may not be realized as i between output-vowels.

Like right-arrow rules, prohibition rules are symmetric with respect to the input- and output-level of pair-strings. In fact it is often possible to state a particular constraint both as a prohibition rule concerning some pair and a left-arrow rule concerning an other. If the input-character I may only be realized as i or j, then the rules

I:i /<= :V _ :V ;
and
I:j => :V _ :V ;
state the exactly same constraint. Still, if the number of realizations is greater, it may be much easier to state the constraint using one of the operators than the other.

Rules with variables

As an easy short-hand for defining (a possibly large) set of similar two-level rules, rule-variables have been included to hfst-twolc. Consider the following rule, which is needed for gradation of stops in Finnish

"Gradation of k to '"
K:' <=> Vowel Vx _ Vx ClosedOffset ; where Vx in Vowel ;
It deals with the realization of the morpho-phoneme K, when it is the onset of a closed syllable, which is preceded by an open syllable with a two-vowel nucleus. The rule states, that K is realized as ' (a glottal stop), if the nucleus of the preceding syllable ends with the same vowel, which figures as the nucleus of the closed syllable.

The rule above couldn't be stated as a single rule, without variables, since there are no other mechanisms for specifying dependences between parts of the contexts of two-level rules. The use of the variable Vx is said to match the occurrences of the set Vowel.

It is possible to match occurrences of variables from different sets, as well. Consider the following rule, which also deals with gradation of stops in finnish

"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ; where Cx in ( K P T )
                                  Cy in ( k p t )
                            matched;
The rule states, that the morpho-phonemes K, P, T vanish, when they serve as the onset of a closed syllable and are preceded by a surface k, p or t respectively. Here the occurrences of the variable Cx are matched with those of Cy. For instance, nothing is said about an input K preceded by an output p. The rule is only concerned with input-level characters K preceded by output-level characters k.

Occurences of variables aren't matched by default. If you want them to be matched, you have to use the keyword matched. Other possible keywords are freely and mixed. These are most easily explained using examples.

The rule

 a:b => X _ Y; where X in (a b) Y in (a b) freely;

corresponds to the joint effect of the simple two-level rules

a:b => a _ a ;
a:b => a _ b ;
a:b => b _ a ;
a:b => b _ b ;

The rule

 a:b => X _ Y; where X in (a b) Y in (a b) mixed;

corresponds to the joint effect of the simple two-level rules

a:b => a _ b ;
a:b => b _ a ;

Rules with negative contexts

Sometimes it is easier to formulate rule contexts as the difference of two contexts. In such cases it's possible to use negative contexts.

In a description of the phonology of the Kyrgyz language, the preceding surface vowel determines the realization of the archiphoneme {A}, but the realization also depends on whether an archiphoneme {U} and an optional morpheme boundary follow. The two level rule which governs the realization of archiphoneme {A}, when there is no following archiphoneme {U}, looks like this

 "Vowel harmony for archiphoneme {A}"
{A}:Vy  <=> [ :LastVowel :Cns* [ :Cns - &#1081;: ] ] \[ :0 | %>: ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;
                                              [ %{A%}:LastVowel ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;

        where LastVowel in (  &#1080;  &#1199;  &#1077;  &#1101;  &#1257;  &#1103;  &#1072; )
              Vy in        (  &#1077;  &#1257;  &#1077;  &#1077;  &#1257;  &#1072;  &#1072; )
        matched ;
The right context [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] states the restriction, that the rule applies only when {A} is not followed by (an optional) >: and {U}:. The right context is quite tricky, because we have to take into account the possibility that {A} is followed by two morpheme boundaries (which could happen in some marginal cases) and that {A}: may be the last pair in a word. This context is so tricky, that it can easily be formulated incorrectly. In cases where the prohibited right context is more complicated, it will be very difficult to find the correct way to prevent it. Hence HfstTwolC allows for using negative contexts to restrict the application of twol-level rules.

Using negative contexts, the rule "Vowel harmony for archiphoneme {A}" can be formulated by referring directly to the prohibited right context [ ( %>: ) %{U%}: ]

"Vowel harmony for archiphoneme {A}"
{A}:Vy  <=> [ :LastVowel :Cns* [ :Cns - &#1081;: ] ] \[ :0 | %>: ] _ ;
            [ %{A%}:LastVowel ] _  ;

        except 
            _ ( %>: ) %{U%}: ;

        where LastVowel in (  &#1080;  &#1199;  &#1077;  &#1101;  &#1257;  &#1103;  &#1072; )
              Vy in        (  &#1077;  &#1257;  &#1077;  &#1077;  &#1257;  &#1072;  &#1072; )
        matched ;

Often negative rule contexts and conflict reolution will produce the same result, but conflict resolution requires that the conflicting rules can be formulated in such a way that they form a chain of subcases. If this is not the case, an unresolvable conflict arises and conflict resolution cannot apply. In such cases negative contexts can be used to restrict one or more of the rules so that their context becomes disjoint from the context of the conflicting rule, which means that no rule conflict arises.

The list of negative contexts should follow the list of positive contexts. The list of negative contexts has to start with the keyword except.

"Example of rule syntax. x:y in LEFT _ RIGHT except when the context is L1 _ R1 or L2 _ R2"
x:y <=> LEFT _ RIGHT ;
        
    except
        L1   _ R1    ;
        L2   _ R2    ;

Regular expression center rules

Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only concern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the the affixes consist of multiple symbols. Such penomena are more conveniently described using rules with regular expression centers.

The following grammar describes the choice of the prefix signifying 1st person present form of a verb in Ojibwe.

Alphabet

a b c d e f g h i j k l m
n o p q r t u v w x y z zh ;

Sets

Vowel        = a e i o u y ;
AlveolarAndG = d g j z zh  ;

Definitions

! The possible first person present form markers are ind-, in-, im- and ni-.
! The choice of prefix is conditioned on phonological context.

IN  = <PRES>:i <PRES>:n <PRES>:0 ;
IM  = <PRES>:i <PRES>:m <PRES>:0 ;
IND = <PRES>:i <PRES>:n <PRES>:d ;
NI  = <PRES>:n <PRES>:i <PRES>:0 ;

! The list of all possible prefixes.
PREFIX = [ IN | IM | IND | NI ] ;

Rules

! We declare that "<PRES> <PRES> <PRES>" can only be realized as
! "i n d", "i n", "i m" or "n i".

"1st person present prefixes."
<[ PREFIX ]> <== _ ;

! The following rules restrict
!
! - "<PRES>:i <PRES>:n <PRES>:0" to contexts where an alveolar consonant or
!   "g" follows,
! - "<PRES>:i <PRES>:m <PRES>:0" to contexts where "b" follows and
! - "<PRES>:i <PRES>:n <PRES>:d" to contexts where a vowel follows.
!
! The default variant of the 1st person present marker is
! "<PRES>:n <PRES>:i <PRES>:0".

"1st person person present prefix before an alveolar consonant or g is in."
<[ IN ]> <==> _  AlveolarAndG ;

"1st person person present prefix before b in im."
<[ IM ]> <==> _ b: ;

"1st person person present prefix before a vowel is ind."
<[ IND ]> <==> _ Vowel ;
The grammar will generate the following 1st person present forms out of the verb baseforms nibaa (to sleep), anokii (to work), dagoshin (to arrive) and bakade (to be hungry).
<PRES> <PRES> <PRES> n i b a a       --> n i 0 n i b a a
<PRES> <PRES> <PRES> a n o k i i     --> i n d a n o k i i
<PRES> <PRES> <PRES> d a g o s h i n --> i n 0 d a g o s h i n
<PRES> <PRES> <PRES> b a k a d e     --> i m 0 b a k a d e

The syntax for regular expression center rules is similar to the syntax of ordinary rules, but the center needs to be enclosed in brackets <[ ... ]> and the rule operators look slightly different

==>, <==, <==>, /<==
Their semantics is the same as for ordinary twolc rules.

Regular expression center rules can be used to describe non-concatenative phenomena such as derivation of words from consonant stems in Arabic by adding vowels between the consonants. The following grammar derives two noun and two verb forms from roots consisting of three consonants. The grammar implements the example in Table 1. in Attia et al. 2011:

Root drs
Patterns R1aR2aR3a R1aR2R2aR3a R1aaR2iR3 muR1aR2R2iR3
POS V V N N
Stem darasa darrasa daaris mudarris
ENG 'study' 'teach' 'student' 'teacher'

Alphabet

!! Special symbols, which mark different derivations 
!! using root and pattern interdigitation.
!!
!! These forms correspond to the examples in Table 1 in the
!! Attia et al. SFCM 2011 article.
%<VERB1%>:0 %<NOUN1%>:0 %<VERB2%>:0 %<NOUN2%>:0

!! The different realizations for the abstract vowel <V>. It can
!! be realized as any surface vowel or epsilon.
%<V%>:e %<V%>:y %<V%>:u %<V%>:u %<V%>:i %<V%>:o %<V%>:a 
%<V%>:0

q w r t p s d f g h j k l z x c v b n m

a e i o u y 
;

Sets

!! Consonants
cns = q w r t p s d f g h j k l z x c v b n m;

Rules

!! These rules transform the underlying forms 
!!
!!          C1 <V> <V> C2 <C> C3 <V> <VERB1>
!!          C1 <V> <V> C2 <C> C3 <V> <VERB2>
!!          C1 <V> <V> C2 <C> C3 <V> <NOUN1>
!! m u <MU> C1 <V> <V> C2 <C> C3 <V> <NOUN2>  
!!
!! into surface realizations. Here C1, C2 and C3 are consonants, <C> the consonant 
!! doubling morphophoneme and <V> the abstract vowel symbol.   

!! For future improvement of hfst-twolc: Need to implement variables in 
!! regular expression center rules.

"VERB 1 rule"

!! drs -> darasa

<[ cns %<V%>:a %<V%>:0 cns %<C%>:0 %<V%>:a cns %<V%>:a ]> <==> _ %<VERB1%>:0 ;


"NOUN 1 rule"

!! drs -> daaris

<[ cns %<V%>:a %<V%>:a cns %<C%>:0 %<V%>:i cns %<V%>:0 ]> <==> _ %<NOUN1%>:0 ;


"VERB 2 rule"

!! drs -> darrasa

<[ cns %<V%>:a %<V%>:0 cns %<C%>:cns %<V%>:a cns %<V%>:a ]> <==> _ %<VERB2%>:0 ;


"NOUN 2 rule"

!! mu + drs -> mudarris

<[ cns %<V%>:a %<V%>:0 cns %<C%>:cns %<V%>:i cns %<V%>:0 ]> <==> %<MU%>:0 _ %<NOUN2%>:0 ;


"Consonant doubling"

!! %<C%> either vanishes or is realized as the same symbol as the 
!! preceding consonant.

%<C%>:CNS => CNS _ ; where CNS in cns ;

The grammar will generate the following forms from their underlying representations

         d <V> <V> r <C> <V> s <V> <NOUN1>  -->       d a a r 0 i s 0 0
m u <MU> d <V> <V> r <C> <V> s <V> <NOUN2>  --> m u 0 d a 0 r r i s 0 0
         d <V> <V> r <C> <V> s <V> <VERB1>  -->       d a a r 0 a s a 0
         d <V> <V> r <C> <V> s <V> <VERB2>  -->       d a 0 r r i s 0 0
Warning, important Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.

Warning, important If you use regular expressions with stars, plusses or pairs 0:x in the center, the rules become very difficult to understand, so it's probably best to use relatively simple center languages.

Weighted rules (NOT FULLY IMPLEMENTED YET)

It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation. At this time it is only possible to compile weighted rules with zero weights.

Error-Messages and Warnings

If the grammar given to hfst-twolc contains statements, which

  • don't conform to the syntax specified in this manual,
  • are illogical,
  • result in rule-transducer, whose intersection might be empty or
  • over-shadow other statements.

error messages or warnings will be issued. Statements, which make it impossible to complete the compilation of the grammar lead to error-messages and disruption of the compilation-process. Statements, that over-shadow other statements, or may lead to rule-sets whose intersection is empty lead to warning-messages.

Error-Messages

Errors in hfst-twolc are divided into two cathegories. Syntax-errors and logical errors.

Syntax Errors

A syntax-error is given, when the input-file violaes the syntax-specifications in this manual. When this happens, hfst-twolc gives an error-message and the compilation-process seizes, without writing to the output-file. An example of an syntax-related error-message is

ERROR ON LINE 79:
syntax error, unexpected CENTER_MARKER, expecting DIFFERENCE or INTERSECTION or UNION or RIGHT_SQUARE_BRACKET
Aborted.

An error-message consists of

  • the number of the line, where the error occurred,
  • a statemetn of which token caused the compilation to halt and what kind of token was expected,
  • the line, which contaied the error and
  • a marker, which points out the place, where the error occurred.

Note, that it is not always possible to say exactly where the actual error was. Sometimes even the line on which the error occurs can't be signled out.

The correspondences between tokens and token-names should be pretty clear, but here's a list

Token-name Token
ALPHABET_DECLARATION Alphabet
DIACRITICS_DECLARATION Diacritics
VARIABLE_DECLARATION Rule-Variables
DEFINITION_DECLARATION Definitions
SETS_DECLARATION Sets
RULES_DECLARATION Rules
WHERE where
MATCHED matched
MIXED mixed
IN in
NEWLINE A newline.
RULE_NAME A quoted string of characters (except ").
AND and
STAR *
PLUS +
COMPLEMENT ~
TERM_COMPLEMENT \
FREELY_INSERT /
CONTAINMENT_ONCE $.
CONTAINMENT $
ANY ?
UNION
|
INTERSECTION &
POWER ^
DIFFERENCE -
NUMBER A positive or negative integer.
EPSILON 0
LEFT_SQUARE_BRACKET [
RIGHT_SQUARE_BRACKET ]
LEFT_BRACKET (
RIGHT_BRACKET
)
LEFT_RESTRICTION_ARROW
/<=
LEFT_ARROW
<=
RIGHT_ARROW
=>
LEFT_RIGHT_ARROW
<=>
PAIR_SEPARATOR_BOTH A : preceded by white-space and followed by something, that isn't a SYMBOL.
PAIR_SEPARATOR_RIGHT A : preceeded by white-space and followed by a SYMBOL.
PAIR_SEPARATOR_LEFT A : preceeded by a SYMBOL and followed by something, that isn't a SYMBOL.
PAIR_SEPARATOR A : preceeded and followed by a SYMBOL.
EOL ;
EQUALS
=
CENTER_MARKER _
SYMBOL A sequence of characters, where every special-character (i.e. one with a special meaning like [, ;, or %) has been quoted. A symbol may not contain newlines!

Logical Errors

hfst-twolc currently only gives two kinds of logical error.

Symbols, which are used in rules, but not declared in the alphabet, give a logical error.

Let a grammar contain the following rule

"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ; where Cx in ( K P T )
                                  Cy in ( k p )
                            matched;
Here the sets ( K P T ) and ( k p ) are of unequal length, so it is impossible to match the variables Cx and Cy. An error-meesge is issued
ERROR ON LINE 87:
Cx and Cy can't be matched since they correspond to lists of un-equal lengths!
                            matched;
                                    ^ HERE
Aborted.

Resolution of Conflicts between the Rules

A pair-string is accepted by a two-level grammar, iff it is accepted by each of the rules in the grammar. Hence there may be strings, that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases, where it has shown to be beneficial for the overall quality of the grammar to make some automatical modifications to the rules. These so called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.

A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in Yli-Jyr and Koskenniemi 2006. Normal rule-interaction constrains the surface-realizations of some input-form, but do not loose all of them. In contrast to this rule-conflicts often filter away some input-forms completely. There are many kinds of conflicts, but for the time-being only right-arrow conflicts and left-arrow-conflicts are automatically resolved by hfst-twolc.

Unless hfst-twolc is run with the commandline-parameter --silent, it will report all rule-conflicts. It always resolves right-arrow conflicts and it resolves left-arrow conflicts if it is run with the parameter --resolve.

The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen, Koskenniemi and Kaplan 1987.

Right-Arrow Conflicts

Right-arrow conflicts occur between right-arrow rules (or left-right-arrow rules) with identical centers. Consider the rules

"Rule 1"
a:b => c _ ;

"Rule 2"
a:b => d _ ;

Since Rule 1 requires, that all pairs a:b have to be preceeded by c and Rule 2, that they have to be preceeded by d, their intersection disallows all occurrences of a:b. This may be considered to be an accident.

When hfst-twolc encounters rules, that are in right-arrow-conflict, it reports and resolves the conflict There is a => conflict between the rules Rule1 and Rule2 with respect to the center a:b. Resolving the conflict by joining contexts. by collapsing the rules into a single rule a:b => c _ ; d _ ;

Left-Arrow Conflicts

Left-arrow conflicts occur between left-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules

"Rule 3"
a:b <= c _ ;

"Rule 4"
a <= X _ ;

Rule 3 requires, that an input a be realised as a b following c. The problem is that Rule 4 requires, that it be realised as a following any pair in X:X, among others c. Hence the total effect of the rules is to disallow the occurrence of a pair with input-character a before the pair c.

In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general X _. This might not be the case though. The contexts might be such, that neither is a sub-context of the other. This makes left-arrow-conflicts more complicated than right-arrow-conflicts.

The approach taken in hfst-twolc is to warn about all left-arrow conflicts, but only fix those left-arrow conflicts, where one of the rules is a special case of the other. The conflict is fixed by modifying the more general rule so, that it only applies in contexts, where the more specific rule doesn't apply. In the example above, the resolution-process doesn't effect Rule 3, but changes Rule 4, so that it becomes equivalent with the rule

 
a <= d _ ;

Left/Right -Arrow Conflicts

Besides left- and right-arrow conflicts, there are other kinds of unfortunate interactions between rules. Currently hfst-twolc neither reports, nor fixes such interactions, which makes it important for the grammar-writer to be aware of the possibility of them. Left/right -arrow conflicts involve operators of different types and come in two flavors.

Rules with Identical Centers

Consider the rules

a:b => c _ ;
and
a:b <= d _ ;
The first rule requires, that the a:b pair is immediately preceded by the pair c. The second rule requires, that a be realised as b always when it is preceded by d. Together the rules prohibit the occurrence of an input-character a before the input-character d.

Rules with Different Centers.

Consider the rules

a:b => c _ ;
and
a <= c _ ;
These rules together prohibit the occurrence of the pair a:b anywhere, since a has to be realized as a after c, but this is the only position, where a could be realised as b.

List of Reserved Words

Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       /<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^       #
%
The words and constructs may be used in rules by quoting with %. E.g. %? means question-mark, not any character-pair defined in the alphabet and %Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example %Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.

Warning, important HfstTwolC reserves symbol-names beginning with two underscores for internal use.

Known bugs

  • Warnings for unequal value lists in variable-rules with keyword matched and mixed aren't working correctly. This may result in the compiler getting stuck in an endless loop. This is going to get fixed.
  • The word-boundary symbol # needs to be declared separately in the alphabet (or as a diacritic). This is going to get fixed.

References

  • A. Yli-Jyr, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006

  • M. Attia, P. Pecina, A. Toral, L. Tounsi and J. Van Genabith, A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer, Proceedings of the Second Workshop on Systems and Frameworks for Computational Morphology, Springer, 2011.

Obtaining the program and installing

hfst-twolc is a part of HfstCommandLineTools.