hfst-twolc − A Two-Level Grammar Compiler

Purpose

Compile a two-level grammar in Xerox Twolc formalism into a weighted or unweighted HFST transducer.

Usage

USAGE: hfst-twolc [ OPTIONS ] [ GRAMMARFILE ]

Parameters

Parameter name Meaning
-i, --input The rule file.
-o, --output The output file. If omitted, the resulting transducer is written to STDOUT.
-s, --silent Don't print any diagnostic messages.
-q, --quiet Don't print any diagnostic messages.
-r, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
-N, --names If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstPairTest.
-w, --weighted Compile the rules into weighted transducers with zero weights.
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.
-u, --usage Display usage.

Outline

Terms and concepts:

input string
the string to be transformed by an FST (in Xerox terminology upper string; in SFST terminology analysis string, sometimes the deep string)
output string
the string into which the FST transforms the input string (in Xerox terminology lower string; in SFST terminology surface string)
set of characters
a set of characters (in SFST terminology range but the word "range" would imply the inclusion of all members between the two extremes)
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to an FST; the left-hand side of a pair, i.e. a in a pair a:b

Syntax

A twol-grammar consists of five parts: Alphabet, Diacritics, Sets, Definitions and Rules. Each part contains statements, which end in a ; character, and comments, which begin with a ! character and extend to the end of the line.

Alphabet

! The alphabet should contain symbols which are used in the grammar.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z    N:m N:n ;

Sets
Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
Vowel = a e i o u y    ;

Definitions

ClosedSyllable = :Vowel+ [ ~:Vowel ]+ ;

Rules

"N:m before input-character p"
! A common morpho-phonetic phenomenon
N:m <=> _ p: ;

"Degradation of p to m after input-character N"
p:m <=> N: _ ;

The rules in the example grammar are from Karttunen 1992. Many of the examples in this manual are taken either from Karttunen 1992 or Karttunen and Koskenniemi 1987.

Regular Expression Syntax

Any pair of symbols defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

  • a:? and a: match any pair in the grammar having input-character a.
  • ?:a and :a match any pair in the grammar having output-character a.
  • ? matches any pair in the alphabet.
  • ?:? same as ?. You may also use : surrounded by white-space.
  • a:0 and 0:a correspond to deletion and insertion of a.
  • 0 matches the empty string (this is probably useless...).

Warning: When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : a) not (: a). Omitting the white-space breaks the scanning of the grammar (this might be fixed in the future).

By concatenating pairs, one can build longer regular expressions matching strings of pairs. If the alphabet is declared

Alphabet
a e N:m N:n ;
then the regular expression a N: e will match a N:m e and a N:n e.
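
The expansion of a pair-construct against the alphabet can be sketched in Python. This is only an illustration of the matching described above, not how hfst-twolc works internally; the names expand and matches are hypothetical helpers, and only simple concatenations of tokens are handled.

```python
from itertools import product

# The alphabet "a e N:m N:n" from the text, as (input, output) pairs.
ALPHABET = [("a", "a"), ("e", "e"), ("N", "m"), ("N", "n")]

def expand(token, alphabet=ALPHABET):
    """Return the alphabet pairs matched by a token like 'a', 'N:', ':m' or '?'."""
    if token == "?":
        return list(alphabet)
    inp, sep, out = token.partition(":")
    if not sep:
        out = inp  # 'a' abbreviates the identity pair a:a
    # An empty side or '?' matches any symbol on that side.
    return [(i, o) for (i, o) in alphabet
            if inp in ("", "?", i) and out in ("", "?", o)]

def matches(expression, alphabet=ALPHABET):
    """List every pair-string matched by a concatenation of tokens."""
    choices = [expand(token, alphabet) for token in expression.split()]
    return [" ".join(i if i == o else f"{i}:{o}" for i, o in combo)
            for combo in product(*choices)]

print(matches("a N: e"))
```

With the alphabet above, matches("a N: e") lists the two pair-strings a N:m e and a N:n e.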

Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, always matches the empty string, as well.

Grouping becomes important when one uses unary regular expression operators. Unary operators like * have higher precedence than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b, one has to group the expressions a and b together, i.e. [ a b ]*.

There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be a N:n N:m e and let R denote a regular expression. The unary operators are:

  • The power-operator ^INTEGER, which is equivalent to concatenating the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a (NOT IMPLEMENTED for some reason... coming soon).
  • The containment-operator $. The regular expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a | N:n | N:m | e ]* a [ a | N:n | N:m | e ]*, using the alphabet defined above.
  • The exact containment-operator $. is similar to the containment-operator, but the matching strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n | N:m | e ]* a [ N:n | N:m | e ]*, using the alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair that is not matched by R. E.g. \a is equivalent to [ N:n | N:m | e ] with the alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular expression R contains all strings not matched by R.
  • The Kleene-star *. The language R* matches any string which is the concatenation of any number of strings from R. Note that the empty string, which is the concatenation of zero strings, is also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator + resembles *, but only matches strings which are the concatenation of a positive number of strings from R. Consequently R+ matches the empty string iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.
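
The difference between term-complement, containment and exact containment can be illustrated with a small Python sketch. It assumes the four-pair alphabet a N:n N:m e used in the examples above and only handles the case where R is a single pair; the function names are hypothetical.

```python
# The example alphabet, with each pair written as a token.
ALPHABET = {"a", "N:n", "N:m", "e"}

def term_complement(pairs):
    """\\R for a set of single pairs: every alphabet pair not in R."""
    return ALPHABET - set(pairs)

def contains(pair_string, pair):
    """$pair: the pair occurs at least once in the pair-string."""
    return pair in pair_string.split()

def contains_once(pair_string, pair):
    """$.pair: the pair occurs exactly once in the pair-string."""
    return pair_string.split().count(pair) == 1

print(sorted(term_complement({"a"})))
print(contains("e a N:n a", "a"), contains_once("e a N:n a", "a"))
```

Note that term_complement returns a set of single pairs, whereas negation (~) would describe a set of whole strings; the sketch does not attempt the latter.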

In addition to unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.

Let R and S be regular expressions. The binary operators are:

  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators were to bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.
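
The effect of the binding direction can be checked with ordinary sets. The sketch below models the language of the expression a as a one-element Python set, which is an assumption made purely for illustration.

```python
# The language matched by the expression a: one pair-string, "a".
A = frozenset({"a"})

# Left grouping, as hfst-twolc reads a - a - a: [ a - a ] - a
left_grouping = (A - A) - A

# Right grouping, for comparison: a - [ a - a ]
right_grouping = A - (A - A)

print(left_grouping)   # the empty language
print(right_grouping)  # the language of a
```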

Operator Precedence

The operators in hfst-twolc have different precedence. A rule of thumb for precedence: unary operators bind strongest, then concatenation and finally binary operators. The constructions [ ... ] and ( ... ) override other precedences.

Operators ordered by precedence from strongest to weakest:

  1. Unary operators: ^INTEGER, $, $., \, ~, *, +
  2. Concatenation
  3. Binary operators: |, &, -

E.g. ~a^3 b | c d* is interpreted as

[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]

The Alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.

Every symbol referred to in any of the rules has to be declared in the alphabet. Otherwise an error message will be issued.

Any non-empty string of non-white-space UTF-8 characters that isn't a reserved word is a valid alphabet-character. For now this means that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of Reserved Words below.

An example of an alphabet is

Alphabet

! The alphabet should contain all symbols used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z    N:n N:m ;

Diacritics

The morpho-phonological description of a language may contain symbols, which

  • act as cues for certain phonological rules to act,
  • are irrelevant for all other rules and
  • should not be present in the phonological representation of word-forms.

E.g. markers for syllable-boundaries and all kinds of markers appended to word-forms by the lexicon may be such symbols.

It's easiest to declare such symbols diacritics in hfst-twolc. This is done by mentioning them in the section Diacritics, which may look like

Diacritics

      ! The symbol . marks a syllable-boundary.
      . ;

Diacritics have the following properties:

  • They always correspond to 0 on the output-side.
  • All diacritics that aren't explicitly mentioned in a rule are invisible to that rule.

E.g. given the diacritics-declaration above and the set Vowel given in the section Sets below, the rule

I:j <=> Vowel _ Vowel ;
allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.
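
One way to picture the invisibility of diacritics is to drop the diacritic pairs before the context of a rule is checked. The Python sketch below does this for the correspondence above; it is a simplified illustration under that assumption, not the compiler's actual mechanism, and the helper names are hypothetical.

```python
DIACRITICS = {"."}
VOWELS = set("aeiouy")

def parse(text):
    """Turn 'k .:0 I:j' into [('k', 'k'), ('.', '0'), ('I', 'j')]."""
    pairs = []
    for token in text.split():
        inp, sep, out = token.partition(":")
        pairs.append((inp, out if sep else inp))
    return pairs

def hide_diacritics(pairs):
    """Remove the pairs whose input symbol is a declared diacritic."""
    return [p for p in pairs if p[0] not in DIACRITICS]

def between_vowels(pairs, k):
    """True if the pair at index k sits between two identity vowel pairs."""
    def is_vowel(pair):
        return pair[0] == pair[1] and pair[0] in VOWELS
    return 0 < k < len(pairs) - 1 and is_vowel(pairs[k - 1]) and is_vowel(pairs[k + 1])

raw = parse("v i i k .:0 k o .:0 I:j a")
visible = hide_diacritics(raw)
print(between_vowels(raw, raw.index(("I", "j"))))          # context blocked by .:0
print(between_vowels(visible, visible.index(("I", "j"))))  # context o _ a is seen
```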

Rule-variables

This section exists so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables needed to be declared, but this isn't mandatory in HfstTwolC 2.0.

Rules may contain variables. Any variable used can be declared in the Rule-variables section.

An example

Rule-variables

        Cx Cy Cz Vx Vy ;

Sets

The second part of the grammar specifies named character-sets like

Vowel  = a e i o u y    ;

Sets may be used in rules as a short-hand for collections of character-pairs.

One might want to write a rule stating that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes which are realised as vowels. This could be accomplished by the rule

t:ө <= :Vowel _ :Vowel ;

The construction :Vowel will match any pair, used in some rule, where the output symbol is a vowel.

It is possible to define a set having the same name as an alphabet character. There is no guarantee what will happen, if this is done.

Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. All sets may be used in definitions and all definitions, which have been made before a particular definition, may be used as a part of that definition.

It is possible to define a named regular expression having the same name as a set or alphabet character. There is no guarantee what will happen, if this is done.

Rules

Two-level rules consist of a center, a rule-operator and contexts.

The center-language (C) is

  • a character-pair (e.g. a:b),
  • a more general pair-construct of a single character (e.g. a: or :a),
  • a set construct like a:S, where S is a symbol set, or
  • a disjunction of such centers (e.g. a:b | b: | c:d | a:S).

A context consists of two regular expressions (Li and Ri) separated by an underscore. Schematically

C OP L1 _ R1 ;
     L2 _ R2 ;
       ...
     Ln _ Rn ;
A rule has to have at least one context and it may have as many as are needed.

A rule with variables is a rule where some of the characters in character-pairs are variables, not actual alphabet characters. A rule with variables has to have an additional so-called where-part, which shows how the variables in the rule should be instantiated.

Ordinary Two-Level Rules

Two-level rules are constraints regulating the distribution of the pairs in their center-language according to the rule-operator and contexts given. Four different kinds of rule-operators may be used in hfst-twolc:

<=, =>, <=> and /<=
The final context, which is compiled into the transducer representing the two-level rule, is the union of the contexts given.

Right-arrow rules constrain the distribution of a symbol-pair by specifying, that it may only occur in a specific context (or some specific contexts). Let the set V be the set of vowels in some language. An example of a right-arrow rule is

I:j => :V _ :V ;
It states that the input-character I can be realised as j only in a context where it is surrounded by output vowels. The rule doesn't constrain the distribution of any other pairs I:X, nor does it constrain the distribution of pairs X:j, where X is something other than I. It simply states that if the pair I:j occurs, it has to occur between two output vowels.

The context :V _ :V in the example is automatically extended to a so called total context, by hfst-twolc. This means that, when the rule is compiled, the context will become ?* :V _ :V ?*. This applies to all kinds of rule-operators.
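
The constraint expressed by the right-arrow example can be sketched as a check over pair-strings. The Python function below is an illustration only: the vowel set and the function names are assumptions, and the scan over every position plays the role of the total context ?* :V _ :V ?*.

```python
VOWELS = set("aeiouy")  # assumed contents of the set V

def output_vowel(pair):
    """True if the output side of an (input, output) pair is a vowel."""
    return pair[1] in VOWELS

def satisfies_right_arrow(pairs, center=("I", "j")):
    """Check I:j => :V _ :V : every I:j must sit between two output vowels."""
    for k, pair in enumerate(pairs):
        if pair != center:
            continue  # the rule says nothing about other pairs
        before = k > 0 and output_vowel(pairs[k - 1])
        after = k + 1 < len(pairs) and output_vowel(pairs[k + 1])
        if not (before and after):
            return False
    return True

print(satisfies_right_arrow([("a", "a"), ("I", "j"), ("a", "a")]))
print(satisfies_right_arrow([("k", "k"), ("I", "j"), ("a", "a")]))
```

A pair-string containing I:i next to a consonant is still accepted, since the rule only restricts the pair I:j.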

Left-arrow rules constrain the set of output-characters corresponding to an input-character in some context. An example of a left-arrow rule is

N:m <= _ p: ;
It states that an input-character N has to be realized as the output-character m if it is followed by some pair with input-character p. The rule doesn't constrain the realizations of the input-character N in any context other than the one specified, so it never disallows any occurrences of the pair N:m. It does disallow all other pairs N:X in the context _ p:, though.
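
A corresponding sketch for the left-arrow example N:m <= _ p: is given below. Again this is only an illustrative check over (input, output) pairs with a hypothetical function name, not the compiled transducer.

```python
def satisfies_left_arrow(pairs):
    """Check N:m <= _ p: : an input N followed by an input p must surface as m."""
    for k, (inp, out) in enumerate(pairs):
        followed_by_p = k + 1 < len(pairs) and pairs[k + 1][0] == "p"
        if inp == "N" and followed_by_p and out != "m":
            return False
    return True

print(satisfies_left_arrow([("N", "m"), ("p", "p")]))  # required realization
print(satisfies_left_arrow([("N", "n"), ("p", "p")]))  # N:n is disallowed here
print(satisfies_left_arrow([("N", "m"), ("t", "t")]))  # N:m is free elsewhere
```

The check looks only at the input side of the following pair, so N:n is also rejected before p:m.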

Left-arrow rules differ from right-arrow rules because they are asymmetric with regard to the input- and output-level of pair-strings. The left-arrow example above doesn't limit the input-character of a pair preceding p:; it only limits the output-character if the input-character is N. Such an asymmetry is not present in right-arrow rules, which limit a particular pair to a particular kind of context.

Left/right-arrow rules give necessary and sufficient conditions for the realization of an input-character as some output-character. An example of a left/right-arrow rule is

K:' <=> :Vowel :a _ :a ClosedOffSet;
which states that the morpho-phoneme K is realized as ' exactly in contexts where a vowel and an output a precede and an output a and a closed syllable-offset follow (this describes a convention of Finnish orthography stemming from consonant gradation). Any left/right-arrow rule is equivalent to the joint effect of the corresponding left- and right-arrow rules. Hence the example is equivalent to the pair of rules
K:' <= :Vowel :a _ :a ClosedOffSet;
and
K:' => :Vowel :a _ :a ClosedOffSet;
Actually the alternation K:' isn't constrained to a context, where two a:s precede. It happens between any two like vowels. To describe this nicely, without using five very similar rules, one needs rule-variables, which will be presented shortly.

Prohibition rules disallow the realization of an input-character as some output-character in some contexts. Let again V denote the set of vowels. An example of a prohibition rule is

I:i /<= :V _ :V ;
which states, that the input-character I may not be realized as i between output-vowels.

Like right-arrow rules, prohibition rules are symmetric with respect to the input- and output-level of pair-strings. In fact it is often possible to state a particular constraint both as a prohibition rule concerning some pair and as a left-arrow rule concerning another. If the input-character I may only be realized as i or j, then the rules

I:i /<= :V _ :V ;
and
I:j <= :V _ :V ;
state exactly the same constraint. Still, if the number of realizations is greater, it may be much easier to state the constraint using one of the operators than the other.
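
The claimed equivalence between a prohibition rule on I:i and a left-arrow rule on I:j can be checked by brute force over short pair-strings. The Python sketch below assumes a toy alphabet in which I is realised only as i or j and the output vowels are a and i; all names are hypothetical.

```python
from itertools import product

# Toy alphabet: I has exactly the two realizations i and j.
ALPHABET = [("a", "a"), ("k", "k"), ("I", "i"), ("I", "j")]
VOWELS = {"a", "i"}

def between_output_vowels(pairs, k):
    return (0 < k < len(pairs) - 1
            and pairs[k - 1][1] in VOWELS
            and pairs[k + 1][1] in VOWELS)

def prohibition_ok(pairs):
    """I:i /<= :V _ :V : the pair I:i may not occur between output vowels."""
    return not any(p == ("I", "i") and between_output_vowels(pairs, k)
                   for k, p in enumerate(pairs))

def left_arrow_ok(pairs):
    """I:j <= :V _ :V : an input I between output vowels must surface as j."""
    return not any(p[0] == "I" and p[1] != "j" and between_output_vowels(pairs, k)
                   for k, p in enumerate(pairs))

def rules_agree(max_length=4):
    """Compare both checks on every pair-string up to max_length pairs."""
    for n in range(max_length + 1):
        for pairs in product(ALPHABET, repeat=n):
            if prohibition_ok(pairs) != left_arrow_ok(pairs):
                return False
    return True

print(rules_agree())
```

If I had a third realization, the two checks would no longer coincide, which is why the choice of operator matters when there are more realizations.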

Rules with variables

As an easy short-hand for defining a (possibly large) set of similar two-level rules, rule-variables have been included in hfst-twolc. Consider the following rule, which is needed for gradation of stops in Finnish

"Gradation of k to '"
K:' <=> Vowel Vx _ Vx ClosedOffset ; where Vx in Vowel ;
It deals with the realization of the morpho-phoneme K, when it is the onset of a closed syllable, which is preceded by an open syllable with a two-vowel nucleus. The rule states, that K is realized as ' (a glottal stop), if the nucleus of the preceding syllable ends with the same vowel, which figures as the nucleus of the closed syllable.

The rule above couldn't be stated as a single rule without variables, since there are no other mechanisms for specifying dependencies between parts of the contexts of two-level rules. The use of the variable Vx is said to match the occurrences of the set Vowel.

It is possible to match occurrences of variables from different sets as well. Consider the following rule, which also deals with gradation of stops in Finnish

"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ; where Cx in ( K P T )
                                  Cy in ( k p t )
                            matched;
The rule states, that the morpho-phonemes K, P, T vanish, when they serve as the onset of a closed syllable and are preceded by a surface k, p or t respectively. Here the occurrences of the variable Cx are matched with those of Cy. For instance, nothing is said about an input K preceded by an output p. The rule is only concerned with input-level characters K preceded by output-level characters k.
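
Matched instantiation can be pictured as plain text substitution, pairing the n-th value of each variable. The Python sketch below expands the rule above into its three sub-cases; the function name is hypothetical, and simple string replacement is an assumption that only works because no variable name occurs inside another symbol of this rule.

```python
def instantiate_matched(template, bindings):
    """Expand a rule template by pairing the n-th values of all variables."""
    names = list(bindings)
    lengths = {len(bindings[name]) for name in names}
    if len(lengths) > 1:
        # Matching is impossible if the value lists differ in length.
        raise ValueError("matched variables need value lists of equal length")
    rules = []
    for values in zip(*(bindings[name] for name in names)):
        rule = template
        for name, value in zip(names, values):
            rule = rule.replace(name, value)
        rules.append(rule)
    return rules

for rule in instantiate_matched("Cx:0 <=> :Cy _ ClosedCoda ;",
                                {"Cx": ["K", "P", "T"], "Cy": ["k", "p", "t"]}):
    print(rule)
```

The expansion contains K:0 after :k, P:0 after :p and T:0 after :t, but no mixed case such as K:0 after :p.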

Occurrences of variables are matched by default. If you don't want this to happen, you may either use several where-parts to govern different variables or replace the keyword matched by freely.

Generalized Context-Restrictions (NOT IMPLEMENTED YET)

Generalized context-restrictions allow the definition of rules with a more general center-language than normal two-level rules. They also let the user constrain the application of a particular rule to some contexts.

Weighted rules (NOT IMPLEMENTED YET)

It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation.

Error-Messages and Warnings

If the grammar given to hfst-twolc contains statements which

  • don't conform to the syntax specified in this manual,
  • are illogical,
  • result in rule-transducers whose intersection might be empty or
  • over-shadow other statements,

error messages or warnings will be issued. Statements which make it impossible to complete the compilation of the grammar lead to error-messages and abort the compilation-process. Statements that over-shadow other statements, or may lead to rule-sets whose intersection is empty, lead to warning-messages.

Error-Messages

Errors in hfst-twolc are divided into two categories: syntax errors and logical errors.

Syntax Errors

A syntax-error is given when the input-file violates the syntax-specifications in this manual. When this happens, hfst-twolc gives an error-message and the compilation-process ceases without writing to the output-file. An example of a syntax-related error-message is

ERROR ON LINE 79:
syntax error, unexpected CENTER_MARKER, expecting DIFFERENCE or INTERSECTION or UNION or RIGHT_SQUARE_BRACKET
Cx:Cy <=>  [ h | Liquid | Vowel:   _ Vowel: Cons: [ Cons: | #:0 ] ;
                                    ^ HERE
Aborted.

An error-message consists of

  • the number of the line where the error occurred,
  • a statement of which token caused the compilation to halt and what kind of token was expected,
  • the line which contained the error and
  • a marker which points out the place where the error occurred.

Note that it is not always possible to say exactly where the actual error was. Sometimes even the line on which the error occurs can't be singled out.

The correspondences between tokens and token-names should be pretty clear, but here's a list:

Token-name              Token
ALPHABET_DECLARATION    Alphabet
DIACRITICS_DECLARATION  Diacritics
VARIABLE_DECLARATION    Rule-Variables
DEFINITION_DECLARATION  Definitions
SETS_DECLARATION        Sets
RULES_DECLARATION       Rules
WHERE                   where
MATCHED                 matched
MIXED                   mixed
IN                      in
NEWLINE                 A newline.
RULE_NAME               A quoted string of characters (except ").
AND                     and
STAR                    *
PLUS                    +
COMPLEMENT              ~
TERM_COMPLEMENT         \
CONTAINMENT_ONCE        $.
CONTAINMENT             $
ANY                     ?
UNION                   |
INTERSECTION            &
POWER                   ^
DIFFERENCE              -
NUMBER                  A positive or negative integer.
EPSILON                 0
LEFT_SQUARE_BRACKET     [
RIGHT_SQUARE_BRACKET    ]
LEFT_BRACKET            (
RIGHT_BRACKET           )
LEFT_RESTRICTION_ARROW  /<=
LEFT_ARROW              <=
RIGHT_ARROW             =>
LEFT_RIGHT_ARROW        <=>
PAIR_SEPARATOR_BOTH     A : preceded by white-space and followed by something that isn't a SYMBOL.
PAIR_SEPARATOR_RIGHT    A : preceded by white-space and followed by a SYMBOL.
PAIR_SEPARATOR_LEFT     A : preceded by a SYMBOL and followed by something that isn't a SYMBOL.
PAIR_SEPARATOR          A : preceded and followed by a SYMBOL.
EOL                     ;
EQUALS                  =
CENTER_MARKER           _
SYMBOL                  A sequence of characters, where every special character (i.e. one with a special meaning like [, ;, or %) has been quoted. A symbol may not contain newlines!

Logical Errors

hfst-twolc currently only gives one kind of logical error. Let a grammar contain the following rule

"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ; where Cx in ( K P T )
                                  Cy in ( k p )
                            matched;
Here the sets ( K P T ) and ( k p ) are of unequal length, so it is impossible to match the variables Cx and Cy. An error-message is issued:
ERROR ON LINE 87:
Cx and Cy can't be matched since they correspond to lists of un-equal lengths!
                            matched;
                                    ^ HERE
Aborted.

Warnings

The following is an example of a warning given by hfst-twolc

WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
  ^ HERE
The program attempts to report the number of the line, which gives the warning and also point to the place, which gives the warning. Note, that the place and line given may not be accurate. When they're not, the problem is often on the previous line.

The number [1] means, that this is a warning of type 1. There are seven types of warnings. These are

[1] A pair X:Y was used in the grammar, but it wasn't declared in the alphabet and neither X, nor Y was the name of a set.

[2] The same set is defined twice, or a set is defined, which has the same name as a symbol in the alphabet.

[3] The same definition is declared twice, or a definition has the same name as a set or a symbol in the alphabet.

[4] The same rule-name is used for two rules.

[5] A construct X: or :X was used, where X is a symbol or a set. The expression didn't match a single pair in the alphabet.

[6] The construction R^i was used, where i was not a positive integer.

[7] Warning for a pair x:y where x is a diacritic and y is non-zero. Diacritics are always realised as zero, so y will be discarded.

Resolution of Conflicts between the Rules

A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proved beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These so-called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.

A situation where one rule accepts a pair-string and another rejects it shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. Normal rule-interaction constrains the surface-realizations of an input-form, but does not lose all of them. In contrast, rule-conflicts often filter away some input-forms completely. There are many kinds of conflicts, but for the time being only right-arrow conflicts and left-arrow conflicts are automatically resolved by hfst-twolc.

Unless hfst-twolc is run with the command-line parameter --no-report, it will report all rule-conflicts it observes, and if it is run with the parameter --resolve, it will resolve the conflicts.

The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen, Koskenniemi and Kaplan 1987.

Right-Arrow Conflicts

Right-arrow conflicts occur between right-arrow rules (or left-right-arrow rules) with identical centers. Consider the rules

"Rule 1"
a:b => c _ ;

"Rule 2"
a:b => d _ ;

Since Rule 1 requires that all pairs a:b be preceded by c, and Rule 2 that they be preceded by d, their intersection disallows all occurrences of a:b. This may be considered an accident.

When hfst-twolc encounters rules that are in right-arrow conflict, it reports the conflict

There is a => conflict between the rules Rule1 and Rule2 with respect to the center a:b. Resolving the conflict by joining contexts.

and resolves it by collapsing the rules into a single rule

a:b => c _ ; d _ ;
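
Joining the contexts of right-arrow rules that share a center can be sketched as follows. The representation of a rule as a (center, contexts) tuple and the function name are assumptions made for this illustration; the sketch does not reflect the internal data structures of hfst-twolc.

```python
def join_right_arrow_rules(rules):
    """Merge right-arrow rules with identical centers by collecting their contexts."""
    merged = {}
    for center, contexts in rules:
        joined = merged.setdefault(center, [])
        for context in contexts:
            if context not in joined:
                joined.append(context)
    return merged

rules = [
    ("a:b", ["c _"]),  # "Rule 1"
    ("a:b", ["d _"]),  # "Rule 2"
]
for center, contexts in join_right_arrow_rules(rules).items():
    print(center, "=>", " ; ".join(contexts), ";")
```

After the merge the pair a:b is allowed after c or after d, instead of being ruled out everywhere.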

Left-Arrow Conflicts

Left-arrow conflicts occur between left-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules

"Rule 3"
a:b <= c _ ;

"Rule 4"
a <= X _ ;

Rule 3 requires that an input a be realised as b following c. The problem is that Rule 4 requires that it be realised as a following any pair in X:X, among others c. Hence the total effect of the rules is to disallow the occurrence of a pair with input-character a after the pair c.

In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general X _. This might not be the case though. The contexts might be such, that neither is a sub-context of the other. This makes left-arrow-conflicts more complicated than right-arrow-conflicts.

The approach taken in hfst-twolc is to warn about all left-arrow conflicts, but only fix those left-arrow conflicts where one of the rules is a special case of the other. The conflict is fixed by modifying the more general rule so that it only applies in contexts where the more specific rule doesn't apply. In the example above, the resolution-process doesn't affect Rule 3, but changes Rule 4 so that it becomes equivalent to the rule

 
a <= d _ ;
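
The resolution step can be sketched with the left contexts modelled as sets of symbols. This is a deliberately simplified picture (real contexts are regular expressions, not symbol sets), and the function name is hypothetical.

```python
def resolve_left_arrow(specific_context, general_context):
    """Return the narrowed general context, or None if neither rule is a special case."""
    if not specific_context <= general_context:
        return None  # such conflicts are only reported, not fixed
    return general_context - specific_context

X = {"c", "d"}
print(resolve_left_arrow({"c"}, X))       # Rule 4 is narrowed to the context d _
print(resolve_left_arrow({"c", "e"}, X))  # no sub-context relation: left unresolved
```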

Left/Right -Arrow Conflicts

Besides left- and right-arrow conflicts, there are other kinds of unfortunate interactions between rules. Currently hfst-twolc neither reports, nor fixes such interactions, which makes it important for the grammar-writer to be aware of the possibility of them. Left/right -arrow conflicts involve operators of different types and come in two flavors.

Rules with Identical Centers

Consider the rules

a:b => c _ ;
and
a:b <= d _ ;
The first rule requires that the a:b pair be immediately preceded by the pair c. The second rule requires that a always be realised as b when it is preceded by d. Together the rules prohibit the occurrence of an input-character a after the pair d.

Rules with Different Centers

Consider the rules

a:b => c _ ;
and
a <= c _ ;
These rules together prohibit the occurrence of the pair a:b anywhere, since a has to be realized as a after c, but this is the only position where a could be realised as b.

List of Reserved Words

Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       /<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^       #
%
The words and constructs may be used in rules by quoting with %. E.g. %? means question-mark, not any character-pair defined in the alphabet and %Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example %Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.

A Test-Tool for Grammars

Differences from Xerox twolc

This section contains a list of features, which differ between hfst-twolc and Xerox twolc.

Unimplemented Features in hfst-twolc

This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.

Partial implementations in hfst-twolc

Since this is an alpha-version of hfst-twolc, there are many features that have limited functionality.

The where ... ( matched | freely | mixed ) construction is implemented, but is partial in some respects. You can either write a rule with a variable Vx

"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in Vowel ;
or write
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in ( a e i o u y   ) ;
but you can't embed the Vowel set in the range, i.e. rules like
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in (Vowel) ;
don't work.

There is no support for either the freely or mixed options. E.g.

X:Y => a _ ; where X in (s t) Y in (u v);
means the same as
X:Y => a _ ; where X in (s t) Y in (u v) matched;
i.e. is equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ; 
Though there is no support for freely, the option can easily be simulated by writing the rule
X:Y => a _ ; where X in (s t) and Y in (u v);
This makes the rule equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ;
s:v => a _ ;
t:u => a _ ; 
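
The difference between the two instantiation modes corresponds to the difference between pairing values position by position and taking every combination. The Python sketch below illustrates this for the example above; the function names are hypothetical.

```python
from itertools import product

def matched(xs, ys):
    """Pair the n-th value of X with the n-th value of Y."""
    return list(zip(xs, ys))

def freely(xs, ys):
    """Combine every value of X with every value of Y."""
    return list(product(xs, ys))

def as_rules(pairs, context="a _"):
    """Write each (X, Y) instantiation as a right-arrow rule."""
    return [f"{x}:{y} => {context} ;" for x, y in pairs]

print(as_rules(matched(["s", "t"], ["u", "v"])))
print(as_rules(freely(["s", "t"], ["u", "v"])))
```

The order in which the sub-rules are listed is irrelevant, since the result is the intersection of all of them.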

Rules defined with variables may easily come into conflict with each other. For now this is treated like any other rule-conflict. Consider the rule

x:y => A _ A ; where A in (s t);
The subcases
x:y => s _ A ;
x:y => t _ A ;
are in a right-arrow conflict with each other. This is easily solved by conflict-resolution. The case of left-arrow rules is less fortunate. They may easily come into unresolvable conflict with each other when the center involves variables.

Conflict-resolution may be very slow.

Substitution of values for variables may produce new pairs which haven't been declared in the alphabet. For now hfst-twolc can only warn about such new pairs occurring on the left side of the rule-operator.

The OpenFst-implementation may be very slow.

Permanent differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.
  • All rule-variables have to be declared in the Rule-variables section in the header of the grammar.

References

  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006.

Obtaining the program and installing

hfst-twolc is a part of HfstCommandLineTools.


-- MiikkaSilfverberg - 13 May 2008