hfst-xfst

Purpose

Parse weighted Xerox XFST scripts or execute XFST commands in interactive mode. Input files are always treated as UTF-8.

Usage

The help message:

Usage: hfst-xfst [OPTIONS...]
Compile XFST scripts or execute XFST commands interactively

Common options:
  -h, --help             Print help message
  -V, --version          Print version info
  -v, --verbose          Print verbosely while processing
  -q, --quiet            Only print fatal erros and requested output
  -s, --silent           Alias of --quiet

Xfst-specific options:
  -e, --execute=CMD        Execute command CMD on startup
  -f, --format=FMT         Write result using FMT as backend format
  -F, --scriptfile=FILE    Read commands from FILE, and quit
  -l, --startupfile=FILE   Read commands from FILE on startup
  -p, --pipe-mode[=STREAM] Control input and output streams
  -r, --no-readline        Do not use readline library for input
  -w, --print-weight       Print weights for each operation

Option --execute can be invoked many times.
If FMT is not given, OpenFst's tropical format will be used.
The possible values for FMT are { foma, openfst-tropical, openfst-log, sfst }.
Readline library, if enabled when configuring, is used for input by default.
Input files are always treated as UTF-8.

STREAM can be { input, output, both }. If not given, defaults to {both}.
If input file is not specified with -F, input is read interactively line by
line from the user. If you redirect input from a file, use --pipe-mode=input.
--pipe-mode=output is ignored on non-windows platforms.

Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>

The XFST formalism

The formalism used is the same as described in the book Finite State Morphology (Beesley & Karttunen 2003) with added support for weights. The book's web page includes code examples, answers to frequently asked questions and errata. For more information about the commands and regular expression syntax, see chapter 3 of the book.

Operator precedence

There have been some changes in operator precedence in different releases of Xerox's finite-state tools. Note that the current version of Xerox tools (released in 2011) does not support expressions such as \a:b any more, as does not HFST either. (The crossproduct operator outranks the term complement, so brackets are needed for correct interpretation: [\a]:b.)

Latest release of HFST (3.8.3) still supports expressions where the question mark has been omitted on either side of the colon. However, Xerox has redefined the colon as a high-precedence cross-product operator, so the tools work differently at the moment:

expression HFST interpretation Xerox interpretation
a: b a:? b syntax error
a :b a ?:b syntax error
a : b a ?:? b a:b

Starting from the next release, HFST will adhere to the Xerox formalism, so omitting the question mark will no longer be supported. Support for omitting the question mark has already been removed from the code in the svn repository starting from revision 4437.

Examples

Here is an example of an interactive session with hfst-xfst. We first set variable print-weight ON, so that weights will be shown. We can achieve the same with option --print-weight. We can even specify the precision at which the weights are shown.

hfst[0]: set print-weight ON
hfst[0]: set precision 6

We create a transducer that recognizes "cat", "dog" and "mouse" with respective weights 1, 2 and 3. It becomes the topmost transducer in the stack.

hfst[0]: regex {cat}::1|{dog}::2|{mouse}::3;
10 states, 11 arcs

We then print all strings recognized by the transducer.

hfst[1]: print words
cat     1.000000
dog     2.000000
mouse   3.000000

Next we save the stack (in this case, only one transducer) to file animals.hfst and clear it.

hfst[1]: save stack animals.hfst
hfst[1]: clear stack

We load the saved stack.

hfst[0]: load stack < animals.hfst
10 states, 11 arcs

We then create another transducer which recognizes "elephant" with weight 4. It becomes the topmost transducer in the stack.

hfst[1]: regex {elephant}::4;
9 states, 8 arcs

We disjunct the two topmost transducers. As a result, they are popped from the stack and the resulting disjunction is pushed to the stack.

hfst[2]: disjunct
16 states, 18 arcs

Finally, we print again all the words recognized by the topmost transducer.

hfst[1]: print words
cat     1.000000
dog     2.000000
mouse   3.000000
elephant 4.000000

Weights

Weights can be defined in regular expressions and they are also supported by all relevant XFST operations. Weights are not shown by default, set them on with command set print-weight ON or option --print-weight or -w. Note that all weights will be zero if FORMAT is other that openfst-tropical (the default format).

Weights are not shown in print net as the XFST net formalism does not support them, use write att or write prolog instead.

Weights in regular expressions

The original unweighted format for regular expression for describing two-level automata is detailed in the book "Finite-State Morphology" by Kenneth R. Beesley and Lauri Karttunen. Hfst-xfst basically uses the same regexp syntax with added support for weights.

Weights in rules

At least the following rules support weights: ->, ->, @->, and @>. Weights can be defined on left, right or both sides of the rule. Weights in contexts are ignored. Note: weighted cyclic rules often result in transducers that are not determinizable. In that case, you must specify that weights are encoded when performing determinization/minimization:

set encode-weights ON

The example below maps one or more a:s into a single A between two c:s so that each a adds a weight 1 and the resulting A a weight 2.

hfst[0]: regex [a::1]+ -> [A::2] || c __ c ;
4 states, 14 arcs
hfst[1]: apply up
apply up> cac
cAc     3.00000
apply up> caac
cAc     4.00000
apply up> caaac
cAc     5.00000

And the same with an optional mapping:

hfst[0]: regex [a::1]+ (->) [A::2] || c __ c ;
3 states, 11 arcs
hfst[1]: apply up
apply up> cac
cac     0.00000
cAc     3.00000
apply up> caac
caac    0.00000
cAc     4.00000
apply up> caaac
caaac   0.00000
cAc     5.00000

This example performs shortest match mapping of any number of a:s into a single A between c [a*] and [a*] c: Note how shortest mapping picks the longest contexts so that a maximum number of possible mappings gets done and, as a result, the weight is biggest.

hfst[0]: regex [a::1]+ @> [A::2] || c [a*] __ [a*] c ;
4 states, 14 arcs
hfst[1]: apply up
apply up> cac
cAc     3.00000
apply up> caac
cAAc    6.00000
apply up> caaac
cAAAc   9.00000

The example below shows how to use parallel rules with weights. Let's assume that there is a language where a most likely gets mapped into b between two c:s. There is also another mapping, a goes into B, which is not as likely. It is also possible that a stays as a, but this is the least probable case. This feature of the language could be described as follows:

regex a -> b::1 || c _ c ,, a -> B::2 || c _ c ,, a -> a::3 || c _ c ;

If we perform lookup for string acacaca, we get the results

acbcbca 2.00000
acBcbca 3.00000
acbcBca 3.00000
acBcBca 4.00000
acacbca 4.00000
acbcaca 4.00000
acBcaca 5.00000
acacBca 5.00000
acacaca 6.00000

Note that the initial and final a:s are not changed since their context does not match to the rules. The most probable output is the one where all the a:s are transformed into b:s and the least probable the one where no transformation has happened.

Optimized lookup

While the transducer format (defined with --format FORMAT) cannot be other than {openfst-tropical, openfst-log, sfst, foma}, hfst-xfst offers a limited support for HFST's optimized lookup format (OL). Three basic operations also work for OL transducers:

operation explanation
load stack < filename read transducers from file and place them to the stack
save stack filename save stack to a file
apply up (foo) look up word(s) in the topmost transducer in the stack

In addition, two functions not present in the original XFST formalism are added:

operation explanation
lookup-optimize convert all transducers in stack to optimized lookup format
remove-optimization convert all transducers in stack to the common format (defined with --format FORMAT)

If the transducers in stack are in OL format, all other operations will just give a warning message instructing the user to call remove-optimization.

Commands and variables

Commands and variables in hfst-xfst

Below is a list of XFST/foma commands implemented in hfst-xfst with short explanations.

command explanation note
apply down enter apply down mode (Ctrl-D exits)
apply up enter apply up mode (Ctrl-D exits)
apply down apply down to the top network on stack
apply up apply up to the top network on stack
apropos search help for
clear stack clears the stack
compact sigma removes "redundant" symbols from FSM
complete net completes the FSM
compose net composes networks on stack
concatenate net concatenates networks on stack
crossproduct net cross-product of top two FSMs on stack
define <r.e.> define a network
define (<v1,..,vn>) <r.e.> define function
determinize net determinizes top FSM on stack
echo echo a string
eliminate flag eliminate flag diacritics from the top network
eliminate flags eliminate all flag diacritics from the top network not in Xerox's XFST
epsilon-remove net remove all epsilon transitions from top FSM on stack not in foma
help print help message about a command
ignore net applies ignore to top two FSMs on stack
inspect net traverse interactively through top FSM on stack not in foma
intersect net intersects FSMs on stack
invert net inverts top FSM
label net extracts all attested symbol pairs from FSM
load defined Restores defined networks from file
load stack Loads networks and pushes them on the stack
lookup-optimize convert all transducers in stack to optimized lookup format
lower-side net takes lower projection of top FSM
minimize net minimizes top FSM
minus net subtraction of top two FSMs
name net names top FSM
negate net complements top FSM
one-plus net Kleene plus on top FSM
pop stack remove top FSM from stack
print defined prints defined symbols and functions
print directory print the contents of current directory
print labels print label pairs of top FSM not in foma
print label-tally print statistics about labels of top FSM not in foma
print longest-string print longest string recognized by top FSM not in foma
print longest-string-size print length of longest string recognized by top FSM not in foma
print lower-words (> filename) prints words on the lower side of top FSM (if cirular, follows max 1-2 loops)
print name prints the name of the top FSM
print net prints all(?) information about top FSM
print random-lower (N=15) (> filename) prints N random words from lower side
print random-upper (N=15) (> filename) prints N random words from upper side
print random-words (N=15) (> filename) prints random words from top FSM
print shortest-string (> filename) prints the shortest string of the top FSM (upper and lower side)
print shortest-string-size (> filename) prints length of shortest string (upper and lower side)
print sigma prints the alphabet of the top FSM
print stack print information about FSMs on the stack not in foma
print upper-words (> filename) prints words on the upper side of top FSM (if circular, follows max 1-2 loops)
print words (> filename) prints words of top FSM
prune net makes top network coaccessible
push (defined) (name) adds a defined FSM (or all, if no name given) to top of stack
quit exit program
read att read a file in AT&T FSM format and add to top of stack
read lexc read and compile lexc format file
read prolog reads prolog format file
(read) regex <r.e> compile regular expression and place it on the stack
read spaced-text compile space-separated words/word-pairs separated by newlines into an FST
read text compile a list of words separated by newlines into an automaton
remove-optimization convert all transducers in stack to the common format (defined with --format FORMAT)
reverse net reverses top FSM
rotate stack rotates stack
save defined save all defined networks to binary file
save stack save stack to binary file
set <ON/OFF> sets a global variable (see show variables)
show shows the value of the variable
shuffle net asynchronous product on top two(?) FSMs on stack
sigma net Extracts the alphabet and creates a FSM that accepts all single symbols in it
source read and compile script file
substitute defined X for Y substitutes defined network X at all arcs containing Y
substitute label A:B C:D E:F ... for X:Y substitutes label X:Y with A:B C:D E:F ... not in foma
substitute symbol X Y Z ... for A substitutes all occurrences of A in an arc with X Y Z ...
system execute a system command
test equivalent test if the top two FSMs are equivalent
test identity test if top FST represents identity relations only not in Xerox's XFST
test lower-bounded if look down is infinitely ambiguous
test lower-universal test if lower side is Σ*
test non-null test if top machine is not the empty language
test null test if top machine is the empty language (∅)
test overlap test if two top FSMs overlap not in foma
test sublanguage test if top FSM is a sublanguage of next FSM on stack not in foma
test upper-bounded if look up is infinitely ambiguous
test upper-universal test if upper side is Σ*
turn stack turns stack upside down
undefine remove from defined networks
union net union of top two FSMs
upper-side net upper projection of top FSM
write att (> filename) writes top network to AT&T format file/stdout
write prolog (> filename)? writes top network to prolog format file/stdout
zero-plus net Kleene star on top fsm

variable explanation note
assert quit if a test fails and quit-on-fail is ON not in foma
char-encoding character encoding used always UTF-8, cannot be changed
copyright-owner copyright owner Copyleft (c) University of Helsinki
encode-weights encode weights when minimizing
flag-is-epsilon treat flag diacritics as epsilons in composition The default is OFF.
harmonize-flags harmonize flag diacritics before composition
hopcroft-min ON = Hopcroft minimization, OFF = Brzozowski minimization The default is ON. Not in Xerox's XFST
lexc-minimize-flags if 'lexc-with-flags' == ON, minimize number of flags The default is OFF.
lexc-with-flags use flags to hyperminimize result from lexc files The default is OFF.
lookup-cycle-cutoff
obey-flags obey flag diacritics in `apply'
maximum-weight maximum weight allowed for paths printed in 'apply'
minimal minimize resulting FSMs
name-nets stores the name of the network when using 'define'
precision number of decimals printed for weights
print-foma-sigma print identities as '@'
print-pairs always print both sides when applying
print-space print spaces between symbols
print-sigma print the alphabet when printing network
print-weight show weights when printing words or networks
print-words-cycle-cutoff maximum number of times to follow loops when printing words
quit-on-fail Abort operations when encountering errors
quote-special Print special characters in double quotes
show-flags show flag diacritics in `apply'
verbose Verbosity of interface
xerox-composition treat flag diacritics as ordinary symbols in composition In Xerox, this is always the case.

Commands and variables not implemented in hfst-xfst

Commands and variables implemented in XFST (and possibly in foma) but missing from hfst-xfst:

{ add properties, alias, cleanup net, collect epsilon-loops, compile-replace lower, compile-replace upper, edit properties, list, print aliases, print arc-tally, print file-info, print flags, print label-maps, print list, print lists, print sigma-tally, print sigma-word-tally, print size, read properties, sort net, substring net, twosided flag-diacritics, unlist, write definition, write definitions, write dot, write properties, write spaced-text, write text }

{ char-encoding, directory, flag-is-epsilon, name-nets, random-seed, recode-cp1252, recursive-define, retokenize, sort-arcs, use-timer }