HFST: Tutorial for Interactive HFST Tools

Interactive HFST tools include hfst-xfst, hfst-proc and hfst-lookup.

Examples

All three tools are invoked from the command line and by default read their input from the user. Pressing Ctrl+C exits the program.

hfst-lookup

hfst-lookup is the simplest of the tools: it looks up words in a transducer file FILE, as shown below (lines beginning with > are user input):

hfst-lookup FILE
> cat
cat     chat    1.000000

> dog
dog     chien   2.000000

> DOG
DOG     DOG+?   inf

> mouse
mouse   mouse+? inf

For each word, the tool prints the results it can find in the transducer file and their weights. If a word is not found, +? is appended to the result and an infinite weight inf is printed.
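
Because unknown words carry the weight inf, the output can be post-processed with standard text tools. The following is a minimal sketch, assuming the three whitespace-separated columns shown above. The function name unknown_words is hypothetical, and the sample lines are copied from the example rather than produced by a live run of hfst-lookup:

```shell
# List the input words that hfst-lookup could not analyze.
# Assumes three whitespace-separated columns: input, result, weight.
unknown_words() {
    awk 'NF == 3 && $3 == "inf" { print $1 }'
}

# Sample output copied from the example above (not a live run).
sample='cat     chat    1.000000

dog     chien   2.000000

DOG     DOG+?   inf

mouse   mouse+? inf'

# Prints one unknown input word per line.
printf '%s\n' "$sample" | unknown_words
```

With a real transducer, the output of hfst-lookup could be piped into unknown_words in place of the sample text.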

hfst-proc

hfst-proc is a similar tool, but designed for text streams.

echo "Cat Dog mouse, cat DOG." | hfst-proc FILE --show-weights
^Cat/Chat~1~$ ^Dog/Chien~2~$ ^mouse/*mouse$, ^cat/chat~1~$ ^DOG/CHIEN~2~$.

Note that by default the tool recognizes words in both upper and lower case. Weights are printed only if --show-weights is used; the CG format, however, does not support weights. The output format is selected with the following options:

option          explanation
-p, --apertium  Apertium output format for analysis (default)
-C, --cg        Constraint Grammar output format for analysis
-x, --xerox     Xerox output format for analysis

The Apertium format keeps all punctuation and whitespace characters as they were in the input, whereas the CG and Xerox formats discard them:

echo "Cat Dog mouse, cat DOG." | hfst-proc FILE --cg --show-weights
"<Cat>"
        "chat"
"<Dog>"
        "chien"
"<mouse>"
        "*mouse"
"<cat>"
        "chat"
"<DOG>"
        "chien"

echo "Cat Dog mouse, cat DOG." | hfst-proc FILE --xerox --show-weights
Cat             1
Cat     Chat

Dog             2
Dog     Chien

mouse   +?

cat             1
cat     chat

DOG             2
DOG     CHIEN
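
In the Apertium-style output shown earlier in this section, each word is wrapped as ^surface/analysis$ and an unknown word has an analysis beginning with an asterisk. The sketch below extracts the unknown surface forms from such a line. It assumes exactly that layout; the function name unknown_tokens is hypothetical, and the sample line is copied from the example, not produced by running hfst-proc:

```shell
# Print the surface forms whose analysis starts with '*' (unknown words).
# Assumes the ^surface/analysis$ layout shown in the Apertium example.
unknown_tokens() {
    # Put each ^...$ unit on its own line, then keep the units marked with '*'.
    grep -o '\^[^$]*\$' | sed -n 's|^\^\([^/]*\)/\*.*\$$|\1|p'
}

# Sample output copied from the example (not a live run).
sample='^Cat/Chat~1~$ ^Dog/Chien~2~$ ^mouse/*mouse$, ^cat/chat~1~$ ^DOG/CHIEN~2~$.'

printf '%s\n' "$sample" | unknown_tokens
```

The same filter could be attached to the end of an hfst-proc pipeline to collect words missing from the transducer.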

hfst-xfst

hfst-xfst is the most complex of these three tools. Below is a simple example of looking up words using hfst-xfst; lines beginning with the prompt hfst[N]: or apply up> are user input. Note that the option --print-weight must be specified if we want to see weights. Pressing Ctrl+D in apply up mode exits that mode and returns to the normal mode, where pressing Ctrl+C or typing exit exits the program.

hfst-xfst --print-weight
hfst[0]: load stack FILE
hfst[1]: apply up
apply up> cat
chat    1.00000
apply up> dog
chien   2.00000
apply up> Dog
???
apply up> mouse
???
apply up> [user presses Ctrl+D here]
hfst[1]: exit

You can find more hfst-xfst examples here.

Optimized lookup (OL) format

There is a special HFST transducer format designed for fast lookup, the optimized lookup (OL) format. hfst-proc only supports transducers in OL format. hfst-lookup supports transducers in other formats, but is much faster with OL transducers. hfst-xfst offers many operations, most of which are not implemented for OL format. However, the lookup operation also works with OL transducers.

How to find out which format a transducer is in

The tool hfst-format can be used:

hfst-format transducer.ofst
Transducers in transducer.ofst are of type OpenFST, std arc, tropical semiring
hfst-format transducer.ol
Transducers in transducer.ol are of type Hfst's lookup optimized, weighted
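
Since hfst-proc only accepts OL transducers, a script may want to check the reported type before calling it. The sketch below tests a line of hfst-format output for the OL type. It assumes the wording "lookup optimized" seen in the example above, and the function name is_ol_report is hypothetical:

```shell
# Succeed if an hfst-format report line describes an optimized lookup transducer.
# Assumes the wording "lookup optimized" from the example output above.
is_ol_report() {
    case "$1" in
        *"lookup optimized"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report lines copied from the example above (not a live run).
for report in \
    "Transducers in transducer.ofst are of type OpenFST, std arc, tropical semiring" \
    "Transducers in transducer.ol are of type Hfst's lookup optimized, weighted"
do
    if is_ol_report "$report"; then
        echo "OL:     $report"
    else
        echo "not OL: $report"
    fi
done
```

In a real script the argument would be the captured tool output, for example is_ol_report "$(hfst-format transducer.ol)".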

How to convert between transducer formats

The tool hfst-fst2fst can be used:

hfst-fst2fst --format optimized-lookup-weighted transducer.ofst > transducer.ol
hfst-fst2fst --format openfst-tropical transducer.ol > transducer.ofst

In hfst-xfst, there are special commands to convert between formats.

Interactive vs. non-interactive mode

All three tools can read user input from standard input (the default) or from a file, which must be specified with an option or a command line parameter:

option/parameter                        explanation
hfst-xfst --scriptfile=FILE             Read commands from FILE, and quit
hfst-xfst --startupfile=FILE            Read commands from FILE on startup
hfst-proc transducer_file [input_file]  Read input from input_file
hfst-lookup --input-strings=SFILE       Read lookup strings from SFILE

When reading user input from standard input, hfst-xfst and hfst-lookup distinguish between interactive mode (the default) and pipe mode. The difference is that both tools print a prompt in interactive mode unless the option --silent or --quiet is used. An example with hfst-xfst in interactive mode:

$ hfst-xfst
hfst[0]: regex foo:bar::3;
2 states, 1 arcs
hfst[1]:

and the same in pipe mode:

$ echo "regex foo:bar::3;" | hfst-xfst --pipe-mode
2 states, 1 arcs

In hfst-xfst, there is an apply up mode in which the user can look up words in a transducer and leave the mode by pressing Ctrl+D. As this is difficult in scripts (and possibly in Windows Command Prompt in general), the special string <ctrl-d> is reserved for this purpose. Below is an example in interactive mode:

$ hfst-xfst
hfst[0]: regex foo:bar;
2 states, 1 arcs
hfst[1]: apply up
apply up> foo
bar
apply up> [user presses Ctrl+D here]
hfst[1]: echo done
done
hfst[1]: exit

and the same in pipe mode:

$ echo -e "regex a:b;
apply up
a
<ctrl-d>
echo done
exit" | hfst-xfst --pipe-mode
2 states, 1 arcs
b
done
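
The pipe-mode input above can also be generated by a small helper, which is convenient when the list of words varies. This is only a sketch: the function name make_lookup_script is hypothetical, and it emits nothing beyond the commands already used in the example (regex, apply up, <ctrl-d>, exit):

```shell
# Print an hfst-xfst command script that compiles a regex and looks up
# each remaining argument in apply up mode.
make_lookup_script() {
    regex=$1
    shift
    printf 'regex %s\n' "$regex"
    printf 'apply up\n'
    for word in "$@"; do
        printf '%s\n' "$word"
    done
    # <ctrl-d> leaves apply up mode, as described above.
    printf '<ctrl-d>\n'
    printf 'exit\n'
}

# Show the generated commands for the regex and word from the example.
make_lookup_script 'a:b;' a
```

The generated commands could then be piped to the tool in the same way as the echo example, for instance make_lookup_script 'a:b;' a | hfst-xfst --pipe-mode.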

-- ErikAxelson - 2014-02-24