HFST: Command Line Tools

HFST tools is a collection of command line utilities that create, operate on and print transducers through the HFST interface. The tools are licensed under GNU GPL version 3; the licence text is in the file COPYING. Other licences may be available on request from the authors listed in the file AUTHORS.

Downloading and Installation

The tools can be fetched from our download page. We offer Debian packages for Linux, a Windows installer and a MacPorts installation. It is also possible to compile the tools from source.

To install from source, see the instructions in INSTALL. Briefly, the usual ./configure && make && (sudo) make install should result in a local installation, and make uninstall should remove it. If you would rather install in e.g. your home directory (or are not the system administrator), pass a prefix to configure: ./configure --prefix=${HOME}.
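
The home-directory variant can be sketched roughly as follows. The helper name install_hfst and the PATH line are illustrative additions, not part of the HFST distribution. The sketch assumes it is run from the top of an unpacked source tree and relies on the usual autotools convention that programs are installed into the bin directory under the prefix.

```shell
# Install root; with --prefix the tools should land in ${PREFIX}/bin.
PREFIX="${HOME}"

# install_hfst is a hypothetical helper name. Call it from the top of the
# unpacked HFST source tree (the directory that contains ./configure).
install_hfst() {
    ./configure --prefix="${PREFIX}" &&
    make &&
    make install    # no sudo needed when the prefix is a directory you own
}

# Make the installed tools visible to the current shell session.
PATH="${PREFIX}/bin:${PATH}"
export PATH
```

Defining the steps as a function keeps the sketch harmless to paste; nothing is built until install_hfst is actually called.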

Getting started

  • A tutorial with simple examples.
  • Get familiar with the different functionalities offered by the HFST tools.
  • Examples of HFST command line tools are given in tool-specific wiki pages.
  • The rest of this page lists the HFST tools and their purposes and gives information on parameters and formats recognized by the tools.

Command Line Utilities

Tool Purpose
hfst-affix-guessify Create weighted affix guesser from automaton.
hfst-calculate An alias for hfst-sfstpl2fst.
hfst-compare Compare two transducer inputs for equivalence.
hfst-compose Compose two transducer inputs pairwise.
hfst-compose-intersect Compute the intersecting composition of a lexicon transducer and rule transducers.
hfst-concatenate Concatenate two transducer inputs pairwise.
hfst-conjunct Conjunct (intersect) two transducer inputs pairwise.
hfst-determinize Determinize transducer input.
hfst-disjunct Disjoin (calculate the union of) two transducer inputs pairwise.
hfst-duplicate Use first transducer of an archive repeatedly.
hfst-edit-metadata Set values of properties in transducer headers.
hfst-foma Wrapper around foma. Native HFST tool is hfst-xfst.
hfst-format Give the implementation format of transducer input.
hfst-fst2fst Convert between HFST, OpenFst, SFST and foma transducers.
hfst-fst2strings Display the strings recognized by a transducer.
hfst-fst2txt Print transducer in AT&T tabular format.
hfst-grep Search for PATTERN in each FILE or standard input.
hfst-guess Use a guesser (and generator) to guess analyses or inflectional paradigms of unknown words.
hfst-guessify Compile a morphological analyzer into a guesser and generator.
hfst-head Take N first transducers in transducer input.
hfst-info Print known data of HFST library.
hfst-invert Invert each transducer in input.
hfst-lexc-wrapper A wrapper for foma's lexc, the native tool is hfst-lexc.
hfst-lexc Compile lexicon files in Xerox Lexc formalism into an HFST transducer.
hfst-lookup Fast look-up of strings in a transducer.
hfst-minimize Minimize transducer input.
hfst-name Name or print the name of each transducer in input.
hfst-optimized-lookup Run a transducer on standard input (one word per line) and print analyses. More efficient version of hfst-lookup.
hfst-ospell Spell check using HFST finite-state automata.
hfst-pair-test Test a Twol rule file using correspondences of strings.
hfst-pmatch Perform matching/transformation on text streams with an RTN system.
hfst-pmatch2fst Compile regular expressions into transducer(s) for use with hfst-pmatch.
hfst-proc Perform morphological analysis and generation with finite state transducers.
hfst-project Project a transducer towards input or output level.
hfst-prune-alphabet Remove symbols from the alphabet of a transducer that do not occur in any of the transitions.
hfst-push-weights Push weights of a transducer towards initial or final state(s).
hfst-regexp2fst Convert regular expression(s) into transducer output.
hfst-remove-epsilons Remove epsilons from transducer input.
hfst-repeat Repeat a transducer from N to M times.
hfst-reverse Reverse each transducer in input.
hfst-reweight Reweight transducer weights simply.
hfst-sfstpl2fst Compile files in SFST programming language into HFST transducers.
hfst-shuffle Shuffle two transducers.
hfst-split Write each transducer in the input into a separate file.
hfst-strings2fst Compile string pairs and pair-strings into transducers.
hfst-subtract Subtract two transducer inputs pairwise.
hfst-substitute Substitute transition(s) in each input transducer with other transition(s) or a transducer.
hfst-summarize Print general information about a transducer.
hfst-tag Tag a text file using an hfst tagger.
hfst-tail Take N last transducers in the input.
hfst-train-tagger Compile training data file into an hfst part-of-speech tagger.
hfst-traverse Walk through the transducer arc by arc.
hfst-twolc Compile a two-level grammar in Xerox Twolc formalism into an HFST transducer.
hfst-txt2fst Convert AT&T tabular format into a binary transducer.
hfst-xfst Compile XFST scripts or use XFST commands in interactive mode.

Usage

hfst-toolname [OPTIONS] [FILE...]

The HFST tools comprise a number of different command line utilities, and their parameters vary from tool to tool. If in doubt, the parameter --help always shows the parameters of a tool. For further instructions and examples, see the tool-specific wiki pages.

Common parameters

Parameters common to all command line programs:

-h, --help Print help message
-V, --version Print version info
-v, --verbose Print verbosely while processing
-q, --quiet Do not print output
-s, --silent Alias of --quiet

Parameters common to all command line programs that take one input stream and write transducers or text as output:

-i, --input=FILENAME Read input from FILENAME
-o, --output=FILENAME Write output to FILENAME

If the output parameter is not given, the transducer or text output is written to the standard output stream. That is, the following are equivalent in terms of output processing:

hfst-toolname --output=transducer.hfst
hfst-toolname > transducer.hfst

hfst-toolname --output=text.txt
hfst-toolname > text.txt

If the resulting transducer is written to the standard output stream, warnings and verbose output are printed to the standard error stream instead of standard output. Error messages are always printed to the standard error stream.
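
Because the transducer and the diagnostics travel on different streams, they can be redirected separately. The following is a minimal sketch: the file names are made up for illustration, and the HFST commands are only run if the tools are found on PATH.

```shell
workdir=$(mktemp -d)

# One-arc transducer in AT&T tabular format: 0 --a:b--> 1, state 1 is final.
printf '0\t1\ta\tb\n1\n' > "$workdir/ab.txt"

if command -v hfst-txt2fst >/dev/null 2>&1; then
    hfst-txt2fst --input="$workdir/ab.txt" --output="$workdir/ab.hfst"
    # The inverted transducer goes to standard output (here: ba.hfst);
    # verbose messages and warnings go to standard error (here: invert.log).
    hfst-invert --verbose --input="$workdir/ab.hfst" \
        > "$workdir/ba.hfst" 2> "$workdir/invert.log"
else
    echo "HFST tools not found on PATH; skipping" >&2
fi
```

Keeping the log in its own file means the binary transducer in ba.hfst is not mixed with human-readable messages.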

Input parameters for unary operator tools

The input filename may also be specified as a free argument on the command line, or the input may be given through standard input. That is, the following are equivalent in terms of input file processing:

hfst-toolname --input=transducer.hfst
hfst-toolname transducer.hfst
cat transducer.hfst | hfst-toolname

hfst-toolname --input=text.txt
hfst-toolname text.txt
cat text.txt | hfst-toolname

Unary operations are:

hfst-determinize, hfst-fst2strings, hfst-fst2txt, hfst-fst2fst, hfst-head, hfst-invert, hfst-minimize, hfst-name, hfst-project, hfst-push-weights, hfst-remove-epsilons, hfst-repeat, hfst-reverse, hfst-split, hfst-strings2fst, hfst-summarize, hfst-tail, hfst-txt2fst
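
As a concrete sketch of the three input styles, the snippet below runs hfst-invert on the same file in each way and compares the printed results. The file names are illustrative only, and the three outputs are expected (not guaranteed here) to be identical.

```shell
workdir=$(mktemp -d)

# One-arc transducer in AT&T tabular format: 0 --a:b--> 1, state 1 is final.
printf '0\t1\ta\tb\n1\n' > "$workdir/ab.txt"

if command -v hfst-txt2fst >/dev/null 2>&1; then
    hfst-txt2fst "$workdir/ab.txt" > "$workdir/ab.hfst"

    # The same unary operation, with the input supplied in three ways.
    hfst-invert --input="$workdir/ab.hfst" | hfst-fst2txt > "$workdir/option.txt"
    hfst-invert "$workdir/ab.hfst"         | hfst-fst2txt > "$workdir/argument.txt"
    cat "$workdir/ab.hfst" | hfst-invert   | hfst-fst2txt > "$workdir/stdin.txt"

    # cmp -s is silent and succeeds only when the two files are identical.
    if cmp -s "$workdir/option.txt" "$workdir/argument.txt" &&
       cmp -s "$workdir/option.txt" "$workdir/stdin.txt"; then
        echo "all three invocations produced the same output"
    fi
else
    echo "HFST tools not found on PATH; skipping" >&2
fi
```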

Parameters for binary operator tools

-1, --input1=FILENAME Read first input transducer from FILENAME
-2, --input2=FILENAME Read second input transducer from FILENAME

It is also possible to give one or both of the filenames as free arguments on the command line. That is, all of the following are equivalent in terms of processing:

hfst-toolname --input1=first.hfst --input2=second.hfst
hfst-toolname first.hfst second.hfst
hfst-toolname --input1=first.hfst second.hfst
hfst-toolname --input2=second.hfst first.hfst
cat first.hfst | hfst-toolname --input2=second.hfst
cat first.hfst | hfst-toolname second.hfst
cat second.hfst | hfst-toolname --input1=first.hfst

If the binary operator is not commutative, input1 (the first transducer) is the first, or leftmost, operand. For example, in composition input1’s output level is matched against input2’s input level.

Binary operations are:

hfst-compare, hfst-compose, hfst-concatenate, hfst-conjunct, hfst-disjunct, hfst-subtract

hfst-compose-intersect takes two transducer inputs as parameters in the same way as the rest of the binary tools, although it processes the inputs in a slightly different way.
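
To make the operand order concrete, here is a small sketch with composition. The two one-arc transducers and their file names are invented for the example; since the first maps a to b and the second maps b to c, the composition should map a to c.

```shell
workdir=$(mktemp -d)

# AT&T tabular format, one arc each; state 1 is final in both.
printf '0\t1\ta\tb\n1\n' > "$workdir/first.txt"     # maps a to b
printf '0\t1\tb\tc\n1\n' > "$workdir/second.txt"    # maps b to c

if command -v hfst-compose >/dev/null 2>&1; then
    hfst-txt2fst "$workdir/first.txt"  > "$workdir/first.hfst"
    hfst-txt2fst "$workdir/second.txt" > "$workdir/second.hfst"

    # first.hfst is the left operand: its output level (b) is matched
    # against the input level (b) of second.hfst.
    hfst-compose --input1="$workdir/first.hfst" --input2="$workdir/second.hfst" |
        hfst-fst2txt

    # Same operands, with the first one arriving on standard input.
    cat "$workdir/first.hfst" |
        hfst-compose --input2="$workdir/second.hfst" |
        hfst-fst2txt
else
    echo "HFST tools not found on PATH; skipping" >&2
fi
```

Swapping the two operands would try to match c against a instead, which is why the order matters for non-commutative tools.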

Parameters for tools operating on arbitrary number of transducers

For tools that operate on an arbitrary number of input transducers, the list of filenames must be given as free parameters on the command line, e.g.:

hfst-toolname input1.hfst input2.hfst ... input-n.hfst

HFST tools that operate on multiple input parameters are:

hfst-lexc

Defining the backend format

The tools that create transducers from scratch (hfst-sfstpl2fst, hfst-regexp2fst, hfst-strings2fst), compile them from AT&T format (hfst-txt2fst) or convert between binary formats (hfst-fst2fst) let the user specify the backend format of the resulting binary transducer(s).

-f, --format=FORMAT Use backend FORMAT

Legal values of FORMAT depend on which backend libraries were compiled into HFST. The default backend format, if available, is openfst-tropical. All supported backend formats can be listed with the command hfst-format --list. As of HFST 3, the following strings are allowed:

Allowed strings                                       Backend used                                               Note
sfst                                                  SFST backend
ofst-tropical, openfst-tropical, openfst, ofst        OpenFST standard automata with tropical semiring weights   (default)
ofst-log, openfst-log                                 OpenFST with log weights
foma                                                  foma backend
optimized-lookup-weighted, olw, optimized-lookup, ol  HFST's lookup-optimized automata with weights              Not supported by hfst-regexp2fst or hfst-sfstpl2fst
optimized-lookup-unweighted, olu                      HFST's lookup-optimized automata without weights           Not supported by hfst-regexp2fst or hfst-sfstpl2fst

Tools that take transducer(s) as input use the backend functions of the input transducers and write output in the same format. To use the functions of a different transducer library, the user must perform explicit conversion with hfst-fst2fst. For example, if transducer.sfst is a binary transducer in SFST format and the user wishes to use foma's inversion function and get the result in SFST format, the following commands are needed:

cat transducer.sfst | hfst-fst2fst --format foma | hfst-invert | hfst-fst2fst --format sfst 

The optimized lookup backend format is not supported by most tools, as it is mainly intended for fast lookup. The tools that support it are hfst-lookup, hfst-fst2fst, hfst-txt2fst, hfst-fst2txt, hfst-strings2fst and hfst-format.

Parameters for tools that support weights given on the command line

The tools hfst-regexp2fst and hfst-strings2fst have the following option:

-w, --weight=NUMBER Use NUMBER as default weight instead of semiring one

The weight NUMBER is parsed using the standard library's strtod(3) implementation. The semantics of weights depend on the selected backend.

Tool-specific parameters

For more detailed information on the parameters of a tool, see its wiki page.

Transducer and file formats

The software is essentially built around the concept of synchronized transducers, i.e. the input and the output symbols of a transducer are synchronized symbol pairs. In order to reduce the number of different versions of our tools, one character encoding convention must be used for the input and output text formats. Currently Unicode with UTF-8 is used in all utilities, and all our demo lexicons are implemented in UTF-8 (even the English lexicon).

In order to allow different input modes for various functionalities, it was found most convenient to separate the conversions between string, text and binary formats into separate modules. Unless otherwise specified on the command line, we assume that the input is read from the standard input and the output is directed to the standard output. The input and output may specify or contain several transducers. Transducers in text format are separated by a transducer delimiter ("--" plus a newline). A delimiter at the end of a file indicates that an empty transducer follows. A sequence of two delimiters indicates an empty transducer in between.
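
The delimiter convention can be illustrated with a small sketch. The two one-arc transducers and the file name two.txt are invented for the example; the round trip through hfst-txt2fst and hfst-fst2txt is expected to print both transducers again, and is only attempted if the tools are on PATH.

```shell
workdir=$(mktemp -d)

# Two transducers in AT&T text format, separated by the "--" delimiter line.
# No delimiter follows the last one, since a trailing delimiter would
# announce one more (empty) transducer.
printf '0\t1\ta\tb\n1\n--\n0\t1\tb\tc\n1\n' > "$workdir/two.txt"

if command -v hfst-txt2fst >/dev/null 2>&1; then
    # No input or output option: read standard input, write standard output.
    hfst-txt2fst < "$workdir/two.txt" | hfst-fst2txt
else
    echo "HFST tools not found on PATH; skipping" >&2
fi
```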

Transducer formats

HFST 3 stores automata in the HFST automata container format, which consists of the HFST3\0 magic sequence, an HFST 3 metadata header, and the backend's own automaton in its original format.

For operations that read and write transducers, the output is always of the same type as the input(s), and this cannot be overridden (except with hfst-fst2fst). The command line tools are independent of the file format: they select the function based on the input data type. All input transducers are assumed to have the same binary format (which is determined from the first transducer in the first transducer input). In the text-to-transducer tools, the format is selected by the user with options. To convert between different transducer binary formats, the tool hfst-fst2fst is provided. Tools that operate on multiple transducers issue an error message if fed with different types of transducers.

Transducer input and output

The tools support transducer files (or pipelined transducer input/output) containing a sequence of transducers.

Tools that take a single transducer input repeat the operation for each transducer in the input, as if each had been provided in a separate invocation.

The tools that take two transducer inputs repeat the operation pairwise for each pair of transducers read from the inputs. The transducer inputs must contain the same number of transducers, else the program exits and prints an error message. An exception is the case where the first input contains one or more transducers and the second one exactly one transducer. In this case the operation is applied for each transducer in the first input so that the second transducer remains the same all the time.

The tool hfst-compose-intersect allows both transducer inputs to contain one or more transducers, and the inputs can have a different number of transducers. The tool applies intersecting composition to each transducer in the first input, so that the set of transducers (i.e. all transducers read from the second input) stays the same throughout.

To further operate on sequences of transducers, tools hfst-head, hfst-tail and hfst-split can be used.
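
The pairwise rule and its one-transducer exception can be sketched as follows. The transducers and file names are invented for the example. The -n option of hfst-head is an assumption that this page does not confirm, so check hfst-head --help before relying on it.

```shell
workdir=$(mktemp -d)

# First input: two transducers (a:b and c:b), separated by "--".
printf '0\t1\ta\tb\n1\n--\n0\t1\tc\tb\n1\n' > "$workdir/first.txt"
# Second input: exactly one transducer (b:d).
printf '0\t1\tb\td\n1\n' > "$workdir/second.txt"

if command -v hfst-compose >/dev/null 2>&1; then
    hfst-txt2fst "$workdir/first.txt"  > "$workdir/first.hfst"
    hfst-txt2fst "$workdir/second.txt" > "$workdir/second.hfst"

    # Two transducers against one: the single second transducer is reused
    # for each transducer of the first input, so two results are written.
    hfst-compose "$workdir/first.hfst" "$workdir/second.hfst" \
        > "$workdir/composed.hfst"

    # Keep only the first result.
    # ASSUMPTION: hfst-head accepts the count as -n (see hfst-head --help).
    hfst-head -n 1 "$workdir/composed.hfst" | hfst-fst2txt
else
    echo "HFST tools not found on PATH; skipping" >&2
fi
```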

Transition symbols

For tools that take strings or AT&T text format as input or print them as output, the following special symbols are reserved:

symbol meaning
"@_EPSILON_SYMBOL_@" The epsilon.
"@0@" An alternative representation of the epsilon.
"@_UNKNOWN_SYMBOL_@" Any symbol not known to a transducer.
"@_IDENTITY_SYMBOL_@" Any identity symbol pair not known to a transducer.

Some tools may take input or produce output that uses a different formalism. For instance, in SFST programming language the epsilon is always denoted as "<>". However, the resulting transducers always use the above-mentioned special symbols:

$ echo "<>:a" | hfst-sfstpl2fst -f sfst | hfst-fst2txt
@0@  a    0    1
1

The internal representation of a transition label in a transducer is a number. The mapping from symbols (strings) to numbers is done internally. If the mappings differ between transducers, harmonization is carried out.

Reporting bugs

All bugs in the command line tools should be reported to the HFST issue tracker on SourceForge. Please include at least the steps needed to reproduce the error (i.e. the exact command(s) used) and the first line of the output of hfst-toolname --version. E.g. include the following in your message:

$ hfst-toolname --version
HFST Toolname 0.1 (hfst 3.0)
$ hfst-toolname [PARAMETERS that fail]
Failure output
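
A report in this shape can be collected with a small helper such as the one sketched below. The function collect_report is hypothetical and not part of HFST; it simply runs the given tool twice, once with --version and once with the failing arguments, and prints both in the layout shown above.

```shell
# collect_report TOOL [ARGS...]
# Hypothetical helper: prints a bug-report snippet to standard output.
collect_report() {
    tool=$1
    shift
    printf '$ %s --version\n' "$tool"
    # Only the first line is needed; it names the tool and HFST versions.
    "$tool" --version 2>&1 | head -n 1
    printf '$ %s %s\n' "$tool" "$*"
    # The exact failing command, with standard error folded into the report.
    "$tool" "$@" 2>&1
    printf '(exit status: %s)\n' "$?"
}
```

For example, collect_report hfst-invert --input=broken.hfst > report.txt would capture the version line and the failure output in one file, where broken.hfst stands for whatever input triggers the problem.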

You may also send email to the HFST team.

Development and distribution

The source code archive contains a test suite, run with make check, which all distributed versions must pass unless they are clearly labelled as alpha test versions.