hfst-strings2fst

Purpose

Compile string pairs or pair-strings into transducer(s).

Usage

The help message:

Usage: hfst-strings2fst [OPTIONS...] [INFILE]
Compile string pairs and pair-strings into transducer(s)

Common options:
  -h, --help             Print help message
  -V, --version          Print version info
  -v, --verbose          Print verbosely while processing
  -q, --quiet            Only print fatal errors and requested output
  -s, --silent           Alias of --quiet
Input/Output options:
  -i, --input=INFILE     Read input strings from INFILE
  -o, --output=OUTFILE   Write output transducer to OUTFILE
String and format options:
  -f, --format=FMT          Write result in FMT format
  -j, --disjunct-strings    Disjunct all strings instead of transforming
                            each string into a separate transducer
      --norm                Divide each weight by sum of all weights
                            (with option -j)
      --log                 Take negative natural logarithm of each weight
      --log10               Take negative 10-based logarithm of each weight
  -p, --pairstrings         Input is in pairstring format
  -S, --has-spaces          Input has spaces between symbols/symbol pairs
  -e, --epsilon=EPS         Interpret string EPS as epsilon.
  -m, --multichar-symbols=FILE   Strings that must be tokenized as one symbol.

If OUTFILE or INFILE is missing or -, standard streams will be used.
FMT can be { foma, openfst-tropical, openfst-log, sfst,
optimized-lookup-weighted, optimized-lookup-unweighted }.
If EPS is not defined, the default representation of @0@ is used.
Option --norm precedes option --log.
The FILE of option -m lists all multichar-symbols, each symbol
on its own line.
Backslash '\' may be used to escape ':', tab and itself. For any
other symbol x '\x' means x literally, i.e. is the same as 'x'.
The weight of a string can be given after the string separated
by a tabulator. The weight cannot be zero.

Examples:
  echo "cat:dog" | hfst-strings2fst            create cat:dog fst
  echo "c:da:ot:g" | hfst-strings2fst -p       same as pairstring
  echo "c:d a:o t:g" | hfst-strings2fst -p -S  same as pairstring with spaces
  echo "c a t:d o g" | hfst-strings2fst -S     same with spaces

Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>

Input

The strings can be given in various formats using options --pairstrings and --has-spaces. Below are four possible ways to create a transducer that maps "cat" to "dog".

Options used                Input          Explanation
none                        "cat:dog"      as string pairs (the default representation)
--has-spaces                "c a t:d o g"  as string pairs with (one or more) spaces between each symbol
--pairstrings               "c:da:ot:g"    as symbol pair strings (as in SFST)
--has-spaces --pairstrings  "c:d a:o t:g"  as symbol pairs separated by (one or more) spaces

In space-separated representations, the spaces are obligatory. A sequence of several spaces is equivalent to a single space. A run of consecutive letters is then interpreted as a single (multi-character) symbol. Spaces before and after the colon ':' are optional in the string pair representation.
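
The relationship between the space-separated pair-string form and the default string-pair form can be illustrated with a small shell sketch. It does not call hfst-strings2fst, and the function name pairs_to_string_pair is hypothetical. The sketch assumes that every token has the form input:output and that the input contains no escaped colons or special symbols.

```shell
# Hypothetical helper, not part of HFST: join the input sides and the
# output sides of space-separated symbol pairs into one string pair.
# Assumes every token looks like IN:OUT and contains no escaped colon.
pairs_to_string_pair() {
  awk '{
    upper = ""; lower = ""
    for (i = 1; i <= NF; i++) {      # runs of spaces count as one separator
      split($i, side, ":")
      upper = upper side[1]          # input symbol
      lower = lower side[2]          # output symbol
    }
    print upper ":" lower
  }'
}

echo "c:d a:o t:g" | pairs_to_string_pair      # prints cat:dog
echo "c:d   a:o   t:g" | pairs_to_string_pair  # same result
```

Both calls print the string pair that would be given to the tool without any format options.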

Special symbols

Special symbols recognized by the tool:

Symbol   Representations recognized   Note
epsilon  "@0@", "@_EPSILON_SYMBOL_@"  additional representations of the epsilon can be given with the option -e EPS
space    " ", "@_SPACE_@"             an unescaped space separates the symbols or symbol pairs in space-separated input
colon    ":", "@_COLON_@"             an unescaped colon separates input and output strings (in string pairs) or input and output symbols (in pair strings)
tab      "@_TAB_@"                    an unescaped tab character separates the string and weight fields

Examples

Weighting

If we have a file strings.txt (the two fields on each line are separated by a tab)

cat   7
dog   4
mouse 2

that lists a set of words and how many times they have occurred in a text, the commands

cat strings.txt | hfst-strings2fst --norm --disjunct-strings -f openfst-tropical | hfst-fst2strings -w

give as output the probabilities of each word:

cat:cat      0.538462
dog:dog      0.307692
mouse:mouse  0.153846

Note that we use openfst-tropical as the implementation format because we need the weights. We also use --disjunct-strings because we want to disjunct all strings given in strings.txt into a single transducer; according to the help message, --norm is used together with this option.
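
The arithmetic behind the weight options can be checked without HFST. The following is a minimal sketch that only mimics the description in the help message (divide each weight by the sum, then take the negative natural logarithm); the function name norm_weights is hypothetical, and the sketch is not the tool's implementation.

```shell
# Illustration only, not part of HFST: reproduce the arithmetic of
# --norm (divide by the sum) and --log (negative natural logarithm).
# Reads lines of the form STRING<TAB>WEIGHT from standard input.
norm_weights() {
  awk -F '\t' '
    { word[NR] = $1; weight[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++) {
        p = weight[i] / total        # what --norm computes
        printf "%s\t%.6f\t%.6f\n", word[i], p, -log(p)
      }
    }'
}

printf 'cat\t7\ndog\t4\nmouse\t2\n' | norm_weights
```

The second column reproduces the probabilities 0.538462, 0.307692 and 0.153846 from the example. The third column shows what the weights would become if --log were added as well, since --norm is applied before --log.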

Multichar symbols

Suppose that we have a file symbols.txt that contains the strings we want to tokenize as single symbols (when we are not using spaces as separators between symbols or symbol pairs):

epsilon
foo
bar
baz

The commands:

echo "epsilonbaz:foobar" | hfst-strings2fst --multichar-symbols symbols.txt --epsilon "epsilon"

will then produce the same result as:

echo '@0@:foo baz:bar' | hfst-strings2fst -p -S
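
To see why the two inputs are equivalent, it can help to tokenize each side by hand. The sketch below is not HFST's tokenizer: the function tokenize is hypothetical, and it assumes that the longest listed symbol is preferred at each position and that any other character is a symbol on its own.

```shell
# Hypothetical helper, not part of HFST: split a string into symbols,
# preferring the longest listed multichar symbol at each position and
# falling back to a single character.
# $1: file with one multichar symbol per line; the string comes from stdin.
tokenize() {
  awk -v symfile="$1" '
    BEGIN { while ((getline line < symfile) > 0) if (line != "") syms[line] = 1 }
    {
      out = ""; pos = 1
      while (pos <= length($0)) {
        best = substr($0, pos, 1)
        for (s in syms)
          if (length(s) > length(best) && substr($0, pos, length(s)) == s)
            best = s
        sep = (out == "" ? "" : " ")
        out = out sep best
        pos += length(best)
      }
      print out
    }'
}

tmpdir="$(mktemp -d)"
printf 'epsilon\nfoo\nbar\nbaz\n' > "$tmpdir/symbols.txt"

echo "epsilonbaz" | tokenize "$tmpdir/symbols.txt"   # prints: epsilon baz
echo "foobar" | tokenize "$tmpdir/symbols.txt"       # prints: foo bar
```

Pairing the symbols position by position gives epsilon:foo and baz:bar; with --epsilon "epsilon" the first input symbol is read as epsilon, which matches the pair string '@0@:foo baz:bar'.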

See also

HfstFst2Strings

-- MiikkaSilfverberg - 11 Jun 2008