hfst-lookup

Purpose

Perform fast transducer lookup, i.e. look up a set of input strings in the transducer and print the corresponding output strings.

Usage

The help message:

Usage: hfst-lookup [OPTIONS...] [INFILE]
perform transducer lookup (apply)

Common options:
  -h, --help             Print help message
  -V, --version          Print version info
  -v, --verbose          Print verbosely while processing
  -q, --quiet            Only print fatal errors and requested output
  -s, --silent           Alias of --quiet
Input/Output options:
  -i, --input=INFILE       Read input transducer from INFILE
  -o, --output=OUTFILE     Write output to OUTFILE
  -p, --pipe-mode[=STREAM] Control input and output streams
Lookup options:
  -I, --input-strings=SFILE        Read lookup strings from SFILE
  -O, --output-format=OFORMAT      Use OFORMAT printing results sets
  -e, --epsilon-format=EPS         Print epsilon as EPS
  -F, --input-format=IFORMAT       Use IFORMAT parsing input
  -x, --statistics                 Print statistics
  -X, --xfst=VARIABLE              Toggle xfst VARIABLE
  -c, --cycles=INT                 How many times to follow input epsilon cycles
  -b, --beam=B                     Output only analyses whose weight is within B from
                                   the best analysis
  -t, --time-cutoff=S              Limit search after having used S seconds per input
                                   (currently only works in optimized-lookup mode)
  -P, --progress                   Show neat progress bar if possible

If OUTFILE or INFILE is missing or -, standard streams will be used.
Format of result depends on format of INFILE
OFORMAT is one of {xerox,cg,apertium}, xerox being default
IFORMAT is one of {text,spaced,apertium}, default being text,
unless OFORMAT is apertium
VARIABLEs relevant to lookup are {print-pairs,print-space,
quote-special,show-flags,obey-flags}
Input epsilon cycles are followed by default INT=5 times.
Epsilon is printed by default as an empty string.
B must be a non-negative float.
S must be a non-negative float. The default, 0.0, indicates no cutoff.
If the input contains several transducers, a set containing
results from all transducers is printed for each input string.

STREAM can be { input, output, both }. If not given, defaults to {both}.
If input file is not specified with -I, input is read interactively line by
line from the user. If you redirect input from a file, use --pipe-mode=input.
--pipe-mode=output is ignored on non-windows platforms.

Todo:
  For optimized lookup format, only strings that pass flag diacritic checks
  are printed and flag diacritic symbols are not printed.
  Support VARIABLE 'print-space' for optimized lookup format

Known bugs:
  'quote-special' quotes spaces that come from 'print-space'

Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>
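
The --beam rule from the help message ("output only analyses whose weight is within B from the best analysis") can be illustrated on already-printed xerox-format results. The sketch below is not how hfst-lookup implements the option. It assumes tab-separated output lines, treats "within B" as an inclusive comparison, and uses a hypothetical helper name (beam_filter) and hypothetical weights:

```shell
# Hypothetical helper: keep, per input string, only the analyses whose weight
# is at most BEAM above that string's best (lowest) weight.
# Assumes xerox-format lines: input<TAB>output<TAB>weight.
beam_filter() {
  awk -F'\t' -v beam="$1" '
    {
      line[NR] = $0
      # "known" is true only for real analyses, not blank lines or "+?" results
      known[NR] = (NF >= 3 && substr($2, length($2) - 1) != "+?")
      if (known[NR]) {
        key[NR] = $1
        w[NR] = $3 + 0
        if (!($1 in best) || w[NR] < best[$1]) best[$1] = w[NR]
      }
    }
    END {
      for (i = 1; i <= NR; i++)
        if (!known[i] || w[i] <= best[key[i]] + beam) print line[i]
    }
  '
}

# Hypothetical weighted results for "cactus"; with a beam of 1 only the
# analysis with weight 0 is kept.
printf 'cactus\tcacti\t0.000000\ncactus\tcactuses\t2.500000\n' | beam_filter 1
```

With hfst-lookup itself, the same restriction is requested directly with -b B.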


Input

The option --input names the transducer in which strings are looked up. The transducer can also be given as a free argument. The option --input-strings names the file from which the lookup strings are read. If either is omitted, standard input is used instead. The following commands are equivalent:

hfst-lookup --input transducer.hfst --input-strings strings.txt
hfst-lookup transducer.hfst --input-strings strings.txt
cat strings.txt | hfst-lookup transducer.hfst
cat transducer.hfst | hfst-lookup --input-strings strings.txt 

NOTE: If the transducer is not in optimized lookup format, the tool will give a warning that the lookup will be slow. You can convert a transducer into optimized lookup format with the tool hfst-fst2fst.

The option --input-format defines the format of the strings that are looked up. The formats are { text, spaced, apertium }. The default is text, unless apertium is used as the output format, in which case the input format also defaults to apertium. To look up the words "cat" and "dog", we would use the following input in each format:

input format   input   more information
text           cat
               dog
spaced         c a t
               d o g
apertium       cat     http://wiki.apertium.org/wiki/Apertium_stream_format
               dog

Output

Output format

The output format is chosen with the option --output-format from { xerox, cg, apertium }, xerox being the default. For example, suppose we have a weighted transducer cat2chat.hfst that maps "cat" to "chat" with weight 3, and a file named words.txt that contains the words to look up:

cat
dog

the command

hfst-lookup --input cat2chat.hfst --input-strings words.txt --output-format output_format

gives us the following results with different values of output_format:

output format: xerox
result:
cat     chat    3.000000

dog     dog+?   inf

note: +? indicates that "dog" is not found.
more information: http://www.stanford.edu/~laurik/fsmbook/clarifications/lookup-2.html

output format: cg
result:
"<cat>"
        ""chat  3.000000

"<dog>"
        "dog" ? Inf

note: ? indicates that "dog" is not found.
more information: http://beta.visl.sdu.dk/visl/vislcg-doc.html

output format: apertium
result:
^cat/chat$
^dog/*dog$
note: * indicates that "dog" is not found.
more information: http://wiki.apertium.org/wiki/Apertium_stream_format
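
Because unrecognized inputs are marked with +? in the xerox format, they are easy to collect from the output. The following is a minimal sketch, assuming the xerox output lines are tab-separated (input, output, weight). The helper name unknown_words is hypothetical, and the sample data mirrors the xerox result shown above:

```shell
# Hypothetical helper: print every input string whose xerox-format result
# ends in "+?", i.e. every string that was not found in the transducer.
unknown_words() {
  awk -F'\t' 'NF >= 2 && substr($2, length($2) - 1) == "+?" { print $1 }'
}

# Sample xerox-format output for the words "cat" and "dog".
printf 'cat\tchat\t3.000000\n\ndog\tdog+?\tinf\n' | unknown_words
```

In practice the function would be fed directly from hfst-lookup, for example hfst-lookup --input cat2chat.hfst --input-strings words.txt | unknown_words.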

Xfst variables

See hfst-fst2strings. Note that -X quote-special works differently in hfst-lookup; see Special symbols below.

Special symbols

Special symbols are printed as follows unless options -X or -e are used:

symbol            printed as   note
epsilon           ""           can be changed to EPS with -e EPS
colon             ":"          printed as "\:" if -X quote-special is requested
tabulator         as such      printed as a backslash followed by the tabulator if -X quote-special is requested
space             " "          printed as "\ " if -X quote-special is requested
flag diacritics   ""           printed if -X show-flags is requested

Examples

We first create a simple transducer singular2plural.hfst that maps words in singular to their plural forms:

echo "cat:cats
mouse:mice
cactus:cacti
cactus:cactuses" | hfst-strings2fst -j -f sfst > singular2plural.hfst

Then we look up a set of words in the transducer:

echo "cat
dog
mouse
cactus" | hfst-lookup singular2plural.hfst

We get the following results:

cat     cats    0.000000

dog     dog+?   inf

mouse   mice    0.000000

cactus  cacti   0.000000
cactus  cactuses        0.000000

We see that the transducer singular2plural.hfst gives one result for the strings "cat" and "mouse", two results for the string "cactus" and no results for the string "dog".
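
The same summary can be computed mechanically from the output. The sketch below counts the results per input string, assuming tab-separated xerox-format lines in which a result ending in +? means "not found". The helper name count_analyses is hypothetical:

```shell
# Hypothetical helper: print each input string with its number of results,
# in order of first appearance. Results ending in "+?" count as zero.
count_analyses() {
  awk -F'\t' '
    NF == 0 { next }                                  # skip blank separator lines
    !($1 in n) { order[++k] = $1; n[$1] = 0 }         # remember first appearance
    substr($2, length($2) - 1) != "+?" { n[$1]++ }    # count real results only
    END { for (i = 1; i <= k; i++) print order[i] "\t" n[order[i]] }
  '
}

# The results shown above, reproduced as sample input.
printf 'cat\tcats\t0.000000\n\ndog\tdog+?\tinf\n\nmouse\tmice\t0.000000\n\ncactus\tcacti\t0.000000\ncactus\tcactuses\t0.000000\n' | count_analyses
```

For the sample above this reports one result each for "cat" and "mouse", none for "dog" and two for "cactus".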

Shortcomings

hfst-lookup is very fast when the transducer is in optimized lookup (OL) format. A transducer in any other format (openfst-tropical, sfst, foma) is converted into the generic HFST basic transducer format, whose lookup is relatively slow. For better performance with hfst-lookup, first convert a non-OL transducer into OL format with hfst-fst2fst.
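
A conversion step might look like the sketch below. It assumes that hfst-fst2fst accepts the format name optimized-lookup-weighted for its -f option (check hfst-fst2fst --help for the names your version supports). The function name convert_to_ol and the output file name singular2plural.hfstol are hypothetical:

```shell
# Hypothetical wrapper: convert transducer $1 into optimized lookup format,
# writing the result to $2. The format name is an assumption; verify it with
# hfst-fst2fst --help.
convert_to_ol() {
  hfst-fst2fst -f optimized-lookup-weighted -i "$1" -o "$2"
}

# Run the conversion only if the HFST tools and the example transducer exist.
if command -v hfst-fst2fst >/dev/null 2>&1 && [ -f singular2plural.hfst ]; then
  convert_to_ol singular2plural.hfst singular2plural.hfstol
  echo "cactus" | hfst-lookup singular2plural.hfstol
else
  echo "hfst-fst2fst or singular2plural.hfst not available; skipping conversion"
fi
```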

See also

HfstFst2Strings, HfstOptimizedLookup

-- ErikAxelson - 30 Jun 2008