hfst-lexc

Purpose

Compile lexc files into an HFST transducer.

Usage

hfst-lexc [OPTIONS] [INFILEs...]

The help message:

Usage: hfst-lexc [OPTIONS...] [INFILE1...]]
Compile lexc files into transducer

Common options:
  -h, --help             Print help message
  -V, --version          Print version info
  -v, --verbose          Print verbosely while processing
  -q, --quiet            Only print fatal erros and requested output
  -s, --silent           Alias of --quiet
Input/Output options:
  -f, --format=FORMAT     compile into FORMAT transducer
  -o, --output=OUTFILE    write result into OUTFILE
Lexc options:
  -A, --alignStrings      align characters in input and output strings
  -E, --encode-weights    encode weights when minimizing (default is false)
  -F, --withFlags         use flags to hyperminimize result
  -M, --minimizeFlags     if --withFlags is used, minimize the number of flags
  -R, --renameFlags       if --withFlags and --minimizeFlags are used, rename
                          flags (for testing)
  -x, --xerox-composition=VALUE Whether flag diacritics are treated as ordinary
                                symbols in composition (default is true).
  -X, --xfst=VARIABLE     toggle xfst compatibility option VARIABLE.
  -W, --Werror            treat warnings as errors

If INFILE or OUTFILE are omitted or -, standard streams will be used
The possible values for FORMAT are { sfst, openfst-tropical, openfst-log,
foma, optimized-lookup-unweighted, optimized-lookup-weighted }.
VALUEs recognized are {true,ON,yes} and {false,OFF,no}.
Xfst variables are {flag-is-epsilon (default OFF)}.

Examples:
  hfst-lexc -o cat.hfst cat.lexc               Compile single-file lexicon
  hfst-lexc -o L.hfst Root.lexc 2.lexc 3.lexc  Compile multi-file lexicon

Using weights:
  LEXICON Root
  cat # "weight: 2" ;    Define weight for a word
  <[dog::1]+> # ;        Use weights in regular expressions

Using weights has an effect only if FORMAT is weighted, i.e.
{ openfst-tropical, openfst-log, optimized-lookup-weighted }.

Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>

Details

To automate FST building in a Makefile, a pattern rule such as the following may be useful:

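# Assumption: hfst-lexc is on PATH; override HFSTLEXC otherwise.
HFSTLEXC ?= hfst-lexc

# Recipe lines must begin with a tab character.
%.lexc.hfst: %.hlexc
	$(HFSTLEXC) --verbose --output=$@ $<

# Any reasonably sized morphology is split over many files;
# list them all as prerequisites and pass them with $^:
unk.lexc.hfst: unk.root.hlexc unk.verbs.hlexc unk.nouns.hlexc unk.others.hlexc
	$(HFSTLEXC) -v -o $@ $^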

Error messages and warnings

Many of the messages are modelled after those of Xerox's lexc utility.

If neither --verbose nor --quiet is specified, warnings and errors are still printed, but little else. To suppress non-fatal warnings as well, use --quiet; fatal errors and requested output are still printed.

Examples

hfst-lexc -o file.hfst file.hlexc
compiles a single lexicon with default options

hfst-lexc -v -o file.hwfst file1.hlexc file2.hlexc
compiles a weighted lexicon from multiple source files, printing verbosely
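
The invocations above can also be scripted. The following is a minimal Python sketch, not part of HFST: the helper names lexc_command and compile_lexc are hypothetical, and only the -v, -f and -o options listed in the help message are used. It assumes hfst-lexc is on PATH when compile_lexc is called.

```python
import subprocess

# Formats listed in the hfst-lexc help message.
FORMATS = {
    "sfst", "openfst-tropical", "openfst-log", "foma",
    "optimized-lookup-unweighted", "optimized-lookup-weighted",
}


def lexc_command(infiles, outfile, fmt=None, verbose=False):
    """Build the argument list for one hfst-lexc run (hypothetical helper)."""
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    cmd = ["hfst-lexc"]
    if verbose:
        cmd.append("--verbose")
    if fmt is not None:
        cmd.append(f"--format={fmt}")
    cmd.append(f"--output={outfile}")
    cmd.extend(infiles)  # all source files of the lexicon, in order
    return cmd


def compile_lexc(infiles, outfile, **kwargs):
    """Run hfst-lexc; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(lexc_command(infiles, outfile, **kwargs), check=True)
```

For example, lexc_command(["file1.hlexc", "file2.hlexc"], "file.hwfst", verbose=True) builds the long-option equivalent of the second command above.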

Weights

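The following example recognizes the words "cat", "dog", "cats" and "dogs" with weights 2, 3, 5 and 6, respectively. The weight of a word is the sum of the weights of the entries on its path, e.g. 1 + 4 = 5 for "cats".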

Multichar_Symbols +Sg +Pl

LEXICON Root
cat     Num   "weight: 1"       ;
dog     Num   "weight: 2"       ;

LEXICON Num
+Sg:    #     "weight: 1"       ;
+Pl:s   #     "weight: 4"       ;
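
The arithmetic behind these weights can be checked with a small Python sketch. It does not use HFST; it is a toy model of the lexicon above that assumes entry weights are added along each path from Root to #. The LEXICONS table and the surface_forms function are illustrative names only.

```python
# Toy copy of the lexicon above: lexicon name -> list of
# (surface string, continuation lexicon, weight). "#" ends a word.
LEXICONS = {
    "Root": [("cat", "Num", 1), ("dog", "Num", 2)],
    "Num": [("", "#", 1), ("s", "#", 4)],  # +Sg: has an empty surface side
}


def surface_forms(lexicon="Root", prefix="", weight=0):
    """Yield (surface form, total weight) for every path ending in '#'."""
    for surface, continuation, entry_weight in LEXICONS[lexicon]:
        form = prefix + surface
        total = weight + entry_weight  # weights add up along the path
        if continuation == "#":
            yield form, total
        else:
            yield from surface_forms(continuation, form, total)


print(dict(surface_forms()))
# prints {'cat': 2, 'cats': 5, 'dog': 3, 'dogs': 6}
```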

The following example recognizes any number of consecutive cats with a weight equal to the number of cats, i.e. "cat" with weight 1, "catcat" with weight 2, etc.

LEXICON Root
<[cat::1]+>        #        ;

Using weights has an effect only if FORMAT is weighted, i.e. { openfst-tropical, openfst-log, optimized-lookup-weighted }. For more information on using weights in regular expressions, see HfstUsingWeights.
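
The repetition example can be mimicked in plain Python as well. This is only a sketch of what <[cat::1]+> accepts, assuming each repetition contributes its weight once; cat_weight is a hypothetical name and HFST is not involved.

```python
def cat_weight(word, unit="cat", unit_weight=1):
    """Return the weight of word under <[cat::1]+>, or None if rejected."""
    if not word or len(word) % len(unit) != 0:
        return None  # "+" requires at least one full repetition
    count = len(word) // len(unit)
    if unit * count != word:
        return None  # not a pure repetition of the unit
    return count * unit_weight  # one unit_weight per repetition


print(cat_weight("cat"), cat_weight("catcat"), cat_weight("cats"))
# prints 1 2 None
```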

Known bugs

  • Using the < @"filename" > construction to read automata in a format other than the one given with the -f switch may not work.
  • The last row of a lexicon must also end with a semicolon (;).
  • The original lexc user interface is not included; everything must be done from the command line.
  • % cannot be used to break multicharacter symbols apart; both ij and i%j will be parsed as the symbol ij, if such a symbol exists. As a workaround, insert 0, not %, between the symbols.
  • Encodings other than UTF-8 are not supported, by design. Use recode or iconv to convert legacy data.
  • If Xerox-style regular expressions are used, compilation is significantly slower than with string entries only.
  • The flex-based lexer uses lookahead, which could easily be replaced with yyless.
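
For the UTF-8 requirement listed above, recode or iconv are the suggested tools. Where neither is available, a conversion can be sketched in Python as follows. The function name to_utf8 is hypothetical, and the default source encoding (Latin-1) is only an assumption; it must be set to whatever the legacy data actually uses.

```python
from pathlib import Path


def to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode a legacy lexc file as UTF-8 (hypothetical helper)."""
    # Work on bytes so that line endings are preserved exactly.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

The converted file can then be passed to hfst-lexc in place of the original.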

Obtaining the program

HfstLexc is part of the HFST tools, which can be downloaded from http://sourceforge.net/projects/hfst/files/hfst/.