hfst-proc

Purpose

A tool for performing morphological analysis and generation with finite-state transducers in HFST optimized lookup format. This program is intended to clone the functionality of lt-proc from Apertium's lttoolbox and cg-proc from vislcg3, and is best used in conjunction with those tools because of certain idiosyncrasies in its I/O format; limited support for a general corpus processing mode is available.

Usage

hfst-proc [ -a [ -p | -C | -x ] [ -k ] | -g | -n | -d | -t ] [ -W ] [ -N N ] [ -c | -w ] [ -z ] [ -v | -q ] transducer_file [input_file [output_file]]

hfst-proc [ --analysis [ --apertium | --cg | --xerox ] [ --keep-compounds ] | --generation | --non-marked-gen | --debugged-gen | --tokenize ] [ --show-weights ] [ --analyses=N ] [ --case-sensitive | --dictionary-case ] [ --null-flush ] [ --verbose | --quiet ] [ --version | --help ] transducer_file [input_file [output_file]]

Parameters

Parameter Explanation
-a, --analysis Tokenizes the text into surface forms (lexical units as they appear in texts) and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. Multi-word surface forms are analysed in a left-to-right, longest-match fashion. Multi-word surface forms may be invariable (such as a multi-word preposition or conjunction) or inflected (for example, in Spanish, "echaban de menos", "they missed", is a form of the imperfect indicative tense of the verb "echar de menos", "to miss"). Analysis of single-word surface forms produces output like the following:

"cantar" -> `^cantar/cantar<vblex><inf>$'

or

"daba" -> `^daba/dar<vblex><pii><p1><sg>/dar<vblex><pii><p3><sg>$'.

-g, --generation Delivers a target-language surface form for each target-language lexical form, by suitably inflecting it.
-n, --non-marked-gen Morphological generation (like -g) but without unknown word marks (asterisk `*').
-d, --debugged-gen Morphological generation (like -g) but retaining part-of-speech tags.
-t, --tokenize Split the input stream into the symbols which would be fed into the transducer during analysis or generation, as well as showing which characters are unrecognized by the transducer. This is mainly used for debugging.
-p, --apertium Print the results of analysis in conformance with the Apertium stream format (default).
-C, --cg Print the results of analysis in the format expected by cg-proc(1). This implies -w.
-x, --xerox Print the results of analysis in Xerox format. This is the default output format of hfst-optimized-lookup(1).
-k, --keep-compounds Keep all compound analyses instead of removing any with more components than the minimum number of components of any available analyses.
-W, --show-weights For weighted transducers, print the final analysis weights.
-N N, --analyses=N Output no more than N analyses.
-c, --case-sensitive Use the literal case of the incoming characters instead of allowing upper-case characters to be treated as lower-case.
-w, --dictionary-case Use the case information contained in the lexicon, instead of the surface case (only applied in analysis mode).
-z, --null-flush Flush output on the null character.
-v, --verbose Be verbose.
-q, --quiet Don't be verbose (default).
-V, --version Display the version number.
-X, --raw Do not perform any case changes or unescaping.
-h, --help Display this help.

transducer_file The input transducer compiled into HFST's optimized lookup format.

input_file The source of the input text. If not given, <stdin> is used.

output_file Where to print the output. If not given, <stdout> is used.
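
The effect of -k, --keep-compounds described above can be illustrated with a small sketch. This is not the code hfst-proc uses; it assumes that compound boundaries appear in the analysis string as the symbols '+' and '#' (the symbols named by the -e option in the help message), and the function names and the sample analyses are hypothetical.

```python
def count_components(analysis, boundaries="+#"):
    """Count compound components, assuming every boundary symbol
    ('+' or '#') separates two components of the analysis."""
    return 1 + sum(analysis.count(symbol) for symbol in boundaries)


def filter_compounds(analyses, keep_compounds=False):
    """Drop analyses that have more components than the minimum found.

    With keep_compounds=True (the behaviour documented for -k),
    all analyses are returned unchanged.
    """
    if keep_compounds or not analyses:
        return list(analyses)
    fewest = min(count_components(a) for a in analyses)
    return [a for a in analyses if count_components(a) == fewest]


# Hypothetical analyses of one surface form, for illustration only.
candidates = ["sun<n>+flower<n>", "sunflower<n>"]
print(filter_compounds(candidates))
print(filter_compounds(candidates, keep_compounds=True))
```

Without -k only the analysis with the fewest components survives; with -k both analyses are kept.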


The help message

Usage: hfst-proc [-a [-p|-C|-x] [-k]|-g|-n|-d|-t] [-W] [-n N] [-c|-w] [-z] [-v|-q|]
    transducer_file [input_file [output_file]]
Perform a transducer lookup on a text stream, tokenizing on the fly
Transducer must be in HFST optimized lookup format

  -a, --analysis          Morphological analysis (default)
  -g, --generation        Morphological generation
  -n, --non-marked-gen    Morph. generation without unknown word marks
  -d, --debugged-gen      Morph. generation with everything printed
  -t  --tokenize          Tokenize the input stream into symbols (for debugging)
  -p  --apertium          Apertium output format for analysis (default)
  -C  --cg                Constraint Grammar output format for analysis
  -x, --xerox             Xerox output format for analysis
  -e, --do-compounds      Treat '+' and '#' as compound boundaries
  -k, --keep-compounds    Retain compound analyses even when analyses with fewer
                          compound-boundaries are available
  -W, --show-weights      Print final analysis weights (if any)
  -r, --show-raw-in-cg    Print the raw analysis string as sub-reading in CG output
  -N N, --analyses=N      Output no more than N analyses
                          (if the transducer is weighted, the N best analyses)
  --weight-classes N      Output no more than N best weight classes
                          (where analyses with equal weight constitute a class
  -c, --case-sensitive    Perform lookup using the literal case of the input
                          characters
  -w  --dictionary-case   Output results using dictionary case instead of
                          surface case
  -z  --null-flush        Flush output on the null character
  -v, --verbose           Be verbose
  -q, --quiet             Don't be verbose (default)
  -V, --version           Print version information
  -h, --help              Print this help message
  -X, --raw               Do not perform any mangling to:
                          case, ``superblanks'' or anything else!!!

Report bugs to hfst-bugs@helsinki.fi
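
The difference between --analyses=N and --weight-classes N in the help message above may be easier to see in a short sketch. It is an illustration only, not hfst-proc's implementation: it assumes that a lower weight means a better analysis, and the function names and the sample data are hypothetical.

```python
from itertools import groupby


def n_best(analyses, n):
    """Keep at most n (analysis, weight) pairs, best first.

    Assumption: a lower weight is a better analysis.
    """
    return sorted(analyses, key=lambda pair: pair[1])[:n]


def n_best_weight_classes(analyses, n):
    """Keep every analysis that belongs to one of the n best weight classes.

    Analyses with exactly equal weight form one class, so the result
    can contain more than n analyses.
    """
    ranked = sorted(analyses, key=lambda pair: pair[1])
    kept = []
    for index, (_, group) in enumerate(groupby(ranked, key=lambda pair: pair[1])):
        if index >= n:
            break
        kept.extend(group)
    return kept


# Hypothetical weighted analyses, for illustration only.
weighted = [("a", 2.0), ("b", 1.0), ("c", 1.0), ("d", 3.0)]
print(n_best(weighted, 2))
print(n_best_weight_classes(weighted, 2))
```

With these sample weights, limiting to two analyses keeps "b" and "c", while limiting to two weight classes keeps "b", "c" and "a", because "b" and "c" share one class.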

Details

Performs morphological analysis and generation on a stream of running text using a finite-state transducer in HFST's optimized-lookup format (conventional file extension: .hfstol) as generated by hfst-lookup-optimize. The default output format and, to a large extent, the command-line options are equivalent to those of lt-proc, a component of Apertium's lttoolbox. The input text must conform to the Apertium stream format. The output of this command with the -x option should in most cases be equivalent to the output of the hfst-lookup command (and of the Xerox tools with certain settings). The output of the --cg option should be compatible with vislcg3.

Examples

  apertium-destxt sun-and-northwind.txt | hfst-proc english.hfstol

Converts the text file sun-and-northwind.txt into the Apertium stream format (using apertium-destxt from an Apertium installation), analyses it with the transducer file english.hfstol, and prints the result in Apertium format.
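
Output in Apertium format, such as the `^daba/dar<vblex><pii><p1><sg>/dar<vblex><pii><p3><sg>$' example shown under --analysis, can be post-processed with a few lines of code. The following is a minimal sketch, not part of hfst-proc: the function names are hypothetical, and it assumes that the stream contains no backslash-escaped `^', `/' or `$' characters, which a robust parser would have to handle.

```python
import re

# One lexical unit: everything between '^' and the next '$'.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_lexical_units(stream):
    """Return a list of (surface_form, [analyses]) tuples.

    Simplification: escaped delimiters are not handled.
    """
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


def split_analysis(analysis):
    """Split one analysis into its lemma and its list of tags."""
    lemma = analysis.split("<", 1)[0]
    return lemma, TAG.findall(analysis)


sample = "^cantar/cantar<vblex><inf>$ ^daba/dar<vblex><pii><p1><sg>/dar<vblex><pii><p3><sg>$"
for surface, analyses in parse_lexical_units(sample):
    print(surface, [split_analysis(a) for a in analyses])
```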

  cat kalevala.txt | hfst-proc --raw --cg omorfi.hfstol

Reads the text file kalevala.txt in raw text mode, that is, without treating the special characters of the Apertium stream format specially, analyses it with omorfi.hfstol, and prints the result in a CG-compatible format.
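
When hfst-proc is called from a script, it can help to assemble the command line in one place. The sketch below only builds the argument list from the long options documented on this page; the function name and its parameters are hypothetical, and nothing is executed.

```python
def build_hfst_proc_command(transducer, mode="analysis", output_format=None,
                            max_analyses=None, raw=False, null_flush=False,
                            input_file=None, output_file=None):
    """Build an argument list for hfst-proc using documented long options."""
    mode_flags = {
        "analysis": "--analysis",
        "generation": "--generation",
        "tokenize": "--tokenize",
    }
    format_flags = {"apertium": "--apertium", "cg": "--cg", "xerox": "--xerox"}

    command = ["hfst-proc", mode_flags[mode]]
    if output_format is not None:
        # The output format options apply to analysis only.
        if mode != "analysis":
            raise ValueError("output formats are only available for analysis")
        command.append(format_flags[output_format])
    if max_analyses is not None:
        command.append(f"--analyses={max_analyses}")
    if raw:
        command.append("--raw")
    if null_flush:
        command.append("--null-flush")

    command.append(transducer)
    if output_file is not None and input_file is None:
        # output_file is the second optional positional argument.
        raise ValueError("output_file requires input_file")
    if input_file is not None:
        command.append(input_file)
    if output_file is not None:
        command.append(output_file)
    return command


print(build_hfst_proc_command("omorfi.hfstol", output_format="cg", raw=True))
```

The resulting list can be passed to Python's subprocess.run, with the text supplied on standard input, provided that hfst-proc is installed and on the PATH.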

Obtaining the program and installing

hfst-proc is part of HfstCommandlineTools and is installed by default.