hfst-guessify

The guesser compiler is part of the HFST (University of Helsinki Finite State Transducer interface) finite-state toolkit distribution. It creates a weighted transducer for suffix-based guessing. The tool is licensed under the GNU GPL version 3 (other licences may be available on request). The licence text can be found in the file COPYING.

Downloading

HFST guessify is part of the HFST package.

Dependencies

No external dependencies

Installation

The binary is installed as part of the HFST package; when building from the source distribution, running make install is sufficient.

Usage

Usage: hfst-guessify [OPTIONS...] [INFILE]
Compile a morphological analyzer into a guesser and generator.

Common options:
  -h, --help             Print help message
  -V, --version          Print version info
  -v, --verbose          Print verbosely while processing
  -q, --quiet            Only print fatal errors and requested output
  -s, --silent           Alias of --quiet
Input/Output options:
  -i, --input=INFILE     Read input transducer from INFILE
  -o, --output=OUTFILE   Write output transducer to OUTFILE
Guesser options:
  -p, --default-penalty           Give penalty for skipping one
                                  symbol of input (1.0 by default).
  -G, --do-not-compile-generator  When compiling the guesser, do
                                  not compile a model form
                                  generator.

All analyses in the morphological analyzer should have the form:
w o r d f o r m POS [GUESS_CATEGORY=CLASS] X Y Z ...
where POS is the part-of-speech tag, [GUESS_CATEGORY=CLASS]
is an inflectional category marker and X, Y and Z are inflectional
markers. The form of the inflectional category marker is fixed.
CLASS can be any string, which doesn't contain "]".

Using the option -G will reduce the size of the guesser file by
approximately half, but may substantially increase the load time of
the guesser when generating model forms. If you only need to guess
analyses of unknown word forms, -G has no effect on load time.

If OUTFILE or INFILE is missing or -, standard streams will be used.

Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>

The class marker X in [GUESS_CATEGORY=X] can be any sequence of UTF-8 characters that does not contain "]".
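
The marker format can be checked on plain analysis strings before compiling anything. The following is a minimal sketch using only sed; the function name guess_class and the two sample strings (shortened from the Finnish example) are illustrative assumptions, not part of HFST.

```shell
# Print the CLASS of the [GUESS_CATEGORY=CLASS] marker in an analysis string,
# or nothing if the string contains no well-formed marker.
guess_class() {
    printf '%s\n' "$1" | sed -n 's/.*\[GUESS_CATEGORY=\([^]]*\)].*/\1/p'
}

# Illustrative analysis strings: one with the required marker, one without.
ok="[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][GUESS_CATEGORY=9][NUM=SG][CASE=NOM]"
bad="[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM]"

class_ok=$(guess_class "$ok")
class_bad=$(guess_class "$bad")
printf 'with marker: "%s"  without marker: "%s"\n' "$class_ok" "$class_bad"
```

An empty result signals that the analysis still needs its category marker rewritten before hfst-guessify can use it.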

Option -p specifies the penalty for each symbol skipped during guessing (1.0 by default).

Option -G prevents the compilation of a model form generator. The resulting guesser is smaller and compiles faster, but a generator then has to be compiled whenever model forms are generated. Use -G if you don't need to generate model forms.

Examples

Finnish

We first demonstrate compiling a guesser from the Omorfi morphology for Finnish. Analyses in Omorfi consist of a lemma followed by a word class, a declension class marker and inflectional information. For example, the Finnish nominative singular noun "kissa" (cat), which belongs to declension class 9, is analyzed as:

[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM][CASECHANGE=NONE]

First of all, if the morphological analyzer is in optimized lookup format, it has to be converted into the tropical-weight OpenFst format:

hfst-fst2fst -t -o morphology.omor.hfst morphology.omor.ol

In order to transform Omorfi into a guesser, the declension class markers, e.g. [KTN=9], need to be rewritten into the form [GUESS_CATEGORY=9]. We accomplish this using HfstSubstitute. We first create a substitution file for HfstSubstitute, using HfstSummarize:

hfst-summarize -v morphology.omor.hfst |                             # Display a list of all symbols in the transducer
grep -A1 "sigma set:" |                                              # and keep the line that follows "sigma set:".
tail -1 |
sed 's/, /\n/g' |                                                    # Put one symbol on each line (GNU sed).
grep "\[KTN=" |                                                      # Select the declension class symbols.
sed 's/\[KTN=\(.*\)/&\t[GUESS_CATEGORY=\1/' > decl_substitutions     # Append a tab and "[GUESS_CATEGORY=X]" to every
                                                                     # line "[KTN=X]". Store the resulting two-column
                                                                     # list in the file decl_substitutions.
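
To see what the substitution file should contain without running HFST, the text-processing part of the pipeline can be tried on a hand-written symbol list. This is only a sketch: the sample list is a made-up, much shortened stand-in for the line that hfst-summarize prints, and the function name make_substitutions is hypothetical. It uses tr and printf instead of the GNU sed escapes, so it should also work with other sed implementations.

```shell
# Turn a comma-separated symbol list (read from standard input) into
# two-column lines of the form "[KTN=X]<TAB>[GUESS_CATEGORY=X]".
make_substitutions() {
    tab=$(printf '\t')
    tr ',' '\n' |
    sed 's/^ *//' |
    grep '^\[KTN=' |
    sed "s/^\[KTN=\(.*\)]\$/&${tab}[GUESS_CATEGORY=\1]/"
}

# Made-up sample of a symbol line; the real list is far longer.
sample='[CASE=NOM], [KTN=9], [KTN=10], [NUM=SG]'

substitutions=$(printf '%s\n' "$sample" | make_substitutions)
printf '%s\n' "$substitutions"
```

Each output line pairs an original declension class symbol with its replacement, separated by a single tab.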

We then substitute the declension class markers:

hfst-substitute -F decl_substitutions -o morphology.omor.hfst.subst morphology.omor.hfst

For the word "kissa", HfstLookUp now gives

[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][GUESS_CATEGORY=9][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM][CASECHANGE=NONE]

Before converting the analyzer into a guesser, we need to filter out compound words, because it is computationally too demanding to compile the entire analyzer into a guesser.

All analyses stemming from the productive compounding mechanism in Omorfi contain [GUESS=COMPOUND] tags between the parts of the compound. We filter out all paths that contain such tags. We first use HfstRegexp2Fst to compile a filter transducer and then compose Omorfi with that transducer:

echo "[ ? - %[GUESS%=COMPOUND%] ]*" | hfst-regexp2fst > filter_compounds

hfst-compose -F -1 morphology.omor.hfst.subst -2 filter_compounds |         
hfst-minimize > morphology.omor.hfst.subst.no_compounds
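
The effect of the filter can be pictured on plain analysis strings: every analysis containing a [GUESS=COMPOUND] tag is dropped and all others pass through unchanged. The sketch below mimics this with grep on two invented, shortened analyses; it illustrates the intended result only and does not operate on transducers. The function name drop_compounds is hypothetical.

```shell
# Drop every analysis string that contains a [GUESS=COMPOUND] tag.
drop_compounds() {
    grep -vF '[GUESS=COMPOUND]'
}

# Invented sample analyses; the second imitates a productive compound.
analyses="[LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM]
[LEMMA='kissa'][POS=NOUN][KTN=9][GUESS=COMPOUND][LEMMA='talo'][POS=NOUN][KTN=1][NUM=SG][CASE=NOM]"

kept=$(printf '%s\n' "$analyses" | drop_compounds)
printf '%s\n' "$kept"
```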

All that remains now is converting the modified morphology into a guesser:

hfst-guessify -v  morphology.omor.hfst.subst.no_compounds > omorfi.guesser.hfst

Erzya

We demonstrate transforming an Erzya morphological analyzer into a guesser. For the word "катка" (cat) the analyzer gives two analyses:

катка<N><Sg><Nom><Indef><Pred><Ind><NonPast>+ScSg3
катка<N><Sg><Nom><Indef>

The analyses consist of a word class marker (<N>, <V>, ...) followed by inflectional tags (<Ind>, <Nom>, ...). There are no declension class markers. In order to transform the analyzer into a guesser, we thus first have to add dummy declension class markers after the word class markers. The lack of declension class markers is not a problem for guessing, but when generating model forms using HfstGuess, there may be a lot of ambiguity if declension classes aren't separated.

Erzya has extensive derivational morphology. Derivational affixes can be added to inflected forms, which can then be further inflected and/or subjected to more derivational morphology. We add the declension class marker after the last word class marker, since we are not interested in derivational information for guesses: it is bound to be quite unreliable.

This can be accomplished by composing the analyzer with a regular expression that finds the last word-class symbol and adds a [GUESS_CATEGORY=1] tag after it. Schematically, the regular expression looks like:

[? - TAG]* [?* DER]* TAG 0:%[GUESS%_CATEGORY%=1%] [? - DER]*

The sub-expressions TAG and DER are replaced by expressions matching all tags and all derivational tags, respectively. The expression inserts the dummy declension class symbol after the first tag following the last derivational symbol. In practice this is the same as the last word class symbol. [The expression could be more straightforward if I knew for sure which of the tags denote word classes...]
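
For analyses without any derivational tags, the expression simply inserts the marker after the first tag. That simple case can be mimicked on a plain string with sed, as in the sketch below. It is a string-level illustration only: it does not handle derivational tags, it is not the transducer operation itself, and the function name add_dummy_class is hypothetical.

```shell
# Insert the dummy marker after the first <...> tag of an analysis string.
# Only meaningful for analyses that contain no derivational tags.
add_dummy_class() {
    sed 's/<[^>]*>/&[GUESS_CATEGORY=1]/'
}

marked=$(printf '%s\n' 'катка<N><Sg><Nom><Indef>' | add_dummy_class)
printf '%s\n' "$marked"
```

The result has the shape that hfst-guessify expects: word form, word class tag, category marker, then the inflectional tags.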

We compile the expression into a transducer exp, compose it with the Erzya morphological analyzer myv-mor.hfst and minimize the result:

hfst-compose -1 myv-mor.hfst -2 exp | 
hfst-minimize > myv.decl.hfst

Finally, we transform the result into a guesser:

hfst-guessify myv.decl.hfst -o myv.guesser.hfst

-- MiikkaSilfverberg & TommiPirinen & KristerLinden