HFST: Overview and Quick Start

Overview

The Helsinki Finite-State Transducer toolkit (HFST) is intended for processing natural language morphologies. The toolkit is demonstrated by wide-coverage implementations for a number of languages of varying morphological complexity.

Our web demos show various applications that the HFST toolkit can be used for, including morphological analysis, spell-checking and correction, translation, hyphenation and finding synonyms.

Basic terminology

A finite-state transducer (FST) is a data structure that recognizes a set of strings and transduces (i.e. translates) each string into another string. A morphological transducer recognizes the words of a given language and produces an analysis for each word. The analysis usually contains the base form of the word and its part of speech, followed by morphological information, e.g. person, gender, number, tense, aspect, mood, voice, comparison etc.

input: 
  This is a test.
output: 
  This[DET] 
  be[V]+V+3sg+PRES 
  a[ART] 
  test[N]+N 

The same example for Finnish:

input:
  Tämä on testi.
output:
  Tämä Pron Dem Nom Sg
  olla V Prs Act Sg3
  testi N Nom Sg

A transducer can also be applied in the opposite direction, generating inflected forms from a base form and morphological information:

input:
  Tämä Pron Dem Nom Sg olla V Prs Act Sg3 testi N Nom Sg
output:
  Tämä
  on
  testi

Note that morphological transducers often give multiple analyses per word and the user must disambiguate the results by choosing the correct analyses. For simplicity, the examples above only give one result for each word.
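The two directions and the ambiguity described above can be illustrated with a toy model. The following is a minimal sketch, not how HFST works internally: a real transducer is a state machine, while this example only stores a relation as a lookup table. The names PAIRS, build_table and lookup are hypothetical, and the two analyses for "tests" are invented for illustration in the style of the English example.

```python
from collections import defaultdict

# Toy relation between surface forms and analyses.
# The Finnish pairs are taken from the examples above; the two
# "tests" analyses are hypothetical and only demonstrate ambiguity.
PAIRS = [
    ("Tämä", "Tämä Pron Dem Nom Sg"),
    ("on", "olla V Prs Act Sg3"),
    ("testi", "testi N Nom Sg"),
    ("tests", "test[N]+N+PL"),
    ("tests", "test[V]+V+3sg+PRES"),
]


def build_table(pairs):
    """Map each source string to the list of all its target strings."""
    table = defaultdict(list)
    for source, target in pairs:
        table[source].append(target)
    return dict(table)


# Analysis direction: surface form -> analyses.
ANALYZE = build_table(PAIRS)
# Generation direction: the same relation with each pair swapped.
GENERATE = build_table((analysis, surface) for surface, analysis in PAIRS)


def lookup(table, key):
    """Return every result for key, or an empty list if it is not recognized."""
    return table.get(key, [])


print(lookup(ANALYZE, "on"))
print(lookup(GENERATE, "olla V Prs Act Sg3"))
print(lookup(ANALYZE, "tests"))   # more than one analysis
print(lookup(ANALYZE, "xtesti"))  # not in the relation
```

Swapping the pairs is the table-based counterpart of inverting a transducer: the same relation is read from the other side.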

Transducers can also perform tasks other than morphological analysis. A spell-checker checks each word for misspellings and outputs suggested corrections. A translator takes text in language X as input and produces the corresponding output in language Y. A hyphenator breaks its input into syllables separated by hyphens.

Downloads

Downloadable language files with source code:

Downloadable utilities and libraries with source code:

Tool User Quick Start

Download and compile a lexicon

Once HFST is installed, download the Finnish lexicon source file from:

http://hfst.github.io/downloads/finntreebank.lexc

and run the commands listed at the beginning of the file:

hfst-lexc -v -f foma finntreebank.lexc -o finntreebank.inverted.hfst
hfst-invert -v finntreebank.inverted.hfst -o finntreebank.debug.hfst
hfst-fst2fst -v finntreebank.debug.hfst -f olw -o finntreebank.hfst
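If the three steps above need to be repeated for several lexicons, they can be wrapped in a small script. The following is a sketch under assumptions: the function names build_commands and compile_lexicon and the stem parameter are hypothetical, and the command-line flags are copied unchanged from the commands above. It assumes the HFST tools are on the PATH when compile_lexicon is actually called.

```python
import subprocess


def build_commands(lexc_file, stem):
    """Return the three compile steps as argument lists.

    The intermediate file names follow the pattern used above:
    <stem>.inverted.hfst -> <stem>.debug.hfst -> <stem>.hfst
    """
    inverted = f"{stem}.inverted.hfst"
    debug = f"{stem}.debug.hfst"
    final = f"{stem}.hfst"
    return [
        ["hfst-lexc", "-v", "-f", "foma", lexc_file, "-o", inverted],
        ["hfst-invert", "-v", inverted, "-o", debug],
        ["hfst-fst2fst", "-v", debug, "-f", "olw", "-o", final],
    ]


def compile_lexicon(lexc_file, stem, run=subprocess.run):
    """Run the steps in order; check=True stops at the first failing step."""
    for command in build_commands(lexc_file, stem):
        run(command, check=True)


# Dry run: only print the commands that would be executed.
for command in build_commands("finntreebank.lexc", "finntreebank"):
    print(" ".join(command))

# To actually compile (requires the HFST tools to be installed):
# compile_lexicon("finntreebank.lexc", "finntreebank")
```

Passing each command as an argument list avoids shell quoting problems with file names. The run parameter exists only so that the function can be tried out with a stand-in instead of subprocess.run.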

You may also download precompiled lexicons for various languages from:

https://sourceforge.net/projects/hfst/files/resources/morphological-transducers/

Use the lexicon

You can try out the Finnish lexicon with some word, e.g. "testi":

echo "testi" | hfst-lookup finntreebank.hfst

and you should get the line:

testi testi<N><sg><nom> 0.000000

Try a non-word:

echo "xtesti" | hfst-lookup finntreebank.hfst

and you should get:

xtesti xtesti+? inf
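When the lookup results are processed by a program, known and unknown words can be separated by reading the three fields of each line. The following is a minimal sketch, assuming each result line has the form "input, analysis, weight" as in the two examples above, with fields separated by tabs (or by whitespace if no tabs are present), and that an unrecognized word is marked by an analysis ending in "+?" with weight inf. The function names parse_lookup_output and is_unknown are hypothetical.

```python
import math
from collections import defaultdict


def parse_lookup_output(text):
    """Group result lines into {input word: [(analysis, weight), ...]}."""
    results = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines between results
        # Prefer tabs when present; otherwise fall back to whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 3:
            continue  # not a "word analysis weight" line
        word = fields[0]
        weight = float(fields[-1])  # float("inf") is accepted here
        analysis = " ".join(fields[1:-1])
        results[word].append((analysis, weight))
    return dict(results)


def is_unknown(analyses):
    """True when every result is the '+?' marker with infinite weight."""
    return all(a.endswith("+?") and math.isinf(w) for a, w in analyses)


# Sample text assembled from the two result lines shown above.
sample = "testi\ttesti<N><sg><nom>\t0.000000\n\nxtesti\txtesti+?\tinf\n"
parsed = parse_lookup_output(sample)
for word, analyses in parsed.items():
    status = "unknown" if is_unknown(analyses) else "known"
    print(word, status, analyses)
```

Keeping a list per input word leaves room for words that receive several analyses.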

Other lookup tools

The hfst-proc tool does some useful things with capital letters, but may be slightly slower. It accepts running text, not only single words:

cat your-text | hfst-proc [--xerox] finntreebank.hfst

On the other hand, if you need speed, e.g. when you have millions of words to analyze, you may wish to feed your list of words to the lookup command:

cat your-list | hfst-lookup finntreebank.hfst

All commands take parameters that change the formatting of the output. The --help option describes them, e.g.

hfst-lookup --help


-- ErikAxelson - 2012-04-04