HFST: Creating spell-checkers

Creating a spell-checker from an HFST-based dictionary is relatively simple. Such spell-checkers can easily be integrated into systems like OpenOffice.org, enchant, the Mac OS spell service, etc.

To begin, you need a language model, that is, a dictionary. HFST supports the following styles of dictionaries:

  • lexc, twolc, xfst
  • hunspell
  • word list
  • finite-state automaton from another source

Here is a brief recap of how to compile each of them.

Xerox lexc, twolc and xfst

  1. Use hfst-lexc to compile lexc files; for further instructions see HfstLexcWrapper
  2. Use hfst-twolc to compile twolc files; for further instructions see HfstTwolc
  3. Use hfst-compose-intersect to combine lexc lexicons and twolc rules
  4. Use hfst-xfst to compile xfst scripts. If the script isn't self-contained you may need to modify it.
  5. Use hfst-project -p output to extract the dictionary from the morphological analyser. Remember that in HFST tools a morphological analyser built in the Xerox style is by default a generator, so the surface word forms are on the output side.

Hunspell dictionaries

N.B.: hunspell compilation is preliminary.

  1. Use hunspell-aff2lexc+twolc to translate the affix file into a lexicon and rules
  2. Use hunspell-dic2lexc to translate the dictionary into a lexicon
  3. Use hfst-lexc to compile the dictionary and affix lexicons
  4. Use hfst-twolc to compile the deletion rules
  5. Use hfst-compose-intersect to compose the rules with the lexicon
  6. Use hfst-project -p lower to create the dictionary

Wordlist dictionaries

  1. Use hfst-strings2fst to create dictionary
  2. That is all.
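
A raw word list often contains blank lines, stray whitespace and duplicates, so it can help to clean it before compiling. The following is a minimal sketch, assuming the word list has one entry per line; the function name normalize_wordlist is hypothetical and is not part of HFST.

```python
import sys


def normalize_wordlist(lines):
    """Return the unique, non-empty words from lines, sorted."""
    words = set()
    for line in lines:
        word = line.strip()  # drop surrounding whitespace and the newline
        if word:             # skip blank lines
            words.add(word)
    return sorted(words)


if __name__ == "__main__":
    # Read a raw list on stdin and write the cleaned list to stdout,
    # one word per line (assumed input layout for the compile step).
    for word in normalize_wordlist(sys.stdin):
        print(word)
```

The cleaned output can then be given to hfst-strings2fst as in step 1.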

Edit distance

  1. Use hfst-txt2fst on the attached edit-distance-1 file to generate an edit distance error model
  2. Use hfst-repeat to generate larger edit distances
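
If the attached file is not at hand, a comparable text description can be generated. The sketch below writes an edit-distance-1 transducer in the tab-separated AT&T text format that hfst-txt2fst reads. It is not the attached file itself: the function name, the two-state layout, the uniform weight of 1.0 and the restriction to insertion, deletion and substitution (no transposition) are all assumptions made for illustration.

```python
EPSILON = "@0@"  # epsilon symbol as written in HFST's AT&T text format


def edit_distance_1_att(alphabet, weight=1.0):
    """Return AT&T-format lines for a transducer allowing at most one edit.

    State 0 means no edit has been made yet, state 1 means one edit has
    been made. Both states are final.
    """
    letters = sorted(set(alphabet))
    lines = []
    for a in letters:
        # Identity arcs copy a letter unchanged at no cost.
        lines.append(f"0\t0\t{a}\t{a}\t0.0")
        lines.append(f"1\t1\t{a}\t{a}\t0.0")
        # Deletion: the input letter maps to nothing.
        lines.append(f"0\t1\t{a}\t{EPSILON}\t{weight}")
        # Insertion: a letter appears from nothing.
        lines.append(f"0\t1\t{EPSILON}\t{a}\t{weight}")
        # Substitution: the letter maps to any other letter.
        for b in letters:
            if a != b:
                lines.append(f"0\t1\t{a}\t{b}\t{weight}")
    # Final states with zero final weight.
    lines.append("0\t0.0")
    lines.append("1\t0.0")
    return lines


if __name__ == "__main__":
    # Print a small model for a three-letter alphabet.
    print("\n".join(edit_distance_1_att("abc")))
```

Because both states are final and carry identity loops, concatenating this transducer with itself n times accepts up to n edits, which is what the hfst-repeat step relies on.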

Hunspell error models

When using hunspell compilers, you can use hunspell's own error models instead of basic edit distance.

  1. Use hfst-* tools to compile and combine models

Wrapping up spell-checker

Version 0.1.1 of hfst-ospell and the corresponding versions of voikko require a directory $prefix/lib/voikko/2/mor-hfst-$LL/ containing:

  • spl.hfstol: the dictionary against which spelling is checked; use hfst-fst2fst to generate this from the dictionary
  • err.hfstol: the error model applied in correction; use hfst-fst2fst to generate this from the error model
  • alphabet.hfstol: the letters used in the words of the language in the dictionary; use hfst-fst2fst to generate this from the error model
  • sug.hfstol: the dictionary used for suggestions; use hfst-fst2fst to generate this from the dictionary
  • voikko-fi_FI.pro: some metadata; use the attached template
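
Before installing, it may be worth checking that the directory is complete. The following is a minimal sketch that only tests for the presence of the five files listed above; the function name missing_files is hypothetical, and the metadata file name is taken verbatim from the list, so it may need adjusting for other languages.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = (
    "spl.hfstol",
    "err.hfstol",
    "alphabet.hfstol",
    "sug.hfstol",
    "voikko-fi_FI.pro",
)


def missing_files(directory):
    """Return the required file names that are absent from directory."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    # Usage: pass the mor-hfst-$LL directory as the only argument.
    missing = missing_files(sys.argv[1])
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all required files are present")
```

An empty result means every file is present; the check says nothing about whether the files themselves are valid.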

Topic revision: r2 - 2014-02-10 - ErikAxelson
 