Building weighted finite-state spell-checkers with HFST tools

Presented at FSMNLP 2012, Donostia. See also the slides.

Compiling a language model from corpora

This language model is strongly inspired by Peter Norvig's nice essay about writing an English spelling corrector in one day in Python. Specifically, you can use his big.txt as a corpus to replicate his results here. For demonstration purposes, we will build our own little "corpus" online:

cat > demo-corpus.txt
Now we can write some examples examples here.
And press CTRL-D to end the file.

To extract words and frequencies from the corpus:

tr -d '[:punct:]' < demo-corpus.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -nr | awk -f tropicalize-uniq-c.awk --assign=CS=20 | hfst-strings2fst -j -f openfst -o demo-lm.hfst

Note the awk script tropicalize-uniq-c.awk: it simply takes the negative log of each frequency in the uniq -c output, divided by the corpus size, which we explicitly specify with CS=20. The script is:

{printf("%s\t%f\n", $2, -log($1/CS));}

You can find a beautified version at http://hfst.svn.sourceforge.net/viewvc/hfst/trunk/articles/fsmnlp-2012-spellers/tropicalize-uniq-c.awk?revision=2573&view=markup
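The same computation can be sketched in Python (a hypothetical re-implementation of the shell pipeline plus the awk one-liner, not part of the HFST toolchain; the sample text and corpus size CS=20 follow the examples above):

```python
import math
from collections import Counter

def tropical_weights(text, corpus_size=20):
    """Word counts turned into tropical weights: -log(count / corpus_size).
    Lower weight means more frequent, hence a better correction candidate."""
    counts = Counter(text.split())
    return {word: -math.log(c / corpus_size) for word, c in counts.items()}

weights = tropical_weights("now we can write some examples examples here")
# "examples" occurs twice, so it gets a lower (better) weight than the rest
```

hfst-strings2fst then reads such word/weight pairs and builds a weighted acceptor from them.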

Creating error models

Most error models need an automaton that accepts arbitrary runs of correctly typed characters. Let's compile it as a Xerox-notation regular expression:

echo '?*' | hfst-regexp2fst -o anystar.hfst

For now, we create the edit distance error model using Xerox notation:

echo 'a:0::10 | 0:a::10 | a:b::10 | b:0::10 | 0:b::10 | b:a::10 | ?:0::10 | 0:?::10 | ?:?::10' | hfst-regexp2fst -o edit1.hfst

Or with swaps:

echo 'a:0::10 | 0:a::10 | a:b::10 (b:a) | b:0::10 | 0:b::10 | b:a::10 (a:b) | ?:0::10 | 0:?::10 | ?:?::10' | hfst-regexp2fst | hfst-minimize -o edit1.hfst

This needs to be combined with correctly typed parts before or after the error to be usable (so we have ?* ERROR ?*):

hfst-concatenate anystar.hfst edit1.hfst | hfst-concatenate - anystar.hfst -o errm.edit1.hfst

In practice, though, we usually create the edit distance automata with a script, since they are easy to generate programmatically:

python editdist.py -d 1 -S -a demo-lm.hfst | hfst-txt2fst -o errm.edit1all.hfst

This Python script is also available at http://hfst.svn.sourceforge.net/viewvc/hfst/trunk/articles/fsmnlp-2012-spellers/editdist.py?revision=2556&view=markup
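In spirit, such a script just enumerates weighted deletion, insertion, and substitution pairs over the alphabet. A simplified sketch (a hypothetical helper, much smaller than the real editdist.py) that reproduces the hand-written edit-1 regexp above for the alphabet {a, b}:

```python
def edit1_regexp(alphabet, weight=10):
    """Build a Xerox-style regexp for one weighted edit: for each symbol,
    a deletion (x:0), an insertion (0:x), and substitutions (x:y), plus
    the catch-all ?:0, 0:? and ?:? pairs, all with the given weight."""
    parts = []
    for a in alphabet:
        parts.append(f"{a}:0::{weight}")        # delete a
        parts.append(f"0:{a}::{weight}")        # insert a
        for b in alphabet:
            if a != b:
                parts.append(f"{a}:{b}::{weight}")  # substitute a -> b
    parts += [f"?:0::{weight}", f"0:?::{weight}", f"?:?::{weight}"]
    return " | ".join(parts)

print(edit1_regexp("ab"))
```

The output for alphabet "ab" is exactly the regexp piped to hfst-regexp2fst earlier.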

To allow more edits we can, for example, repeat the error model (or compose it with itself):

hfst-repeat -f 1 -t 7 -i errm.edit1.hfst -o errm.edit7.hfst

In practice you'll want to use edit distances between 1 and 3.
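Since weights in the tropical semiring add up along a path, repeating the edit model 1 to n times simply makes a k-edit correction cost k times the per-edit weight. A tiny sketch of this arithmetic (assuming the per-edit weight 10 used in the examples above):

```python
def repeated_edit_weights(per_edit_weight=10, min_reps=1, max_reps=7):
    """Total path weight for corrections using k edits, k in
    [min_reps, max_reps], mirroring hfst-repeat -f 1 -t 7: each
    traversal of the edit model adds its weight to the path."""
    return {k: per_edit_weight * k for k in range(min_reps, max_reps + 1)}
```

This is why large edit distances quickly become useless: a 7-edit correction costs 70, so it is dominated by almost any cheaper analysis.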

Just to see how it works, we can check what it does to our words:

hfst-lookup errm.edit1all.hfst 
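Conceptually, looking a word up in the ?* EDIT ?* transducer pairs it with all strings one edit away, just like the candidate generation in Norvig's essay. A minimal Python sketch of that candidate set (swaps included, as in the swap variant of the regexp above):

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (deletion, swap, substitution, insertion)
    away from `word` -- the strings an edit-1 error model relates it to."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return deletes | swaps | replaces | inserts
```

In the speller, each such candidate is then weighted by the error model and filtered through the language model.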

The language-specific errors can be typed in as a list:

hfst-strings2fst -j -o en-orth.hfst
of:ough  5
...

Remember to end the input by CTRL-D.

To combine with edit distance, we use disjunction:

hfst-disjunct en-orth.hfst edit1.hfst | hfst-minimize > edit1+en-orth.hfst
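Disjunction in the tropical semiring keeps, for any correction that both models propose, the cheaper (minimum-weight) path. A sketch of this over weighted correction maps (hypothetical dictionaries standing in for the transducers):

```python
import math

def disjunct(model_a, model_b):
    """Tropical union of weighted correction maps: corrections present
    in both models keep the minimum of the two weights."""
    out = dict(model_a)
    for correction, weight in model_b.items():
        out[correction] = min(weight, out.get(correction, math.inf))
    return out
```

So a language-specific rule like of:ough with weight 5 wins over the generic edit path for the same correction, which would cost 10 or more.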

Now we can concatenate this combined error model with the correctly-typed-run automaton on both sides to form the full, improved edit model:

hfst-concatenate anystar.hfst edit1+en-orth.hfst | hfst-concatenate - anystar.hfst -o errm.edit1+en-orth.hfst

Now for full-word errors, we again just input a list of strings:

hfst-strings2fst -j -o en-confusables.hfst
teh:the  2.5
mispelled:misspelt  2.5
...

The final combination of the previous error models and these full-word errors is made with disjunction, since full-word errors bypass the regular edit mechanism entirely.

hfst-disjunct en-confusables.hfst errm.edit1+en-orth.hfst | hfst-minimize -o errm.everything.hfst

Turning the automata into an hfst-ospell/voikko/enchant speller for GTK text boxes and finally LibreOffice extensions

-- TommiPirinen - 2012-07-21

Topic revision: r3 - 2012-07-23 - TommiPirinen