HFST: an example of compiling omorfi

You may also read the full and up-to-date story at omorfi's home page and the Omorfi wiki.

Omorfi is an implementation of Finnish morphology on top of electronic dictionaries of the Finnish language. The system consists of several modules:

  • The combinatorial morphotactics
  • The morphophonological variation
  • The orthographic convention rules
  • The frequency data collection and addition
  • Lexical filtering and customization
  • The hyphenation and syllabification
  • The spell checking and correction modeling

The omorfi build system uses autotools to detect the HFST tools and the automake framework for modular building of the morphology. This page briefly describes how the separate modules are built.

Morphotactic combinatorics

Morphotax can be found in ``src/morphology``

The general form of morph combinatorics is stub+stem+inflection. In the file system these are found in ``lexical-data/pos-stubs.lexc``, ``pos-stems.lexc`` and ``pos-infl.lexc``. For words that do not inflect, only ``lexical-data/pos.lexc`` exists. All classes may continue to ``common-infl.lexc``, since it governs clitics and possessive suffixes.

The implementation of morphotax is spread over a few lexc files. The central file is ``omorfi.lexc``. For this reason the compilation is a single hfst-lexc command that takes ``omorfi.lexc`` as its first input file, followed by all the other files in arbitrary order.
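The ordering constraint can be sketched as a small Python helper that assembles the argument list for hfst-lexc. This is only an illustration, not the rule used by omorfi's automake files: the helper name ``lexc_command`` and the output name ``morphology.hfst`` are made up here, and the only hfst-lexc option relied on is ``-o`` for the output file.

```python
import os

def lexc_command(lexc_files, master="omorfi.lexc", output="morphology.hfst"):
    """Return an hfst-lexc argument list with the master file first.

    `output` is a hypothetical file name chosen for this sketch.
    """
    masters = [f for f in lexc_files if os.path.basename(f) == master]
    if len(masters) != 1:
        raise ValueError("expected exactly one " + master)
    # The remaining files may come in any order; sorting only makes
    # the resulting command reproducible.
    others = sorted(f for f in lexc_files if os.path.basename(f) != master)
    return ["hfst-lexc", "-o", output, masters[0]] + others
```

The returned list can be handed to ``subprocess.run`` on a machine where the HFST tools are installed.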

Morphophonological processing

The application of phonology is in ``src/phonology``

The implementation of morphophonology is done using the twol language. The source file ``omorfi.twol`` is self-explanatory and simple, so it does not require much documentation. We compile it using hfst-twolc.

Orthographical variations

Orthography rules are found in ``src/orthography``

There are a few minor orthographical corrections made before the automaton is ready to be used as an analyser for running text. One is uppercasing: the first words of sentences are title-cased, and this should be handled in the analyser rather than in external software. Another common problem is that some people prefer typewriter-era versions of certain letters, such as š and ž, so the analyser also allows the sh and zh forms for these. The same treatment is given to hyphens, apostrophes and other marks that can be written in multiple forms.

The actual rules implementing these variations are as simple as two-arc automata or character pairs applied with the hfst-substitute program. The automata are defined as regular expressions that can be compiled by hfst-regexp2fst. This loosely corresponds to doing read regex and compose net in xfst.
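The effect of these optional variations can be illustrated without HFST at all. The following plain Python sketch enumerates the spellings that would be accepted for one dictionary form; it only mimics the relation the rules describe and is not how omorfi implements them. The function name ``surface_variants`` and the ``FALLBACKS`` table are assumptions made for this example.

```python
from itertools import product

# Typewriter-era replacements mentioned in the text (assumed table).
FALLBACKS = {"š": "sh", "ž": "zh"}

def surface_variants(word):
    """Return the set of spellings accepted for one dictionary form."""
    # Each character may be realised as itself or as its fallback.
    options = [(ch, FALLBACKS[ch]) if ch in FALLBACKS else (ch,) for ch in word]
    spelled = {"".join(choice) for choice in product(*options)}
    # Sentence-initial words may have their first letter uppercased.
    return spelled | {s[0].upper() + s[1:] for s in spelled if s}
```

For the form šakki this yields the four spellings šakki, shakki, Šakki and Shakki.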

Using frequencies and weights

Weighting data is in ``src/weights``

Omorfi can optionally use real or made-up frequency data to sort analyses in ambiguous cases. The method for applying weights is composition. The weights for word forms can be learned from a plain text corpus by a simple script performing ``sort | uniq -c | awk "{print \$2, -log(\$1/$CS)}"``, which gives standard log probabilities of tokens. A similar method could be applied to the analysis strings of large corpora, if the analyses were disambiguated. Lacking that, a hand-made automaton giving weights to single tags can be used. An example of the former can be found in ``fiwiki.weights.strings``, counted from Wikipedia. Examples of the latter can be found in ``omorfi.tagweights.regex``.

When composing the weight models you need to take care that flag diacritics are applied correctly. The strings are compiled by hfst-strings2fst, the regular expressions by hfst-regexp2fst again. The composition is done from left to right by hfst-compose in the order tagweights o automaton o surface strings, or vice versa if the automaton has been inverted to accommodate the fact that all HFST automata are upside down by default.
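The token weighting done by the shell pipeline can also be sketched in Python. The sketch assumes that ``$CS`` in the awk command stands for the corpus size in tokens, and that a "string, tab, weight" line is an acceptable input format for hfst-strings2fst; both are assumptions rather than something this page states. The function names are made up for the example.

```python
import math
from collections import Counter

def token_weights(tokens):
    """Map each token to -log(count / corpus size)."""
    counts = Counter(tokens)
    corpus_size = sum(counts.values())  # assumed meaning of $CS
    return {tok: -math.log(n / corpus_size) for tok, n in counts.items()}

def to_weighted_strings(weights):
    """Format weights as one 'string<TAB>weight' line per token."""
    return "\n".join(f"{tok}\t{w:.6f}" for tok, w in sorted(weights.items()))
```

Frequent tokens get small weights and rare tokens large ones, so the lowest-weight analysis corresponds to the most probable word form.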

Using dictionary for hyphenation

Hyphenation rules are in ``src/hyphenation``

The hyphenation model is compiled from a twol ruleset. The resulting rules can be used on their own, giving general rule-based hyphenation, or they can be applied on top of one of the omorfi dictionaries to get word and morpheme boundaries as hyphenation points. The dictionary-based hyphenation does not hyphenate words unknown to the dictionary at all, whereas the ruleset can hyphenate any string.

The twol ruleset is compiled with hfst-twolc as usual, but in order to use it as a standalone hyphenation rule automaton, the twolc ruleset needs to be collapsed together by hfst-split and hfst-intersect.

Filtering words and forms

Examples of filter rules are in ``src/filter``

It is possible to extract parts of dictionaries for specific purposes. For example, a spell checking suggestion algorithm designed for schools should not suggest obscenities, and a rule-based machine translation system may wish to analyse only the subset of words it knows. The extraction is a simple composition operation, and the filters are therefore always simply automata of the languages of wanted forms, usually something like ``?* [FOO] ?*``, to extract all forms tagged ``[FOO]``.
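What a filter of the form ``?* [FOO] ?*`` selects can be shown with a plain Python analogue operating on (surface, analysis) pairs. This is not HFST composition, only a sketch of its effect; the function names are invented, and ``[FOO]`` is the placeholder tag from the text.

```python
def keep_tagged(pairs, tag="[FOO]"):
    """Keep pairs whose analysis string contains the tag anywhere."""
    return [(surface, analysis) for surface, analysis in pairs if tag in analysis]

def drop_tagged(pairs, tag="[FOO]"):
    """Complementary filter: remove pairs whose analysis contains the tag."""
    return [(surface, analysis) for surface, analysis in pairs if tag not in analysis]
```

The first function corresponds to extracting wanted forms, the second to removing unwanted ones, such as guessed words.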

The directory contains example filters for removing guessed words and removing words not in standard dictionary style.

Regular expressions can be compiled by hfst-regexp2fst and twolc-based pruning rules with hfst-twolc.

Spelling correction

The spelling error models are in ``src/suggestion``

Omorfi includes transducers for modeling partially misspelt word forms as filters. The automata in this directory can be used for spell checking systems or for error-tolerant analysers. The automata can be composed with omorfi analysers and dictionaries, but since the result is typically rather large, it is often necessary to use a special tool that performs the composition on the fly.

The error models here are text format automata, to be compiled by hfst-txt2fst. The special composition on the fly is implemented by hfst-ospell, for example.
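The idea of applying an error model lazily, instead of building the full composition with the dictionary, can be sketched in plain Python: generate the strings one edit away from the input and keep those found in the lexicon. This is a generic edit-distance-1 illustration, not the algorithm of hfst-ospell and not omorfi's error model; ``ALPHABET``, ``edits1`` and ``suggest`` are names assumed for the example.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"  # assumed alphabet for the sketch

def edits1(word, alphabet=ALPHABET):
    """Return all strings one deletion, swap, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | swaps | replaces | inserts

def suggest(word, lexicon):
    """Return the word itself if known, else known words one edit away."""
    known = set(lexicon)
    if word in known:
        return [word]
    # Only candidates for this one input are generated, which is the
    # point of doing the composition on the fly.
    return sorted(edits1(word) & known)
```

Because candidates are generated per input word, the large composed automaton never has to be stored.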

Topic revision: r2 - 2011-01-01 - TommiPirinen