Warning: Kitwiki pages are no longer updated; see the Google Code wiki instead.

Omorfi: an open source morphology for the Finnish language

Omorfi is an open source implementation of the morphology of the Finnish language. It uses the GPL-licensed HFST tools to implement the morphological description. The morphology implementation itself is licensed under the GNU GPL version 3, but not necessarily any later version. Omorfi uses the LGPL-licensed Nykysuomen sanalista and the GPL-licensed Joukahainen as its primary sources of lexical data.

Downloading

Omorfi project data is hosted on the Gna! service. The omorfi download directory contains release packages. The development version is available from omorfi’s Gna! SVN repository. On the Gentoo distribution, omorfi can be installed from the science overlay.

Dependencies

Installation requires:

  • HFST tools:
    • hfst-lexc, hfst-twolc, hfst-compose-intersect, and hfst-invert are required for the build process.
    • hfst-remove-epsilons and hfst-lookup-optimize are needed to build fast lookup transducers.
    • hfst-diff-test and hfst-optimized-lookup are needed for basic functionality and regression testing.
  • python is needed for basic functionality and regression testing, and to rebuild lexc lexicons from CSV data.
  • An XSLT processor supporting XSLT 2.0 is needed to extract CSV data from XML-based lexical sources:
    • ``java`` must be able to access ``net.sf.saxon.Transform`` in the existing environment, or
    • a script named ``saxon``, ``saxon8``, ``saxon9``, or ``saxonb-xslt`` must execute it
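
Before running configure, it can be convenient to check which of the tools listed above are on PATH. The following is only a minimal sketch: the function name missing_tools is made up for this example and is not part of omorfi, and configure performs its own checks in any case.

```shell
# Sketch (not part of omorfi): report which of the given commands
# cannot be found on PATH. Prints one missing name per line and
# returns non-zero if at least one command is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call with tool names taken from the list above:
# missing_tools hfst-lexc hfst-twolc hfst-compose-intersect hfst-invert python
```

An empty output means every listed command was found.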

The final transducers can be used with HFST tools, HFST runtime tools, SFST tools or OpenFST tools (the last two are untested but should work).

Installation

Installation uses the standard autotools system:

  ./configure && make && make install

Compilation may take a very long time, depending on the hardware you are using. The stable versions should compile on an average end-user laptop, such as my Acer Aspire One.

If configure cannot find HFST tools, you must tell it where to find them:

  ./configure --with-hfst=${HFSTPATH}

The autotools system also supports installation to, e.g., your home directory:

  ./configure --prefix=${HOME}

If you are using a CVS or SVN checkout, you must first generate the necessary autotools files on the host system:

  ./autogen.sh

For further instructions, see INSTALL, the GNU standard install instructions for autotools systems.
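
The steps above can be combined into one helper. This is only a sketch under assumptions: the function name build_omorfi is invented for this example, it runs autogen.sh only when no configure script exists yet, and it passes --with-hfst only when the HFSTPATH variable is set.

```shell
# Sketch (not part of omorfi): configure, build and install from a
# source directory into a chosen prefix (default: $HOME).
# Usage: build_omorfi SRCDIR [PREFIX]
build_omorfi() {
    srcdir=$1
    prefix=${2:-$HOME}
    (
        cd "$srcdir" || exit 1
        # Repository checkouts ship autogen.sh instead of configure
        if [ ! -x ./configure ] && [ -x ./autogen.sh ]; then
            ./autogen.sh || exit 1
        fi
        # Tell configure where HFST lives only if HFSTPATH is set
        if [ -n "${HFSTPATH:-}" ]; then
            ./configure --prefix="$prefix" --with-hfst="$HFSTPATH" || exit 1
        else
            ./configure --prefix="$prefix" || exit 1
        fi
        make && make install
    )
}
```

The work happens in a subshell, so the caller's working directory is left unchanged and the function's exit status reflects the first step that failed.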

Usage

The final installation contains transducers named ???-omorfi.* in the directory specified by the configure command. The default is $prefix/share/omorfi/, which on typical Linux systems is /usr/local/share/omorfi/. The installed files carry the suffixes .hwfst and .hwfst.ol, corresponding to the weighted and the optimized versions of the transducers. The former can be used for all sorts of modifications and dynamic transducer operations, whilst the latter can only do one-way lookup.

For files of the form ???-omorfi.*, the ??? can be one of:

  • mor: for morphological analysis
  • spl: for spell checking
  • hyp: for dictionary-backed hyphenation
  • sug: for suggestions for misspelt word forms
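
The naming scheme can be wrapped in a small helper that builds the path of a transducer from the task prefix and the file suffix. This is a sketch only: the names omorfi_fst and OMORFI_DIR are made up for this example, and the default directory assumes the default /usr/local prefix.

```shell
# Sketch (not part of omorfi): print the path of an installed
# transducer. OMORFI_DIR is a hypothetical variable; its default
# assumes installation under the default /usr/local prefix.
OMORFI_DIR=${OMORFI_DIR:-/usr/local/share/omorfi}

# Usage: omorfi_fst TASK [SUFFIX]
#   TASK   is one of mor, spl, hyp, sug
#   SUFFIX is hwfst or hwfst.ol (default: hwfst.ol)
omorfi_fst() {
    task=$1
    suffix=${2:-hwfst.ol}
    case $task in
        mor|spl|hyp|sug) ;;
        *) printf 'unknown task: %s\n' "$task" >&2; return 1 ;;
    esac
    case $suffix in
        hwfst|hwfst.ol) ;;
        *) printf 'unknown suffix: %s\n' "$suffix" >&2; return 1 ;;
    esac
    printf '%s/%s-omorfi.%s\n' "$OMORFI_DIR" "$task" "$suffix"
}
```

For instance, omorfi_fst hyp prints the path of the optimized hyphenation transducer, which could then be handed to hfst-optimized-lookup.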

The hyphenation transducer is composed of the morphology and the hyphenation relation given in hyphenation.hwfst, which can also be used on its own, without the dictionary, to hyphenate word forms not found in the dictionary. Likewise, the suggestion transducer is composed of a suggestion relation and the dictionary, and the suggestion relation can be tested without the dictionary by using, e.g., the file simple-edit-distance-2.hwfst, which for any given string generates all strings with Levenshtein distance of 2.

Examples

Assuming a tokenised file named words with one word per line, analysis can be done with the following command (using the optimized transducer):

  hfst-optimized-lookup ${prefix}/share/omorfi/mor-omorfi.hwfst.ol < words
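
The command above assumes that the input is already tokenised. As a rough sketch of how such a file could be produced from running text, the following splits on whitespace only; punctuation stays attached to the words, so real text will usually need a proper tokeniser. The function name one_word_per_line and the file name text.txt are placeholders for this example.

```shell
# Sketch: turn whitespace-separated text on stdin into one token per
# line. Runs of whitespace are squeezed and empty lines are dropped.
# Punctuation is NOT separated from the words.
one_word_per_line() {
    tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Hypothetical usage (text.txt is a placeholder input file):
# one_word_per_line < text.txt > words
```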

To hyphenate the same words with dictionary:

  hfst-optimized-lookup ${prefix}/share/omorfi/hyp-omorfi.hwfst.ol < words

To hyphenate without dictionary:

  hfst-lookup ${prefix}/share/omorfi/hyphenation.hwfst < words

To generate all suggestions for misspelt words:

  hfst-lookup ${prefix}/share/omorfi/sug-omorfi.hwfst < words

This method is very slow, however, and does not work with a suggestion relation that generates infinitely many possibilities. An alternative would be:

...

On character codings

The implementation of the morphology uses UTF-8 encoded Unicode. This may cause some problems with two word-internal characters that have less ambiguous Unicode variants than their ASCII versions, namely U+2019 RIGHT SINGLE QUOTATION MARK and U+2010 HYPHEN. The analysis should work both with them and with the legacy 0x27 APOSTROPHE and 0x2D HYPHEN-MINUS, but this variation may still cause problems.
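
If the variation does cause problems in some processing chain, one possible workaround is to map the Unicode variants to their legacy counterparts before lookup. The following is a sketch only; it assumes UTF-8 input and that losing the distinction is acceptable for the application. The function name normalise_punctuation is made up for this example.

```shell
# Sketch: replace U+2019 RIGHT SINGLE QUOTATION MARK with 0x27
# APOSTROPHE and U+2010 HYPHEN with 0x2D HYPHEN-MINUS on stdin.
# The characters are built from their UTF-8 bytes (given in octal) so
# that the script itself stays pure ASCII.
normalise_punctuation() {
    rsquo=$(printf '\342\200\231')   # U+2019 as UTF-8
    hyphen=$(printf '\342\200\220')  # U+2010 as UTF-8
    sed -e "s/$rsquo/'/g" -e "s/$hyphen/-/g"
}

# Hypothetical usage with the tokenised file from the examples:
# normalise_punctuation < words > words.ascii
```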

Programming and project management

Omorfi's rulesets and code are free and libre open source software, modifiable and redistributable by anyone. When participating in the project, it is recommended to follow the rules common to the majority of free and open source projects, such as the GNU project style guide <http://www.gnu.org/prep/standards/standards.html>, the autobook <http://sources.redhat.com/autobook/> (esp. 9.1.1), and the instructions in the project’s HACKING file. Bugs should be reported through omorfi’s Gna! bug tracker <http://gna.org/bugs/?group=omorfi>.

Further reading

The documentation here is not updated as often as the documentation on Omorfi’s homepage on the Gna! service. The pages on Gna! contain many automatically generated documents that I haven't had time to integrate with the TWiki pages at Kitwiki (e.g. the automatic test suite logs).


Topic revision: r5 - 2012-04-02 - TommiPirinen
 