Omorfi–SFST implementation of word form morphology of Finnish

Warning: this is not the newest version of omorfi but an initial attempt to build a Finnish morphology with open source tools.

The SFST version of omorfi is a functional implementation of morphological analysis and generation of Finnish word forms. It is not very polished and contains some known flaws, but if you wish to use it in scientific work or read a technical report about it, refer to my Master’s thesis (Pirinen 2008). Use one of the following to cite:

@mastersthesis{pirinen2008,
title = {Suomen kielen \"{a}\"{a}rellistilainen automaattinen morfologinen analyysi avoimen l\"{a}hdekoodin menetelmin},
year = {2008},
author = {Tommi Pirinen},
school = {Helsingin yliopisto}
}

or, if you really require an English citation:

@mastersthesis{pirinen2008,
title = {Automatic Finite State Morphological Analysis of Finnish Language Using Open Source Resources (in Finnish)},
year = {2008},
author = {Tommi Pirinen},
school = {University of Helsinki}
}

An alpha version of the project can be found in the CVS directory of the corpus server (see OmorfiVersionControl). Installation instructions are in the file README. CVS does not contain ready-made installation scripts; they need to be created using GNU autotools, e.g. with autoreconf -i.

The final transducers may be available in /l/contrib/appl/ling/koskenni/omorfi/. They are installed using the following:

[tpirinen@corpus3 kotus-sanalista]$ ./configure --prefix=/l/contrib/appl/ling/koskenni/omorfi/ --datadir=/l/contrib/appl/ling/koskenni/omorfi/
[tpirinen@corpus3 kotus-sanalista]$ make 
[tpirinen@corpus3 kotus-sanalista]$ make install
[tpirinen@corpus3 omorfi]$ ./configure --prefix=/l/contrib/appl/ling/koskenni/omorfi/ --datadir=/l/contrib/appl/ling/koskenni/ --enable-guesser --with-kotus-sanalista=/l/contrib/appl/ling/koskenni/omorfi/kotus-sanalista/kotus-sanalista_v1-r1.xml
[tpirinen@corpus3 omorfi]$ make 
[tpirinen@corpus3 omorfi]$ make install


Below are semi-automatic rst2xml2twiki (i.e. make twiki in doc/) translations of the documentation found in the package source tree under:

  • README
  • doc/inflection.rst
  • doc/derivation.rst
  • doc/compounding.rst
  • doc/modules.rst
  • HACKING
  • TODO
  • doc/modulechart.svg



Omorfi–Open Morphology for Finnish language

This package contains a free and open source implementation of morphological analysis for the Finnish language. It uses the GPL-licenced SFST as its implementation language. This package is licenced under GNU GPL, LGPL and AGPL version 3, but not necessarily later versions. The licences can be found in the files COPYING.*. Other licences are possible and can be granted by the authors listed in the AUTHORS file.

Downloading

Omorfi can be found on the gna! service(http://gna.org/projects/omorfi). The omorfi download directory(http://download.gna.org/omorfi/) contains release packages. The development version can be found on the gna! SVN server.

The stable version of the materials by the University of Helsinki, RILF and others can be found on the servers of the centre for scientific computing (CSC). The development version is in the CVS repository of corpus.csc.fi under /c/appl/ling/koskenni/cvsrepo(http://kitwiki.csc.fi/twiki/bin/view/KitWiki/OmorfiVersionControl).

In Gentoo(http://www.gentoo.org) Linux omorfi can be installed from science overlay(http://overlays.gentoo.org/proj/science) using portage:

layman -a science
layman -s science
emerge omorfi

or correspondingly using paludis:

paludis --install omorfi

Dependencies

Installation requires:

  • SFST, at least version 1.1, or compatible:
    • fst-compiler-utf8, fst-compact, and fst-lowmem executables are needed
  • kotus-sanalista(http://downloads.gna.org/omorfi/), version 1a or later
  • tr
  • sed
  • XSLT processor supporting XSLT 2.0, tested with Saxon8:
    • java must be able to access net.sf.saxon.Transform in existing env., or
    • script named saxon, saxon8, saxon9 or saxonb-xslt must execute it

The final transducer can be used with SFST 1.1 or compatible.

Installation

Installation uses standard autotools system:

./configure && make && make install

If configure cannot find the XSLT 2.0 processor, SFST or kotus-sanalista, their locations must be supplied using configure parameters. For more information, execute:

./configure --help

Autotools system supports installation to e.g. home directory:

./configure --prefix=${HOME}

In the CVS or SVN version you must first create the necessary autotools files on the host system:

autoreconf -i

It is a common practice not to store autotools gunk in version control system.

For further instructions, see INSTALL, the GNU standard install instructions for autotools systems.

For example, a typical installation session in corpus3.csc.fi:

[tpirinen@corpus3 omorfi]$ autoreconf
[tpirinen@corpus3 omorfi]$ ./configure --prefix=$HOME --with-kotus-sanalista=$HOME/kotus-sanalista-1a.xml --enable-guesser
[tpirinen@corpus3 omorfi]$ make
[tpirinen@corpus3 omorfi]$ make install

Usage

The final installation contains the transducers omorfi and guesser in the directory specified by the configure command, by default $prefix/share/omorfi/, which on a typical Linux system will be /usr/local/share/omorfi/. The installed files are suffixed .sfsta, .sfstc, and .sfstl, corresponding to the standard, compact and lowmem transducers. The first of these works with all transducer applications, while the other two are an order of magnitude more efficient in analysis but do not work for generation.

A tokenised file with one word per line can be analysed with:

fst-infl2 ${prefix}/share/omorfi/omorfi.sfstc tiedosto.words

It is also possible to use fst-infl with omorfi.sfsta or fst-infl3 with omorfi.sfstl, but these are slower.
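
The pairing of transducer suffixes and analysis tools described above can be summarised in a short Python helper. This is only a sketch: the function name analysis_command and the default prefix are illustrative assumptions, and the helper merely builds the command line without running it.

```python
# Analysis tool for each transducer suffix, as paired in the text above.
TOOLS = {
    "sfsta": "fst-infl",   # standard transducer
    "sfstc": "fst-infl2",  # compact transducer
    "sfstl": "fst-infl3",  # lowmem transducer
}


def analysis_command(words_file, suffix="sfstc", prefix="/usr/local"):
    """Build the argument list for analysing a one-word-per-line file.

    `prefix` is assumed to be the value given to ./configure --prefix.
    """
    if suffix not in TOOLS:
        raise ValueError(f"unknown transducer suffix: {suffix}")
    transducer = f"{prefix}/share/omorfi/omorfi.{suffix}"
    return [TOOLS[suffix], transducer, str(words_file)]


print(" ".join(analysis_command("tiedosto.words")))
```

The resulting list can be passed to subprocess.run on a system where SFST and omorfi are installed.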

Interactive interface can be launched with:

fst-mor ${prefix}/share/omorfi/omorfi.sfsta

All known word forms can be generated with:

fst-generate ${prefix}/share/omorfi/omorfi.sfsta

Guesser guesser.sfst{c,a,l} works by taking arbitrary input and trying to guess all morphological data. It can also generate arbitrary forms from given base forms and morphological data. The guesser is not installed by default, since compiling it takes forever and ever. It can be included with configure parameter --enable-guesser.

Python interface to transducers

SFST transducers can be used from Python with pysfst(http://gna.org/projects/pysfst/); the lib/ directory contains an omorfi class and a few scripts using it. The class is installed with omorfi into the system’s site-packages and can then be used in Python by creating an instance with the omorfi(filename) constructor. The usage examples in lib/ should be enlightening enough to get started.

On character codings

Omorfi prefers Unicode character set over legacy ASCII and in case of current SFST implementation this also means use of UTF-8 encoding of characters. The Unicode characters to pay attention to are apostrophes and hyphens; the U+2019 RIGHT SINGLE QUOTATION MARK and U+2010 HYPHEN are preferred over legacy 0x27 APOSTROPHE and 0x2D HYPHEN-MINUS, while latter two might occasionally work as well.
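
Input that still uses the legacy ASCII characters can be normalised before analysis. The following Python sketch shows one way to do it; the function name normalise_punctuation is an assumption made for illustration and is not part of omorfi.

```python
# Map legacy ASCII punctuation to the Unicode characters omorfi prefers.
PREFERRED = str.maketrans({
    "\u0027": "\u2019",  # APOSTROPHE -> RIGHT SINGLE QUOTATION MARK
    "\u002d": "\u2010",  # HYPHEN-MINUS -> HYPHEN
})


def normalise_punctuation(text):
    """Replace legacy apostrophes and hyphens with the preferred characters."""
    return text.translate(PREFERRED)


# Encode as UTF-8 before handing the text to SFST tools.
data = normalise_punctuation("linja-auto").encode("utf-8")
```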

Programming and project management

Omorfi rulesets and code are free and libre open source, modifiable and redistributable by anyone. For participation in the project it is recommended to follow the rules common to the majority of free and open source projects, such as the GNU project style guide(http://www.gnu.org/prep/standards/standards.html) and the autobook(http://sources.redhat.com/autobook/) (esp. § 9.1.1), and the instructions in the project’s HACKING(HACKING.html) file.


This document was automatically generated from ../README


Marking inflection in omorfi

This document sketches conjugation and declension in omorfi. It contains descriptions of the word classification used in the lexicon source, the word form names and their symbols in the morphology, and their implementation.

Omorfi is based on finite state transducers, so the morphology is a mapping between a word’s dictionary form with morphological tags and the inflected form of the word. The former is called the analysis level of the transducer and the latter the generation level. Because end applications are often rather dependent on the specific format of the analysis level, I attempt to give an exact specification of that level in this document. The documentation here refers to the full analysis format used in omorfi; for some applications part of the analysis tags may be absent. A typical pair of analysis and generation looks like word<tag><tag><tag>:wordsuffixsuffixsuffix. The full specification of the pairs is:

word(<wb>word)* <pos><number><gradation><inflection>* (<derivation><inflection>*)*

where:

word is any dictionary word form

<wb> is literal <wb>, a word boundary for compounds(compounding.html)

<pos> is part of speech tag, one of <noun>, <verb>, ...

<number> is inflection class number, usually one of <1> through <78>, <99> or <101>

<gradation> is gradation class letter, one of <a> through <m>, <n>, <o>, <p> or <t>

<inflection> is any morphological analysis tag, described in detail later

<derivation> is derivational tag, described in detail in another file called derivation(derivation.html)
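
To make the format concrete, here is a minimal Python sketch that splits an analysis string into its parts. The function name, the returned keys and the example string are assumptions for illustration. The gradation tag is treated as optional, since the noun and particle formats later in this document omit it.

```python
import re

# A token is either a <tag> or a run of plain characters.
TOKEN = re.compile(r"<[^<>]+>|[^<>]+")


def parse_analysis(analysis):
    """Split an analysis string into compound parts and classified tags."""
    words, tags = [], []
    for token in TOKEN.findall(analysis):
        if not token.startswith("<"):
            if tags:
                raise ValueError(f"unexpected text after tags: {token!r}")
            words.append(token)
        elif token == "<wb>" and not tags:
            continue  # word boundary between compound parts
        else:
            tags.append(token[1:-1])
    if not words or not tags:
        raise ValueError("expected at least one word and one tag")
    pos, rest = tags[0], tags[1:]
    # Inflection class is numeric; gradation class is a single letter a-t.
    inflection_class = rest.pop(0) if rest and rest[0].isdigit() else None
    gradation = rest.pop(0) if rest and re.fullmatch(r"[a-t]", rest[0]) else None
    return {
        "words": words,
        "pos": pos,
        "class": inflection_class,
        "gradation": gradation,
        "tags": rest,
    }


# Hypothetical analysis string, not verified against the real transducer.
print(parse_analysis("talo<wb>poika<noun><10><sg><nom>"))
```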

Parts of speech

Parts of speech that are morphologically motivated are tagged in omorfi. Parts of speech tags are four letter symbols. The morphological meaning, i.e. inflections are described later per class.

Symbol Meaning Example
<noun> noun (substantive) talo
<verb> verb kutoa
<adje> adjective kaunis
<nume> numeral satakaksikymmentäkolme
<adve> adverb nopeasti
<intj> interjection raah raah plääh
<part> particle  
<conj> conjunction että jotta koska kun
<pron> pronoun minä

Noun declension

Nouns have 16 cases in singular and plural, each of which can combine with any possessive suffix and any clitic. The total is some thousands of word forms per word. The format of the noun analysis string is:

word(<wb>word)* <noun><1-49><number><case><poss>?<clit>?

Noun classes

The classification numbers of nouns tell phonological variation combinations and case allomorph targets.

Symbol Example
<1> valo
<2> valtio
... ...

Number

Nouns have singular and plural number.

Symbol Meaning Example
<sg> Singular valo
<pl> Plural valot

Case

Nouns decline by case.

Symbol Meaning Example
<nom> Nominative valo
<ptv> Partitive valoa
<acc> Accusative valon
<gen> Genitive valon
<ine> Inessive valossa
<ela> Elative valosta
<ill> Illative valoon
<ade> Adessive valolla
<abl> Ablative valolta
<all> Allative valolle
<ess> Essive valona
<tra> Translative valoksi
<abe> Abessive valotta
<cmt> Comitative valoine
<ins> Instructive valoin

Possessive suffixes

Nouns can take any possessive suffix in any case.

Symbol Meaning Example
<sg1> First person singular valoni
<sg2> Second pers. singular valosi
<sg3> Third person singular valonsa
<pl1> First person plural valomme
<pl2> Second pers. plural valonne
<pl3> Third person plural valonsa
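
The noun analysis format and the tag tables above can be combined into a small Python builder. This is a sketch under assumptions: the function name and its argument order are invented here, compound parts are left out, and the tag sets are copied from the tables in this section.

```python
NUMBERS = {"sg", "pl"}
CASES = {"nom", "ptv", "acc", "gen", "ine", "ela", "ill", "ade",
         "abl", "all", "ess", "tra", "abe", "cmt", "ins"}
POSSESSIVES = {"sg1", "sg2", "sg3", "pl1", "pl2", "pl3"}


def noun_analysis(word, inflection_class, number, case, poss=None, clitic=False):
    """Build a noun analysis string of the form
    word<noun><class><number><case><poss>?<clit>?"""
    if not 1 <= inflection_class <= 49:
        raise ValueError("noun classes run from 1 to 49")
    if number not in NUMBERS or case not in CASES:
        raise ValueError("unknown number or case tag")
    if poss is not None and poss not in POSSESSIVES:
        raise ValueError("unknown possessive tag")
    tags = ["noun", str(inflection_class), number, case]
    if poss is not None:
        tags.append(poss)
    if clitic:
        tags.append("clit")
    return word + "".join(f"<{tag}>" for tag in tags)


print(noun_analysis("valo", 1, "sg", "ine"))
```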

Verb conjugation

Verb conjugation includes two tenses, past and non-past, and three modes besides the indicative: conditional, potential and imperative. These combine with personal forms or the passive form. Participial and infinitival derivation is included for three forms of infinitives and three forms of participles. The format of the verb analysis string for finite verb forms is:

word(<wb>word)* <verb><52-78><genus><mode><tense>?<person><clit>?

or for infinite form:

word(<wb>word)* <verb><52-78><genus><infinite><nominal inflection>

Where nominal inflection has same pattern as specified earlier in noun inflection.

Verb classification

Verb classification describes morphophonological variation in word stem.

Symbol Example
<52> kutoa
<53>  
... ...

Verbal genus

Verbs conjugate in two genera (voices), active and passive.

Symbol Meaning Example
<act> active kudon
<pss> passive kudotaan

Tense

Verbs conjugate in two tenses. Tense only makes sense in indicative mode, for other cases it is ignored.

Symbol Tense Example
<pres> non-past kudon
<past> past kutoi

Mode

Verbs conjugate in four modes (the indicative is unmarked and has two tenses).

Symbol Meaning Example
<indv> indicative kudon
<impv> imperative kudo
<cond> conditional kutoisin
<potn> potential kutonen

Person

Verbs conjugate in six personal forms in active and one in passive.

Symbol Meaning Example
<sg1> First pers. singular kudon
<sg2> 2nd person singular kudot
<sg3> Third pers. singular kutoo
<pl1> First pers. plural kudomme
<pl2> 2nd person plural kudotte
<pl3> Third pers. plural kutovat
<pe4> Fourth person kudotaan
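
The finite verb format given at the start of this section, together with the genus, tense, mode and person tags listed above, can be checked with a regular expression. The following Python sketch is an illustration only; the pattern and function names are assumptions, and the example strings have not been verified against the actual transducer.

```python
import re

FINITE_VERB = re.compile(
    r"(?P<word>[^<>]+(?:<wb>[^<>]+)*)"          # word(<wb>word)*
    r"<verb><(?P<cls>5[2-9]|6[0-9]|7[0-8])>"    # classes 52 through 78
    r"<(?P<genus>act|pss)>"
    r"<(?P<mode>indv|impv|cond|potn)>"
    r"(?:<(?P<tense>pres|past)>)?"              # tense is optional
    r"<(?P<person>sg[123]|pl[123]|pe4)>"
    r"(?P<clit><clit>)?"
)


def parse_finite_verb(analysis):
    """Return the named parts of a finite verb analysis, or None."""
    match = FINITE_VERB.fullmatch(analysis)
    return match.groupdict() if match else None


print(parse_finite_verb("kutoa<verb><52><act><indv><pres><sg1>"))
```

Strings that do not follow the finite verb format, such as noun analyses, yield None.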

Infinite forms

Verbs also have noun derivations called infinitives, three of which are considered fully productive conjugation, one is a derivation and one is a combinational derivation included for historical reasons (it used to be the so-called fifth infinitive). Four productive adjective derivations called participles are also included.

Symbol Meaning Example
<infa> A infinitive kutoa
<infe> E infinitive kutoen
<infma> Ma infinitive kutomassa
<infmaisilla> V infinitive kutomaisillani
<pcpnut> Nut participle kutonut
<pcpva> Va participle kutova
<pcpma> Agent participle kutomani
<pcpneg> Negated participle kutomaton
<Dminen> IV infinitive kutominen

Cases

The case used in the short form of the A infinitive is historically a lative and is marked as such here. Otherwise the lative only exists with adverbs.

Symbol Meaning Example
<lat> lative kutoa

Numerals

Numerals inflect like nouns. Some numerals fit into regular noun inflection classes, but some do not. The numerals with class number less than 50 are mostly same as nouns. Numerals have exceptional compounding rules(compounding.html). Numerals have same analysis string as nouns, with exception of replacing <noun> with <nume>.

Symbol Example
<9> sata
<10> miljoona
... ...

Adverbs and other ad words

Adverbs in common Finnish grammars are used for tons of different things, including stuff that is not ad verb, but ad something completely different. I’ve attempted to classify adverbs against at least language historical morphological features, since majority of adverbs are lexicalised noun forms. Many of the classes are not verified but morphosemantically usable.

One variable morphological feature of adverbs is their partial inflection: habitive adverbs (mainly those in sti, but not all of them) have comparison and clitics, locative adverbs have partial locative cases, possessives and clitics, and temporal adverbs have only clitics. Prolatives and similar (e.g. yli ~ ylitse) may also have only clitics. Many inflected forms of adverbs are further lexicalised into more adverbs (i.e. all forms of one adverb have dictionary entries). Intensifying adverbs might not take clitics at all.

This table is not used yet and is subject to change.

Symbol Meaning Example
<sti> adverb der. nopeasti
<loc> locative kotona
<sep> separative kotoa
<lat> lative luo
<tra> translative luokse
<ine> inessive upoksissa, päissään
<ade> adessive siellä
<abl> ablative täältä
<ill> illative aamutuimaan
<ins> instructive hiljan
<prl> prolative meritse
<tmp> temporal silloin
<cau>? causative siten
<dis> distributive talottain
<tmp><dis> temporal distributive maanantaisin
<sit>? situative nokakkain
<opp>? oppositive nokatusten

Particles, interjections

Particles and interjections do not inflect. Particles have analysis string of form:

word<part><99>

and interjections of form:

word<intj><99>

Inflectional class

Symbol
<99>

Enclitics

Clitics are suffixes which can attach almost anywhere in the ends of words. One exception is verbal enclitic -s, which attaches to a few verb forms or after another enclitic only (e.g. tules, talopas).

Symbol Meaning Example
<clit> clitic particle valoko, valokaan, ...


This document was automatically generated from inflection.rst

Sketch of derivational morphology in omorfi

This table sketches some plans on derivational morphology I’m interested in. It is far from final.

Suffix Base Deriv. Productivity Examples
jA Verb Noun Open tekijä, syöjä
IAinen Noun Noun Closed ? kummajainen
(U)ri Noun Noun Closed ? urkuri, tyhmyri
(i)kkO ??? Noun Limited häirikkö
iO ??? Noun Closed hirviö, olio
(U)s Noun Noun Closed typerys
kAs Noun Noun Closed ? ehdokas, hyväkäs
lAinen Noun Noun Open aikalainen
tAr Noun Noun Open ystävätär
in Verb Noun Open laskin
(U)ri   Noun Limited kaivuri
Verb Noun Open peite, pyyhe
minen Verb Noun Open tekeminen
mA Verb Noun Closed elämä, teelmä
nA Verb Noun Limited kiljuna
nti Verb Noun ~Open syönti
ntA Verb Noun Limited etsintä
nnAinen Verb Noun ... liitännäinen
ntO Verb Noun ... asunto
O Verb Noun   ajo, lento
U Verb Noun   juoksu
uu Verb Noun Closed kaivuu
s Verb Noun ... ihastus (jalas)
Os Verb Noun   ostos, kiitos
Us Verb Noun   kirjoitus
mUs Verb Noun   väsymys
Verb Noun Open aie, katse
i Verb Noun Closed anti, aisti
iO Verb Noun Closed tappio
mO Verb Noun Limited korjaamo
lA Noun Noun Limited kanala
mA Noun Noun Closed reunama
lmA Noun Noun Closed lahdelma
(i)kkO Noun Noun Limited koivikko
(i)stO Noun Noun ~Open kortisto
Ueˣ Noun Noun Closed laivue
isO Noun Noun Closed nuoriso
(i)nen Noun Noun ~Open poikanen
keˣ Noun Noun Limited kieleke
kkA Noun Noun Limited kännykkä
kAinen Noun Noun Limited neitokainen
Us ~ UUs Adjective Noun Open heikkous, pahuus
iO Noun Noun Closed joukkio
lO Noun Noun Closed vartalo
s Noun Noun Closed koiras, eväs
eA Noun Adject. Closed pyöreä, makea
inen Noun Adject. Open rautainen
llinen Noun Adject. Open kannellinen
kAinen Noun Adject. Closed ainokainen
tUinen ? Adject. Closed erityinen
vAinen Verb Adject. Open kuolevainen
immAinen Noun Adject. Closed alimmainen
nnAinen Verb Adject. Closed valinnainen
llOinen Noun Adject. Closed kivulloinen
einen Verb Adject. Open * -selitteinen
iAs Verb Adject. Open puhelias
isA Noun Adject. Limited kalaisa
vA Noun Adject. Closed väkevä, juhlava
lAs Noun Adject. Closed vuolas
tOn Noun Adject. Open taloton
mAtOn Verb Adject. Open tekemätön
hkO Adject. Adject. Open laihahko
Ut Adject. Adject. Closed kevyt
lAinen   Adject. Open suurenlainen
mAinen Noun Adject. Limited akkamainen
mOinen Pronoun Adject. Closed jonkinmoinen
nAinen Noun Adject. Closed täysinäinen
Uinen Adject. Adject. Closed korkuinen
tA-   Verb ?Open päästää, jäätää
A- Noun Verb ?Open lainata
ntA- Noun Verb ?Open todentaa
(i)stA- Noun Verb ?Open avustaa
ttA-   Verb   hengittää
O(i)ttA Noun Verb   hajottaa
UttA- Verb Verb   arveluttaa
(eh)ti- Noun Verb   kahlehtia
ttA- Verb Verb Open teettää
UttA- Verb Verb Open maalauttaa
AhtA- Verb Verb   naurahtaa
Aise- Verb Verb   kysäistä
AltA- Verb Verb Closed painaltaa
ele-, ile- Verb, Noun Verb   hypellä, vastailla
ntele Verb Verb Closed juoksennella
skele Verb Verb Closed lueskella
skentel Verb Verb Closed käyskennellä
elehti Verb Verb Closed hyppelehtiä
ksi Verb Verb Closed lueksia
i   Verb   kukkia
O   Verb   aukoa, arpoa
Oi Noun Verb Open esitelmöidä
itse   Verb   lukita
Oitse Noun Verb   tupakoida
ise Onom. Verb   humista
AjA Onom. Verb   humajaa
ksi Adject. Verb   halveksia
ksU Adject. Verb   paheksua
ne Adject. Verb   pidetä
A Verb Verb   aueta
U Verb Verb   kääntyä
pU   Verb   saapua, juopua
tU   Verb   valikoitua
htU   Verb   menehtyä
UtU   Verb Open aiheutua
VntU   Verb   ikääntyä


This document was automatically generated from derivation.rst

Sketch of compounding in omorfi

This document aims to define rules of compounding possibly used in omorfi. Compounding as defined here is overgenerating, so other methods must be used to reduce the number of compounds if used in e.g. spell checker or somesuch.

First word Second word Compound type example
genitive+ noun noun talonpoika, isänisän...isä
nominative noun noun banaaniovi
compound part noun noun hevostyttö

Numeral compounding is fully productive and known and implemented at arbitrary lengths.


This document was automatically generated from compounding.rst

Module layout for omorfi

module link chart: See attached image:

Omorfi code is split into two main parts: the rulesets and lexicon handling. The rulesets contain rewriting rules, concatenable word parts and affixes, and filters. The lexicon handling modules contain the lexicon reading functionality and the application of the cascade of rulesets. The graphical representation of the module layout is in the file modulechart(modulechart.svg) (the same SVG file as above, but browsers like Firefox may have problems displaying SVG inline(https://bugzilla.mozilla.org/show_bug.cgi?id=276431)). The rules in this version are further split into 6 modules:

plurale-tantum.sfst: finds plurale tantum words from root lexicon and recreates singular forms

find-gradation.sfst: finds and marks gradation on dictionary word.

stubify.sfst: replaces variant parts of dictionary forms with zeros to get invariant stub forms

stemfill.sfst: adds variant parts to stubs to get inflectional stems

inflection.sfst: adds inflectional suffixes and derivational endings to correct stems

phonology.sfst: realises correct forms of morphophonology, i.e. gradation, vowel harmony, assimilations etc.

The lexicon building uses these six modules to get from lexicon forms to a full morphological analyser. A typical lexicon builder consists of a few modules, such as those of the main omorfi lexicon:

omorfi_1.sfst: reads lexica from sfstlex type files, applies above mentioned modules in order and produces first pass inflectional forms.

omorfi_2.sfst: reads participles, comparatives, superlatives and derivations(derivation.html) from the first pass inflectional forms and produces second pass inflectional forms.

omorfi_n.sfst: it is possible to make third...umpteenth pass of derivational morphology by copying output of second pass to input of nth pass.

exceptions.sfst: reads exceptional word forms from exception lexicon and processes them through partial inflection.

compounds.sfst: reads word forms from previous modules and combines them together using some compounding rules(compounding.html).

omorfi.sfst: collects inflectional forms from modules 1, 2, ..., n, exceptions and compounds.
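
The build order implied by the module descriptions above can be sketched as a dependency graph in Python. The graph below is my reading of the prose rather than the project's actual makefile rules, and the dependencies of the exceptions module on individual rulesets are not spelled out in the text, so none are listed.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# The six rulesets, applied in this order inside omorfi_1.
RULESETS = ["plurale-tantum", "find-gradation", "stubify",
            "stemfill", "inflection", "phonology"]

# Each module maps to the modules it needs as input (assumed from the text).
DEPENDS_ON = {
    "omorfi_1": set(RULESETS),
    "omorfi_2": {"omorfi_1"},
    "exceptions": set(),
    "compounds": {"omorfi_1", "omorfi_2"},
    "omorfi": {"omorfi_1", "omorfi_2", "exceptions", "compounds"},
}

# static_order() yields every module after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```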

Some dependencies are not actually compiled, but included in process:

alphabet.sfst: defines all character sets and multicharacter symbols used in the system.

stemparts.sfstlex: SFST style lexicon of variant stem parts

inflection.sfstlex: SFST style lexicon of inflectional(inflection.html) endings

derivations.sfstlex: SFST style lexicon of derivational(derivation.html) endings

exceptions.sfstlex: SFST style lexicon of exceptional wordforms

The guesser, which is basically an open lexicon with word roots replaced by regular expression matching roughly any alphanumeric string, uses similar but simpler module layout:

guesser_1.sfst: generates word root lexicon from regular expressions, applies the rulesets from midway onwards and produces first pass inflectional forms.

guesser_2.sfst: exact parallel of omorfi_2.sfst for open-ended lexicon.

guesser.sfst: exact parallel of omorfi.sfst

The src/ directory may contain some modules not part of the actual process:

kotus-sanalista.sfst: word list reading and dumping, used for testing system with partial word lists

symbols.sfst: alphabet reading and dumping, used for testing external systems who use different alphabet systems, such as OpenFST(http://openfst.org).

The following is documentation of each module; it may not be fully up to date. For exact information it is always advisable to read the source code. The compilation times are here for reference only. They were measured with omorfi 0.2 beta 1 on 31st of March 2008 on my Linux machine with _i686 Intel(R) Celeron(R) M CPU 430 @ 1.73GHz_ (according to uname). Compilation time was measured with bash time(1) and is the reported user time. Compilation was done using SFST-1.2 built on a Gentoo system using g++ -march=pentium4m -O2. Another set of compilation times is given for corona(http://www.csc.fi/english/research/Computing_services/computing/servers/corona), a 96 dual core processor Solaris supercomputer with 384 GB of memory at CSC.

Plurale tantum module

Plurale tantum is a simple module to handle the fact that plurale tantum words are in the dictionary in plural form, whereas my system expects to handle singulars in the actual process, so this module recreates singulars from plurals. In practice it is a cascade of replace rules that remove the plural marker from words in the lexicon that are known to be plural, e.g. t:0 in the context of <noun><1>...<PLT> and so forth.
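
As a rough Python analogue of the t:0 rule, the sketch below strips the plural ending from entries flagged as plurale tantum. The rule table, the function name and the example word housut with class 1 are illustrative assumptions. The real module is a cascade of SFST replace rules covering more classes.

```python
# Hypothetical rule table: inflection class -> (plural ending, replacement).
# Only the class 1 rule t:0 is taken from the text.
PLURAL_RULES = {1: ("t", "")}


def recreate_singular(word, inflection_class, plurale_tantum):
    """Return the singular form for a plurale tantum lexicon entry."""
    if not plurale_tantum or inflection_class not in PLURAL_RULES:
        return word
    ending, replacement = PLURAL_RULES[inflection_class]
    if not word.endswith(ending):
        return word
    return word[: len(word) - len(ending)] + replacement


print(recreate_singular("housut", 1, True))
```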

compilation times:

Mine: 0m26.530s corona: 2:16.4

Find gradation module

The find gradation module finds the gradating stop in the lexicon word based on the given gradation class and inflectional class. The inflectional class determines whether the gradating sound is in weak or strong form, and the gradation class then tells us which sound we are looking for. In the majority of classes it is the rightmost applicable sound that gradates, since radical gradation applies to the ultima syllable of the stem. The marking is a pair of the form t:<~t> and such.

Finding gradation class D in weak form is a bit complex, since the weak form in D is zero, i.e. we need to find the boundary of the ultima syllable of the stem. Also, some special cases are needed for verbs which have a suffixal t after the gradating t.
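
The common case, marking the rightmost applicable sound, can be illustrated in Python as follows. This is a deliberately simplified sketch: the function name is an assumption, and it ignores the inflectional class as well as the zero weak form of class D.

```python
def mark_gradation(word, sound):
    """Mark the rightmost occurrence of `sound` as gradating, e.g. t -> <~t>."""
    index = word.rfind(sound)
    if index == -1:
        return word  # no applicable sound in this word
    return word[:index] + f"<~{sound}>" + word[index + len(sound):]


print(mark_gradation("katu", "t"))
```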

compilation times:

mine: 0m27.250s corona: 2:16.9

Stubify module

Stubify cuts the variant parts from lexicon words to make invariant stubs. In the process of cutting, all potentially important information is stored for further use, e.g. vowel quality for vowel harmony. For example, since the word käsi inflects käsi : käden : kätten : käsissä, we cut si away.

compilation times:

Mine: 5m5.280s corona: 18:13

Stemfill module

Stemfill contains parts of stems to fill the stubs from the previous phase to produce typical inflectional stems. E.g. for the käsi example the stub was kä, so here we produce the nominative stem käsi, singular stem kä<~t>e, consonant stem kät and plural stem käs. This happens by concatenating those four stemparts (from stemparts.sfstlex) against all words and filtering.
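
The stubify and stemfill steps for the käsi example can be traced in a few lines of Python. This is only a sketch of the idea with invented function names. The real modules concatenate the stem parts against all words and filter the result, which is not modelled here.

```python
def stubify(word, variant):
    """Cut the variant part off a lexicon word to get the invariant stub."""
    if not word.endswith(variant):
        raise ValueError(f"{word!r} does not end in {variant!r}")
    return word[: len(word) - len(variant)]


# Stem parts for the käsi example, read off the stems listed in the text.
STEM_PARTS = {
    "nominative": "si",
    "singular": "<~t>e",
    "consonant": "t",
    "plural": "s",
}


def stemfill(stub, parts):
    """Concatenate each stem part to the stub to get inflectional stems."""
    return {name: stub + part for name, part in parts.items()}


stub = stubify("käsi", "si")
stems = stemfill(stub, STEM_PARTS)
print(stub, stems)
```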

compilation times:

Mine: 0m0.590s corona: 1.7

Inflection module

Inflection module has inflectional and derivational suffixes to concatenate to stems made with stemfills in previous module.

compilation times:

Mine: 0m1.920s corona: 0.0

Phonology module

Phonology module contains rules to realise morphophonemes possibly used in other parts of the system, such as vowel harmony, assimilation, consonant gradation and lengthening.
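
Vowel harmony, one of the rules mentioned above, can be sketched in Python as follows. The archiphoneme symbols A, O and U are borrowed from the derivation table in this document and may differ from the symbols used inside phonology.sfst. The function name is an assumption, and compounds and loan words with broken harmony are not handled.

```python
BACK_VOWELS = set("aou")

# Archiphoneme -> (back realisation, front realisation).
ARCHIPHONEMES = {"A": ("a", "ä"), "O": ("o", "ö"), "U": ("u", "y")}


def realise(stem, suffix):
    """Attach a suffix, choosing back or front vowels to match the stem."""
    back = any(char in BACK_VOWELS for char in stem.lower())
    realised = "".join(
        ARCHIPHONEMES[char][0 if back else 1] if char in ARCHIPHONEMES else char
        for char in suffix
    )
    return stem + realised


print(realise("talo", "ssA"), realise("kylä", "ssA"))
```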

compilation time:

Mine: 1m6.410s corona: 4:31

Omorfi first pass module

The first pass module of omorfi reads the lexicon, preprocesses it, then applies in order the modules described earlier: plurale-tantum, find-gradation, stubify, stemfill, inflection and phonology. After that the module contains miscellaneous cleanup and storing of derived forms for the second pass.

compilation time:

Mine: 2m10.860s corona: 11:20.6

Omorfi second pass module

Second pass reads participles, comparatives, superlatives and other derived forms to create new lexicon, and processes it like first pass to create full set of word forms for derived words. Majority of file should be the same as first pass, except for lexicon reading. It is possible to reuse second pass module to make third pass, i.e. derive from derived words, but such derivations are rather uncommon.

compilation time:

Mine: 3m29.020s corona: 29:00.7

Omorfi exceptions module

Exceptions module reads minilexicon of really exceptional wordforms that do not fit nicely with usual processing, and processes them through modified path from that of omorfi first pass, making partial inflection available.

compilation time:

Mine: 0m1.790s corona: 6.2

Omorfi compounds module

Compounding module reads word forms from first and second passes of lexicon building and combines them to make possible compounds.

compilation time:

Mine: 3m20.320s corona: 12:15.0

Omorfi main module

The main module collects words from the first pass, second pass, exceptions pass and compounds into a single transducer. The main use of the omorfi module is post-processing: you can filter or pull out specific things from omorfi without needing to recompile the whole system.

compilation time:

Mine: 1m9.590s corona: 3:58.3

Guesser first pass module

Guesser first pass module creates open lexicon and processes it through the same path as omorfi first pass module would.

compilation time:

Mine: 12m41.850s corona: 44:17.9

Guesser second pass module

Guesser second pass module does exactly as omorfi second pass module, except it uses the guesser’s first pass results as participles, comparatives, superlatives and derivations.

compilation time: 96m56.070s

Guesser main module

Guesser main module collects guesser’s first and second passes similarly as omorfi’s main module.

compilation time: 1m56.500s


This document was automatically generated from modules.rst

Coding style and instructions

This file gives instructions on modifying omorfi for your needs. The module based documentation is in doc/ directory under modules.rst.

Improving and releasing

The toplevel makefile supports the command make check for regression testing. It must be run and must pass before marking a new release. Ideally, all committed versions should pass make check, but for development versions this is not crucial. If make check fails, the transducers are not guaranteed to parse Finnish at all.

The regression test data is in the test/ directory, in the files regression-*.csv. The format is simple: wordform, baseform, analyses, where analyses may be a subset of the actual analyses (see lib/omorfi-test.py for details of the implementation).
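
The subset check can be illustrated with a short Python sketch. It assumes comma-separated fields with every field after the second being one expected analysis. Both the delimiter and the sample line are assumptions, and lib/omorfi-test.py remains the authoritative implementation.

```python
import csv
import io


def read_cases(handle):
    """Yield (wordform, baseform, expected analyses) from a regression file."""
    for row in csv.reader(handle):
        fields = [field.strip() for field in row]
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        wordform, baseform, *analyses = fields
        yield wordform, baseform, set(analyses)


def passes(expected, actual):
    """A case passes when every expected analysis is among the actual ones."""
    return set(expected) <= set(actual)


# Hypothetical sample line, not copied from the real test data.
SAMPLE = "valossa,valo,<noun><1><sg><ine>\n"
cases = list(read_cases(io.StringIO(SAMPLE)))
print(cases)
```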

Extensions to derivational morphology

Add new <D tag to alphabet, add new derivation line to derivations.sfstlex and if necessary, add new rules to support it elsewhere.

Modifying final analysis level

The final analysis level in default transducer is rather verbose, containing word boundaries, parts of speech, inflectional classes, gradation classes and morphological analyses including possible derivational classes. For many applications you only need one or two of these analyses. The solution is to filter them out in omorfi.sfst (or guesser.sfst) with ALPHABET hack, replace rule or hand written replace rule (see below).

Leaving parts out

Some of the derivational rules may be poorly thought out, broken, or harmful for a given application type. One useful but hacky way to leave a derivational type out is to remove the symbol for that derivation from alphabet.sfst. SFST will silently drop all words containing symbols it does not know about.
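
When recompiling is not an option, the verbose analysis level can also be trimmed after the fact. The Python sketch below post-processes analysis strings with regular expressions. This is an alternative to the ALPHABET hack and replace rules described above, not an implementation of them, and the function name and example strings are assumptions.

```python
import re

CLASS_TAG = re.compile(r"<\d+>")        # inflection class numbers
GRADATION_TAG = re.compile(r"<[a-t]>")  # single-letter gradation classes


def simplify(analysis, drop_boundaries=False):
    """Remove class and gradation tags, and optionally word boundaries."""
    result = CLASS_TAG.sub("", analysis)
    result = GRADATION_TAG.sub("", result)
    if drop_boundaries:
        result = result.replace("<wb>", "")
    return result


print(simplify("katu<noun><1><f><sg><gen>"))
```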


This document was automatically generated from ../HACKING

Known bugs, problems and deficiencies

This sketch is only a manually written TODO list for the authors. For up-to-date information you are advised to check the omorfi project bug tracker(https://gna.org/bugs/?group=omorfi).

Nontrivial bugs or issues

  • compounding issues:
  • adjective classification is missing
  • slang words included
  • reflexive pronoun limitations: sinunlaiseni? minunlaisesi?
  • pronouns (even with clitics) are in nominal classification:
kumpikin<16> vs. prototype vanhempi<16>
  • adjectives have possessive inflection (usually should not)
  • nouns have comparative inflection (usually rare)
  • word list contains some defective adverbial roots:
taka<9> only inflects in historic locative cases and does not include nominative as unbound morph
  • class 99 not completely classified or implemented
  • class 101 not implemented
  • Proper names are not implemented
  • Abbreviations are not implemented

Trivial bugs or issues

  • gradation D k : 0: in some vowel contexts the apostrophe variant is incorrectly disallowed.
  • many loan words, neologisms or words of more than three syllables should allow broken vowel harmony
  • some loan words with non-Finnish alphabet might fail unexpectedly
  • verb implementation is immensely stupid: some of the stems (VVENFI, VVPPAS) patch for only one or two inflections with few exceptional forms
  • In class 49 gradation is unmarked for vowel stems. Vowel stems actually belong to class 48. The consonant stems actually belong to class 3x
  • In class 49: hepene : ?hepen : ?heven
  • weather verbs should have defective paradigms: tuulla : ?tuulen
  • verb olla is not regular 67
  • in class 28: gradation is unmarked
  • in class 18: words ending ilmeinen should be in class 38
  • in class 54: pieksää, lypsää should be in 53


This document was automatically generated from ../TODO on 2008-04-26T15:11:11+03:00

Topic revision: r23 - 2010-01-31 - TommiPirinen
 