Omorfi–A sketch for more maintainable lexicon handling and collection

This page contains plans for making lexicon updating and usage more maintainable. For background, the initial implementation of omorfi used the word data given in Nykysuomen sanalista, and contained some more or less nasty hacks to work around shortcomings of the word list for the purposes of a morphological analyser. Some of these are listed in OMorFiKotusSanalista, and a few more in the master’s thesis describing the initial version of omorfi: Pirinen 2008. These methods of updating and extending the word list are perceived as hard to maintain, both when the list is the source lexicon for a morphological analyser and possibly for other tasks as well. For this reason we have sketched a possible solution for more effective handling of word data.

Requirements and ideas or concepts

  • adding data to repository should be relatively easy
  • collecting data from repository to final formats should be relatively easy
  • the word list is based on the needs of a morphophonological lexicon; any data irrelevant to morphophonology is not included
    • for future reference, adding data for other applications should be relatively easy and should not interfere with existing morphophonological data
  • word entries need to have globally unique, persistent ids
    • word lists may therefore be used for needs other than morphophonology by referring to these word entries by their id
  • gathering words from sources with differing paradigm classification and data should be possible, e.g. Nykysuomen sanalista, Joukahainen, guessed data from corpora...

Diagram of word data flow

The intended processing would be along the lines:


nykysuomen-sanalista => |
joukahainen          => |
corpus data          => |
hand written guesses => |
...                  => | => morphological lexical data => | => lexc,
                                                           | => sfst,
                                                           | => etc. lexicon
                                                           | => applications

Or, in more abstract terms:


unprocessed data     => |
+ analysis           => | => abstracted data repository => | => final
                                                           | => collected
                                                           | => data
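
The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entry layout, the function names `analyse_plain_list` and `collect`, the source labels and the tab-separated input are all hypothetical, and the inflection class numbers are illustrative values.

```python
from typing import Callable, Dict, Iterable, List

# A repository entry as a plain dict (hypothetical layout for this sketch).
Entry = Dict[str, str]


def analyse_plain_list(lines: Iterable[str], source: str) -> List[Entry]:
    """Analysis step: turn 'baseform<TAB>inflection class' lines into entries."""
    entries = []
    for line in lines:
        baseform, infclass = line.rstrip("\n").split("\t")
        # The id is derived from the dictionary form, as planned on this page.
        entries.append({"id": baseform, "infclass": infclass, "source": source})
    return entries


def collect(repository: List[Entry], select: Callable[[Entry], bool]) -> List[Entry]:
    """Collection step: pick the entries wanted for one final format."""
    return [entry for entry in repository if select(entry)]


# Several unprocessed sources feed one abstracted repository.
repository: List[Entry] = []
repository += analyse_plain_list(["talo\t1", "katu\t1"], source="word-list-a")
repository += analyse_plain_list(["blogi\t5"], source="guesser")

# One possible collection: everything that did not come from a guesser.
verified = collect(repository, lambda entry: entry["source"] != "guesser")
print([entry["id"] for entry in verified])
```

Keeping analysis and collection as separate steps means a new source only needs a new analysis function, and a new final format only needs a new selection and writer.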

Resources usable for word data

Fixed word lists:

| Word list \ Data | Dictionary words | Parts-of-speech | Inflection marking | Licence |
| Nykysuomen sanalista | ~95000 baseforms | No | 76 inflection classes and 16 gradation | LGPLv2 |
| Joukahainen | ~15000 baseforms | Yes | ~100 inflection classes and 6 gradation | GPLv3 |
| Käänteissanakirja | ~36000 baseforms | Yes | ~80 inflection classes, gradation 1 or 0 | restricted |

We also have the following processes that automatically produce further data for arbitrary baseforms (even a process that supposedly produces guesses with 100 % accuracy still needs to be verified):

| Process | Data | Accuracy description |
| Suffix based guesser | Inflection classification | 99 % for new compounds (cf. Lindén 2007), ~90 % for unseen base words (loans, neologisms, ...) |
| (Other guessers) | ... | ... |

And so forth

Relation between lexical data suggested in LMF formulation and lexc lexicon processing in C++ in HFST lexc

The attached UML chart should help in understanding the similarities between the lexicon data in the XML source format and the internal object model used by the HFST implementation of lexc.

Abstract parts of data format

Since settling on a specific format is not easy, we only list here the fragments of data the format should contain, in an easily digestible form.

Word list

A lexical resource is a word list consisting of 0 to n words.

Each word list may carry metadata; where it is stored is not yet specified.

Word

In a lexical resource for morphology, a word is an abstraction similar to a dictionary word: it holds one meaning, belongs to one inflectional type classification, etc.

Word’s identifier

Each word is identified uniquely by its id. An id is persistent and unique, and is determined by the spelling of the dictionary form. Cases where two morphologically different words share the same dictionary form are resolved later.

In XML production, the unique persistent id is specified by the xml:id attribute, which can be referred to globally by a URL with a fragment identifier. The basic format of the id is just the dictionary form of the word. Where the word’s dictionary form does not match the production for ID tokens in the XML standard, underscores shall substitute the offending characters or parts. The ID uniqueness constraint of XML (a validity constraint rather than a well-formedness one) then guarantees that ids are unique within a document.

Note that, given a persistent URL for each version of the word list, that URL suffixed with the fragment identifier becomes a globally unique reference to the persistent word entry data, which must therefore be stable.
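
As a minimal sketch of the id rule, the following Python derives an id from a dictionary form and builds the URL reference. Several things here are assumptions of the sketch: the character test only approximates the XML name production, a leading underscore is prefixed (rather than substituted) when the form starts with a digit or punctuation, and the function names and the example.org URL are hypothetical.

```python
import re
from urllib.parse import quote


def make_xml_id(dictionary_form: str) -> str:
    """Derive an id from a dictionary form (approximation of an XML name)."""
    # Replace everything except letters, digits, '_', '.' and '-'.
    candidate = re.sub(r"[^\w.\-]", "_", dictionary_form)
    # A name must start with a letter or underscore; prefix one if needed.
    if not candidate or not (candidate[0].isalpha() or candidate[0] == "_"):
        candidate = "_" + candidate
    return candidate


def entry_url(versioned_list_url: str, word_id: str) -> str:
    """Reference one entry in one version of the word list."""
    # Percent-encode the id so non-ASCII letters survive in the fragment.
    return f"{versioned_list_url}#{quote(word_id)}"


print(make_xml_id("à la carte"))
print(make_xml_id("3D-elokuva"))
print(entry_url("http://example.org/wordlist-v1.xml", make_xml_id("talo")))
```

Substitution can map two different dictionary forms to the same id (for example a form with a space and the same form with an underscore), so uniqueness still has to be checked when ids are assigned.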

Part-of-speech

Morphological resources usually assign a part of speech to each word. The values should be rather obvious: noun, adjective, verb, particle, adverb and so forth. There may be some unclear cases, though.

Inflectional classifications

An inflectional classification is an arbitrary classification that assigns inflectional patterns to each word as necessary. Here, the inflectional and gradation classes are based on the Nykysuomen sanalista classification where available. For gradation, a boolean value is the bare minimum, set where the ultima syllable starts with a stop or a potential weak grade.

Other attributes

Most likely the plurality of the dictionary form for plurale tantum words and the exceptional pronunciation of citation loans need to be marked up.

There are certain morphological limitations that cannot be derived from the traditional part-of-speech classification: adverbs may take some possessives and clitics, certain verbs do not allow all personal forms, certain adjectives do not compare, certain nouns do compare, etc.

The possibility of further attributes and elements should be kept in mind.

What constitutes a word entry?

In Nykysuomen sanalista there are cases where two words with the same dictionary form and the same inflectional data have two entries because the word has two senses with presumably separate origins; this is not a useful separation for omorfi. Two words with the same baseform but different inflectional data, however, should constitute separate entries, which is not always the case in Nykysuomen sanalista. The case where one entry has two inflections as a result of shortcomings in the classification system used in Nykysuomen sanalista is problematic.

Furthermore, the format used in Joukahainen, where one word has multiple forms, may not be optimal.

To begin with, the data repository here uses the dictionary form as the sole basis for one entry. The cases where two separate dictionary words have the same dictionary form in orthography are left unspecified and handled separately where needed (luckily, Finnish has only a handful of real collisions).
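
One way to picture this entry definition is the sketch below, which keys entries by dictionary form and flags forms that end up with more than one inflection so they can be handled separately. The `WordEntry` class, the `add_record` and `collisions` helpers, the source labels and the class numbers are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class WordEntry:
    dictionary_form: str
    pos: Set[str] = field(default_factory=set)
    # Each inflection is an (inflection class, gradation class) pair.
    inflections: Set[Tuple[str, str]] = field(default_factory=set)
    sources: Set[str] = field(default_factory=set)


def add_record(repo: Dict[str, WordEntry], dictionary_form: str, source: str,
               infclass: str, gradation: str = "",
               pos: Optional[str] = None) -> None:
    """Merge one source record into the entry for its dictionary form."""
    entry = repo.setdefault(dictionary_form, WordEntry(dictionary_form))
    entry.inflections.add((infclass, gradation))
    entry.sources.add(source)
    if pos is not None:
        entry.pos.add(pos)


def collisions(repo: Dict[str, WordEntry]) -> List[str]:
    """Dictionary forms with several inflections, to be resolved by hand."""
    return sorted(form for form, e in repo.items() if len(e.inflections) > 1)


repo: Dict[str, WordEntry] = {}
# Two senses with identical inflection collapse into one entry.
add_record(repo, "talo", "list-a", infclass="1")
add_record(repo, "talo", "list-b", infclass="1", pos="noun")
# Same dictionary form, different inflection: flagged as a collision.
add_record(repo, "kuusi", "list-a", infclass="24")
add_record(repo, "kuusi", "list-a", infclass="27")
print(collisions(repo))
```

Using sets means repeated records from different sources do not create duplicate data, while the `sources` field still records where each entry came from.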

Concrete file formats

The source formats may be anything; there always needs to be an analysis process that brings the source data into the data repository. The analysis needs to know, among other things, how to assign ids to the data. The analysed data is organised so that it is easy to pick all sorts of collections from it. The final formats are whatever different people have found useful.

XML Formats used for data representation

This is an overview of current practices.

| Structural element | kotus-sanalista | joukahainen |
| Wordlist | kotus-sanalista | wordlist @xml:lang |
| Word entry | st | word @id |
| Word id | N/A | @id "w + number" |
| Part of speech | Not explicit (tn: Nominal, verb, personal pronoun, other) | wclass: adjective, noun, pnoun_(firstname, lastname, place, misc), verb |
| Inflectional classification | tn 1 through 78, 99 or 101 × av ?@valinnainen A through M | infclass aavistaa through vuotaa × av1 through av6 |
| word form | s | form |
| other | hn (homonymity), t @taivutus (commonness) | usage, info, style, application, infclass @type, ... |
| word list : entry | 1 : N | 1 : N |
| entry : id | N/A | 1 : 1 |
| entry : pos | 1 : N | 1 : N |
| entry : inflection | 1 : N | 1 : N |
| entry : word form | 1 : 1 | 1 : N |
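
To make the kotus-sanalista column concrete, here is a small Python sketch that reads entries using the element names from the table (st, s, t, tn, av). The sample document is constructed for this example, and the nesting of tn and av inside t is an assumption of the sketch rather than something stated on this page.

```python
import xml.etree.ElementTree as ET

# Constructed sample; the nesting of <tn> and <av> inside <t> is assumed.
SAMPLE = """<kotus-sanalista>
<st><s>talo</s><t><tn>1</tn></t></st>
<st><s>katu</s><t><tn>1</tn><av>F</av></t></st>
</kotus-sanalista>"""


def read_kotus(xml_text):
    """Yield one dict per <st> entry: its word form and its inflections."""
    root = ET.fromstring(xml_text)
    for st in root.iter("st"):
        # entry : word form is 1 : 1, entry : inflection is 1 : N.
        inflections = [
            (t.findtext("tn"), t.findtext("av") or "")
            for t in st.iter("t")
        ]
        yield {"form": st.findtext("s"), "inflections": inflections}


entries = list(read_kotus(SAMPLE))
for entry in entries:
    print(entry["form"], entry["inflections"])
```

An entry without an av element gets an empty gradation value here, which keeps every inflection a uniform (class, gradation) pair.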

Lexc format

Lexc is a historical format for representing morphotactics, and is one of the main result formats here. To get lexc format dictionaries from the XML sources, some preprocessing is required to implement morphophonemics and further morphotactics. This preprocessing is currently done in Python. The Python implementation does not allow reading big XML files for processing, so an intermediate format is required. CSV format is used to implement the lexc conversion in omorfi.
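
The CSV step could look roughly like the sketch below. It is not the omorfi implementation: the column names (form, pos, infclass), the continuation lexicon naming scheme and the set of characters escaped with % are all assumptions made for illustration, and the escape set is not claimed to be exhaustive.

```python
import csv
import io

# Characters escaped with '%' in this sketch (assumed, not exhaustive).
LEXC_SPECIALS = set(' :;!<>0%"')


def lexc_escape(form):
    """Prefix characters that are special in lexc with '%'."""
    return "".join("%" + ch if ch in LEXC_SPECIALS else ch for ch in form)


def csv_to_lexc(csv_file, lexicon_name="Root"):
    """Write one lexc entry line per CSV row under a single lexicon."""
    lines = [f"LEXICON {lexicon_name}"]
    for row in csv.DictReader(csv_file):
        # Hypothetical continuation lexicon name: POS tag + inflection class.
        continuation = f"{row['pos']}_{row['infclass']}"
        lines.append(f"{lexc_escape(row['form'])} {continuation} ;")
    return "\n".join(lines) + "\n"


SAMPLE_CSV = "form,pos,infclass\ntalo,N,1\nà la carte,ADV,99\n"
print(csv_to_lexc(io.StringIO(SAMPLE_CSV)))
```

The output only contains the root lexicon; the continuation lexicons it points to (the morphophonemics and morphotactics mentioned above) would have to be defined elsewhere before the file compiles.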

References

Pirinen, 2008. Suomen kielen automaattinen morfologinen jäsennin avoimen lähdekoodin menetelmin. Master’s thesis (in Finnish; available as PDF: http://www.helsinki.fi/~tapirine/gradu/Pirinen2008.pdf).

-- TommiPirinen

Topic attachments

lexical-resource.svg (SVG, 155.4 K, 2008-12-15): UML chart of HFST LEXC and LMF
Topic revision: r9 - 2010-04-19 - TommiPirinen
 