OMorFi: Lexicon design principles

This page is for designing and discussing how the lexical resources should be structured in this project. The lexical resources consist of:

  • word entries as given in the word list of Kotus,
  • information mechanically deduced from the entries in the word list,
  • individual manual corrections of or additions to the word list,
  • data on compounds, derived words, proper names etc. collected from text corpora,
  • data collected from users and visitors through web interfaces,
  • separately compiled grammatical information (such as valencies) of lexical entries, etc.

We thus expect several kinds of information to accumulate as the work proceeds. A specific application will use only some of these components. Combining them all into a single database could be difficult and result in overly complex structures. Therefore, we proceed with the assumption that the lexical resources are kept as separate components that can be linked and combined as needed.

Format and linking of the resources

The word list of Kotus is in XML format, and there are good reasons to use XML formats for the storage and exchange of other lexical resources as well. XSLT 2.0 provides reasonable facilities for transforming and combining XML documents, including support for regular expressions, and freely available implementations such as saxon8 exist for it.

Unique IDs

Combining two or more XML documents is practical if one has unique IDs which mark word entries etc. The base form of the word could serve as the ID, augmented with a (homonym) number if necessary. Such an ID would be transparent for humans and easy to maintain even if the vocabulary grows.
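As a rough illustration of combining two documents by ID, the sketch below marks word list entries that are referenced from a separate file. It uses Python's xml.etree.ElementTree rather than XSLT purely for brevity. The st and s element names follow the word list, but the side file layout (adjectives, adjective, ref) and the pos attribute are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Tiny stand-in for the word list; real entries carry more information.
WORDLIST = """<kotus-sanalista>
  <st xml:id="vanha"><s>vanha</s></st>
  <st xml:id="talo"><s>talo</s></st>
</kotus-sanalista>"""

# Hypothetical side file: one element per lexeme judged to be an adjective.
ADJECTIVES = """<adjectives>
  <adjective ref="vanha"/>
</adjectives>"""


def merge_by_id(wordlist_xml, side_xml, attr_name="pos", attr_value="adjective"):
    """Set attr_name=attr_value on every st entry whose xml:id is referenced."""
    words = ET.fromstring(wordlist_xml)
    side = ET.fromstring(side_xml)
    refs = {element.get("ref") for element in side}
    for entry in words.iter("st"):
        if entry.get(XML_ID) in refs:
            entry.set(attr_name, attr_value)
    return words


merged = merge_by_id(WORDLIST, ADJECTIVES)
print(ET.tostring(merged, encoding="unicode"))
```

Because the link is only the ID, the side file can be regenerated or replaced without touching the original word list.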

One could use such IDs e.g. for resolving the ambiguity between nouns and adjectives in the original word list:

  1. Produce some tentative comparative and/or superlative forms out of such entries.
  2. Check which of those forms occur in a corpus.
  3. Those lexemes having sufficiently many matches are deemed adjectives, and a corresponding XML element is created for them in a separate file.
  4. An XSLT transformation combines the original word list with this separate file and writes a word list with these tentative corrections, to be used in building morphological analyzers.
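Steps 1 to 3 above could be sketched roughly as follows. The form generation is deliberately crude and is an assumption of this example, not the project's actual rules: it ignores consonant gradation and most stem changes, and some candidates can coincide with other inflected forms. The function names and the min_hits threshold are hypothetical.

```python
from collections import Counter


def tentative_forms(base):
    """Return crude comparative/superlative candidates for a base form."""
    forms = {base + "mpi"}            # e.g. iso -> isompi
    if base[-1] in "aäi":
        stem = base[:-1]
        forms.add(stem + "empi")      # e.g. vanha -> vanhempi, suuri -> suurempi
        forms.add(stem + "in")        # e.g. vanha -> vanhin, suuri -> suurin
    else:
        forms.add(base + "in")        # e.g. iso -> isoin
    return forms


def likely_adjectives(bases, corpus_tokens, min_hits=2):
    """Return the base forms whose candidate forms occur at least min_hits times."""
    counts = Counter(corpus_tokens)
    result = []
    for base in bases:
        hits = sum(counts[form] for form in tentative_forms(base))
        if hits >= min_hits:
            result.append(base)
    return result


tokens = "vanhempi vanhin talo talossa".split()
print(likely_adjectives(["vanha", "talo"], tokens))
```

The threshold trades precision for recall. With a real corpus it would have to be tuned, and the resulting list is only a set of tentative corrections.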

Uses of the IDs

One could use the IDs e.g. for:

  • linking parts of compounds to the component lexemes,
  • linking derived lexemes to their base word,
  • writing different senses of a word in a separate file and linking them to the lexeme and inflectional information in the original word list,
  • describing the valency information for verbs in a separate file and linking those entries to the original list where the inflectional information is stored,
  • computing frequencies in a corpus and including them in a morphological analyzer.
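One practical consequence of such linking is that references can be checked mechanically. The sketch below is a minimal illustration, assuming compounds are stored as records listing the IDs of their parts. The Compound class, its field names, and the example IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    ident: str     # ID of the compound itself
    parts: tuple   # IDs of the component lexemes, in order


def dangling_refs(lexeme_ids, compounds):
    """Map each compound ID to those of its parts that are not known lexeme IDs."""
    known = set(lexeme_ids)
    problems = {}
    for compound in compounds:
        missing = [part for part in compound.parts if part not in known]
        if missing:
            problems[compound.ident] = missing
    return problems


lexemes = ["kuusi-1", "kuusi-2", "puu", "metsä"]
compounds = [
    Compound("kuusipuu", ("kuusi-1", "puu")),
    Compound("kuusimetsikkö", ("kuusi-1", "metsikkö")),
]
print(dangling_refs(lexemes, compounds))
```

The same check would apply to derivations, senses, or valency records kept in separate files, since all of them point back to the word list through the same IDs.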

xml:id attributes in the current kotus-sanalista

The current version of the kotus-sanalista scripts in CVS contains xmlattribuutit.xslt, which includes a tentative ID script for words. It adds an xml:id attribute (cf. the W3C xml:id standard) to each word element st in the sanalista. The ID has the form baseform(-homonym number)?, where the homonym number is added when necessary to make the IDs globally unique. Also, because the IDTOKEN production must start with a Letter as defined in the Unicode standard, words whose base form does not start with a Letter are prefixed with an underscore, which the standard explicitly allows for such cases. -- TommiPirinen - 12 Nov 2007
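The ID scheme described above can be approximated in a few lines. This is a sketch under stated assumptions, not a port of xmlattribuutit.xslt: it assumes every member of a homonym group gets a number starting from 1, uses Python's str.isalpha() as a stand-in for the XML Letter class, and does not handle base forms containing other characters that are illegal in an ID (such as spaces). The function name make_ids is hypothetical.

```python
from collections import Counter


def make_ids(base_forms):
    """Return one ID per base form, in input order."""
    totals = Counter(base_forms)   # how many entries share each base form
    seen = Counter()               # running homonym number per base form
    ids = []
    for base in base_forms:
        # Prefix an underscore when the first character is not a letter.
        ident = base if base[0].isalpha() else "_" + base
        # Add a homonym number only when the base form is ambiguous.
        if totals[base] > 1:
            seen[base] += 1
            ident = f"{ident}-{seen[base]}"
        ids.append(ident)
    return ids


print(make_ids(["kuusi", "kuusi", "talo", "-lainen"]))
# ['kuusi-1', 'kuusi-2', 'talo', '_-lainen']
```

A real implementation would also need to make sure that a numbered ID cannot collide with a base form that itself ends in a hyphen and a number.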


-- KimmoKoskenniemi - 15 Jun 2007
Topic revision: r2 - 2007-11-12 - TommiPirinen
 