OMorFi: A list of resources

Processed lexical resources

  • Nykysuomen sanalista: A list of Finnish headwords with inflectional codes. The official distribution is available separately, and a copy of it is on the Corpus server in the directory /c/appl/ling/koskenni/omorfi/kotus-sanalista_v1/. In addition to the list of headwords, there is a document explaining the codes used in the list.
  • Joukahainen: A GPL-licensed word list used by the Voikko spell checker, among others.
  • Any other material free from copyright and other impeding conditions.

Corpora for discovery of further data

  • A collection of Finnish-language text corpora (Suomen kielen tekstipankki, kokoelma B; in English, Finnish Text Collection, collection B), which can be used for research purposes and even for building commercial language technology products, provided these do not compromise the copyright of the texts. Lists of distinct word-forms (types) are also available, with the token frequency of each type. See the directory /l/kielipankki/words/sktp/ on the Corpus server. The relevant file for word-form frequencies is:
    -rw-r--r--  1 ling sktp-b 81188643 18. touko   2006
    • (Note that the morphological tags, even in the B-licensed FTC data, may not be as freely usable as the words themselves; that is, only the words can be used freely.)
  • Project Gutenberg contains an archive of texts whose copyrights have expired and which are thus free for any use.
  • Wikipedia database dumps provide a huge base of openly licensed (encyclopedic) written texts.
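
As a rough illustration of how a word-form frequency list like the one mentioned above could be used to discover further data, the following Python sketch lists frequent word-forms that are missing from a set of known forms. The line format ("<count> <word-form>", whitespace-separated), the function names, and the sample data are all assumptions made for this example; the actual file on the Corpus server may be laid out differently.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> Counter:
    """Parse lines of the ASSUMED form '<count> <word-form>' into a Counter."""
    freqs: Counter = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, word_form = parts
        freqs[word_form] += int(count)
    return freqs


def frequent_unknown(freqs: Counter, known: Set[str],
                     min_count: int = 2) -> List[Tuple[str, int]]:
    """Return (word_form, count) pairs absent from `known`, most frequent first."""
    candidates = [(w, c) for w, c in freqs.items()
                  if w.lower() not in known and c >= min_count]
    return sorted(candidates, key=lambda wc: (-wc[1], wc[0]))


# Hypothetical sample data standing in for the real frequency file.
sample = ["120 talo", "45 talossa", "7 kissa", "1 xyzzy", "malformed line here"]
freqs = parse_frequency_lines(sample)
known = {"talo", "kissa"}  # e.g. forms already covered by the lexicon
result = frequent_unknown(freqs, known)
print(result)
```

For a real file, the lines could be read with open(path, encoding=...) once the actual encoding and column order have been checked; the min_count threshold is only a simple way to filter out one-off noise.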

Topic revision: r5 - 2009-12-08 - TommiPirinen