Processed lexical resources

  • Nykysuomen sanalista: A list of Finnish headwords with inflectional codes. See for the official distribution; a copy is on the Corpus server in the directory /c/appl/ling/koskenni/omorfi/kotus-sanalista_v1/. Besides the list of headwords, it includes a document explaining the codes used in the list.
  • Joukahainen: A GPL-licensed word list used by the Voikko spell checker, among others.
  • Any other material free from copyright and other impeding conditions.

Corpora for discovery of further data

  • A collection of Finnish text corpora (Suomen kielen tekstipankki, kokoelma B, see for more details) that can be used for research purposes and even for building commercial language technology products, provided these do not compromise the copyright of the texts. Lists of the distinct word-forms (types), with the token frequency of each type, are also available. See the directory /l/kielipankki/words/sktp/ on the Corpus server. The relevant file for word-form frequencies is:
    • (Note that the morphological tags, even in the B-licensed FTC data, may not be as freely usable as the words themselves; only the words can be used freely.)
  • Project Gutenberg contains an archive of texts whose copyrights have expired and which are thus free for any use.
  • Wikipedia database dumps provide a huge base of openly licensed (encyclopedic) written texts.
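
For free texts such as those from Project Gutenberg or Wikipedia, a list of word-form types with token frequencies can be derived from plain text. The following is only a minimal sketch: the tokenization rule (runs of letters, optionally joined by an internal hyphen or apostrophe), the lowercasing, and the function names type_frequencies and format_frequency_list are illustrative assumptions, not the method used to build the lists on the Corpus server.

```python
import re
from collections import Counter

# Assumption: a word-form is a run of letters, optionally joined by internal
# hyphens or apostrophes. Real corpora may need a more careful tokenizer.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def type_frequencies(text):
    """Map each lowercased word-form (type) to its number of tokens in text."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))


def format_frequency_list(freqs):
    """Render 'count<TAB>word-form' lines, most frequent first, ties alphabetical."""
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{count}\t{form}" for form, count in ordered)


if __name__ == "__main__":
    # Hypothetical sample text; in practice this would be read from a file.
    sample = "Talo on suuri. Talossa on ovi, ja talo on vanha."
    print(format_frequency_list(type_frequencies(sample)))
```

Lowercasing merges sentence-initial forms with their mid-sentence counterparts, at the cost of also merging proper nouns with common nouns; whether that is desirable depends on what the frequency list is used for.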

