Department of Digital Humanities
P.O. Box 24 (Unioninkatu 40)
00014 University of Helsinki

fin-clarin (at) helsinki.fi


FIN-CLARIN Site Description: University of Tampere

  • Contact person:
Jaana Kekäläinen, (jaana.kekalainen at uta.fi) http://www.uta.fi/~lijakr/JK.html

The departments and other parties involved in collecting, producing or using language resources are listed below. For each, the main resources and activities are listed with

  • an informative name,
  • a short description of what the material is,
  • the size of the resource (in recorded hours, word tokens, source program lines or other measures),
  • how much labour (in person months) has been invested at the site for collecting, producing or improving the resource,
  • the contact person and contact information,
  • a link to further information about the resource (please enter more detailed entries in MaterialEn).

School of Information Sciences, Department of Information Studies

  1. Corpora of Newspaper Texts
    • Machine-readable corpora of newspaper texts in Finnish, Swedish and English, with search requests and relevance information used in information retrieval evaluation.
    • About 142.2 million (Finnish), 42.5 million (Swedish) and 251 million (English) word tokens; 1088 MB, 281 MB and 1530 MB respectively.
    • Contact person: Heikki Keskustalo (at) uta.fi
  2. UTA Cross-Language Information Retrieval System (UTACLIR)
    • UTACLIR is a Java program that translates information retrieval queries from a source language to a target language. It relies on external resources, for example a translation dictionary and a source-language lemmatizer.
    • Size of the source code: 734 lines.
    • Contact person: Heikki Keskustalo (at) uta.fi
  3. Skipgram Tools
    • The Skipgram tools consist of two separate C programs. The first generates a file structure for approximate string matching on the basis of a given word list. The second uses that file structure to give the best s-grams for a given input word.
    • Size of the source code: 2472 lines.
    • Contact person: Heikki Keskustalo (at) uta.fi
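As an illustration of the kind of pipeline described for UTACLIR above (lemmatize each source-language query word, then look it up in a translation dictionary), here is a minimal Python sketch. It is not UTACLIR's code, which is written in Java and is not reproduced on this page. The tiny `LEMMAS` and `DICTIONARY` tables, the function name `translate_query`, and the choice to pass untranslatable words through unchanged are all assumptions made for the example.

```python
# Hypothetical toy resources; a real system would use a full lemmatizer
# and a full translation dictionary.
LEMMAS = {"autojen": "auto", "hinnat": "hinta"}          # inflected form -> lemma
DICTIONARY = {"auto": ["car", "automobile"], "hinta": ["price"]}  # lemma -> translations


def translate_query(query, lemmas, dictionary):
    """Translate a query word by word.

    Returns one list of target-language alternatives per query word.
    Words missing from the lemma table are used as they are, and lemmas
    missing from the dictionary are passed through untranslated.
    """
    translated = []
    for token in query.lower().split():
        lemma = lemmas.get(token, token)
        translated.append(dictionary.get(lemma, [lemma]))
    return translated


if __name__ == "__main__":
    print(translate_query("Autojen hinnat Tampere", LEMMAS, DICTIONARY))
```

Keeping all dictionary alternatives for each word, rather than picking one, leaves the choice of how to combine them to the retrieval system that receives the translated query.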
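The two-step design described for the Skipgram tools above (build a structure from a word list, then match an input word against it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actual tools are C programs whose file structure and scoring are not described here. The sketch takes s-grams to be character pairs separated by a fixed number of skipped characters, uses an in-memory index instead of a file, and ranks candidates by Jaccard similarity. The names `sgrams`, `build_index` and `best_matches` are hypothetical.

```python
from collections import defaultdict


def sgrams(word, skips=(0, 1)):
    """Return the set of (skip, character pair) s-grams of a word."""
    grams = set()
    for k in skips:
        for i in range(len(word) - k - 1):
            grams.add((k, word[i] + word[i + k + 1]))
    return grams


def build_index(words):
    """Map each s-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in sgrams(word):
            index[gram].add(word)
    return index


def best_matches(query, index, top_n=3):
    """Rank indexed words by Jaccard similarity of their s-gram sets."""
    query_grams = sgrams(query)
    shared = defaultdict(int)
    for gram in query_grams:
        for word in index.get(gram, ()):
            shared[word] += 1
    scored = []
    for word, common in shared.items():
        union = len(query_grams) + len(sgrams(word)) - common
        scored.append((word, common / union))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


if __name__ == "__main__":
    index = build_index(["helsinki", "helsingfors", "tampere", "tammerfors"])
    print(best_matches("helsinky", index))
```

Building the index once and reusing it for many input words mirrors the split into two programs: the costly pass over the word list is done a single time.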

School of Language, Translation and Literary Studies

Contact person: Mikhail Mikhailov, Professor (Act.), mikhail.mikhailov (at) uta.fi; see http://www.uta.fi/~mikhail.mikhailov/

All corpora are located at https://mustikka.uta.fi/corpora (registration required).

  1. Russian-Finnish parallel corpus of literary texts (ParRus)
    • About 5,000,000 word tokens.
    • Contact person: Mikhail Mikhailov (at) uta.fi
  2. Multilingual corpus of juridical texts (MulJur)
    • About 1,200,000 word tokens.
    • Contact person: Mikhail Mikhailov (at) uta.fi
  3. Comparable Russian-Finnish corpus of juridical texts (FinRusLex)
    • About 2,000,000 word tokens.
    • Contact person: Mikhail Mikhailov (at) uta.fi

School of Information Sciences, Department of Computer Sciences, Tampere Unit for Computer-Human Interaction

Contact person: Markku Turunen, Adjunct Professor

  1. Resource collection of human-computer dialogues
    • Corpora from various spoken dialogue applications (e.g., timetable systems) in English and Finnish, collected both in laboratory experiments and in real usage.
    • Thousands of dialogues (the number depends on the application).
    • Contact person: Markku Turunen (mturunen at cs.uta.fi)
    • Further information: http://www.cs.uta.fi/hci/spi


For the purpose of transferring resource data from the KitWiki to the CLARIN ad hoc inventory, the data is gathered here in TWiki forms. Create a new topic with the following tools, and fill in the data. You may add extra information on the page, but only the data in the form will be transferred to the ad hoc inventory.

Note: Some of the forms contain date or year fields. Take care to enter these in the correct format (2008-12-01 or 1 Dec 2008) or use the calendar next to the field. Give years with four digits, e.g. 2008.
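If you prepare form data in bulk, the two date formats and the four-digit year rule can be checked before entering them. The Python sketch below is only an illustration: how the TWiki forms themselves validate input is not described here, and the function names `is_valid_date` and `is_valid_year` are hypothetical.

```python
import re
from datetime import datetime

# The two date formats mentioned above: 2008-12-01 and 1 Dec 2008.
# Note: %b matches month abbreviations of the current locale; English
# abbreviations such as "Dec" assume the default C or an English locale.
DATE_FORMATS = ("%Y-%m-%d", "%d %b %Y")


def is_valid_date(text):
    """Return True if text parses as a date in one of the accepted formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def is_valid_year(text):
    """Return True if text is exactly four digits, e.g. 2008."""
    return re.fullmatch(r"\d{4}", text.strip()) is not None


if __name__ == "__main__":
    for value in ("2008-12-01", "1 Dec 2008", "12/01/2008"):
        print(value, is_valid_date(value))
    for value in ("2008", "08"):
        print(value, is_valid_year(value))
```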

Add a Multimodal Corpus, Spoken Corpus, Written Corpus, Aligned Corpus, Treebank, or N-gram Model:

Add a Terminological Resource, Lexicon / Knowledge Source:

Add a Web Service:

Add a Grammar or any other resource:

The following resources have been added on this site: