MT Challenge Resources

T-61.6090 Special Course in Language Technology

Finnish-Swedish Machine Translation Challenge, Autumn 2006, 7 cr



Friday 22 September, 09-16, in room T2


Multilingual corpora

EUROPARL

A Multilingual Corpus for Evaluation of Machine Translation

The JRC-Acquis Multilingual Parallel Corpus

Before joining the European Union (EU), the new Member States (NMS) needed to translate and approve the existing EU legislation, consisting of selected texts written between the 1950s and 2005. This body of legislative text, which consists of approximately eight thousand documents and covers a variety of domains, is called the Acquis Communautaire (AC). As there were 20 official EU languages at the beginning of 2005, the AC exists as a parallel text (a text and its translations) in 20 languages. The languages are Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Slovak, Slovene and Swedish. The EU Candidate Countries Croatia, Romania and Bulgaria have started translating the AC, so some of the documents are available in these languages as well. However, the texts in these languages are not currently part of the distribution.

LE-PAROLE

"LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages."

"The task of the LE-PAROLE project is to collect about 20 million words of machine-readable text material in a standardised format in EU languages, together with a preliminary machine-readable dictionary. The languages involved in the project are English, French, Spanish, Dutch, Irish, Italian, Catalan, Greek, Norwegian, Portuguese, Swedish, Danish, German and Finnish. The Research Institute for the Languages of Finland (Kotimaisten kielten tutkimuskeskus) also takes part in the project."

SFPC - Swedish-Finnish Parallel Corpus (in progress)

"The aim of this project is to create a Swedish-Finnish, Finnish-Swedish parallel corpus aligned at sentence level at the minimum and tagged with morphosyntactic information at the very minimum. The project is currently in the planning phase. If it is launched, it will be a joint effort between CSC and the Swedish department at the Research Centre for the Languages of Finland."

Parallel Corpora in Uppsala

"This page contains an overview of the parallel corpora projects in the Language Engineering (Språkteknologi) group of the Linguistic department of the University of Uppsala in Sweden."

Oslo Multilingual Corpus OMC

"The Oslo Multilingual Corpus (OMC) is a collection of text corpora comprising original texts and translations from several languages. The various sub-corpora differ in that they contain a different number of languages or a different combination of languages."

"The OMC provides unique research material for use in contrastive studies and translation studies, as well as in theoretical and applied linguistics."

PAROLE SIMPLE

"The goal of the SIMPLE project is to add semantic information, selected for its relevance for LE applications, to the set of harmonised multifunctional lexica built for 12 European languages by the PAROLE consortium. PAROLE+SIMPLE lexicons contain morphological, syntactic and semantic information, organised according to a common model and to common linguistic specifications. PAROLE+SIMPLE will be available in the public domain to the LE R&D community at large and will allow the construction of a new generation of applications, services and products."

FINLEX

"The database contains an index of foreign-language translations of Finnish statutes and the translated texts of about 300 statutes."

European Constitution

The full text of the constitution in different languages


Monolingual corpora

FISK corpus

Finland-Swedish corpus

Oulu Corpus (Oulun korpus)

"The Oulu Corpus is an electronic research resource of 1960s standard Finnish, compiled under the direction of Prof. Pauli Saukkonen of the University of Oulu. The material was later converted to SGML format at the Research Institute for the Languages of Finland in 1997. The material was drawn from 1960s standard Finnish, generally as samples five sentences long."


Other

Project Runeberg

"A volunteer effort to create free electronic editions of classic Nordic (Scandinavian) literature and make them openly available over the Internet."

Project Gutenberg

"Project Gutenberg is the oldest producer of free ebooks on the Internet. Our collection was produced by hundreds of volunteers."

Text collections of Kielipankki (the Language Bank of Finland)

"Kielipankki contains 97 Finnish-language subcollection(s). Of the documents in Kielipankki, 645274 (71 %) are in Finnish. This corresponds to 179556341 (73 %) words. (In total, Kielipankki contains 897331 documents. The total word count is 245640269.)"

LEXIN

"Lexin is primarily produced to meet the need of immigrant education. Lexin currently consists of about 30000 words. It offers translation in both directions, Finnish-Swedish and Swedish-Finnish. A mix of lexicon and dictionary."

SWECG and FINTWOL

Language technology software in Kielipankki

PARALLEL CORPORA

"Some projects associated with ParaConc are described here."

IMS Corpus Workbench

"The easy-to-use and versatile IMS Corpus Workbench software for corpus work has been licensed for the cedar.csc.fi server. The software may be used for academic research only."

Links collected by CSC

"On this page we collect links to freely usable language resources around the world and to language technology applications that are freely available and appear usable."

GIZA++: Training of statistical translation models.

"GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och."

Moses: Decoder for statistical machine translation system

Moses is a factored phrase-based beam-search decoder for statistical machine translation.
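
GIZA++, listed above, trains word-alignment models of the IBM family. As a rough illustration only, and not GIZA++'s own code or interface, the Python sketch below runs EM training for the simplest such model (IBM Model 1, here without the NULL word) on a toy Finnish-Swedish corpus. The function name `train_ibm_model1` and the three sentence pairs are made up for this example.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word)."""
    # Start from a uniform distribution over the target vocabulary.
    target_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) links
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in sentence_pairs:
            for tw in tgt:
                # E-step: spread each target word over the source words
                # of the same sentence pair, in proportion to t.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


# Toy sentence-aligned corpus: (Finnish source, Swedish target), hypothetical data.
pairs = [
    ("talo".split(), "hus".split()),
    ("talo ja kirja".split(), "hus och bok".split()),
    ("kirja".split(), "bok".split()),
]

model = train_ibm_model1(pairs, iterations=10)
for fi, sv in [("talo", "hus"), ("ja", "och"), ("kirja", "bok")]:
    print(f"t({sv} | {fi}) = {model[(sv, fi)]:.3f}")
```

On this toy data the probabilities for talo/hus, ja/och and kirja/bok grow with each iteration, because the one-word sentence pairs disambiguate the three-word pair. Real toolkits add further models, smoothing and efficient data structures on top of this idea.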

FILT products

-- KimmoKoskenniemi - 27 Sep 2006

Topic revision: r3 - 2006-09-28 - JaakkoVayrynen
 