Finnish-Swedish MT Challenge 2006: Second Group

The original plan (state-of-the-art)

The first idea was to do everything ourselves. So, we started to implement a state-of-the-art hybrid solution, shown in the picture below.

Big_picture.png

At the beginning of the process we clean the data (convert all numbers to the number 2, remove strange characters, remove extra spaces, etc.). The cleaned file is streamed into the POS extractor (fdg) and to humans, who extract simple and generic first-level rules. The outcome of the POS extraction is called the tagged corpus. This corpus is sent to the syntactic tree parser, which extracts syntactic trees, and to the POS model generators. The syntactic trees are used as input for the humans who extract second-level rules. All these outcomes (first- and second-level rules, POS model output) are used as features and as input to the main part of the system (the ST Mapping model).
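To make the cleaning step concrete, here is a minimal Python sketch. The exact character whitelist and the use of "2" as the number token are our assumptions based on the description above, not the actual implementation:

import re

def clean_line(line):
    # Rough sketch of the cleaning step: normalize numbers, drop odd characters, squeeze spaces.
    line = re.sub(r"\d+(?:[.,]\d+)?", "2", line)        # replace every number with the number 2
    line = re.sub(r"[^a-zåäöA-ZÅÄÖ2 .,%-]", " ", line)  # drop characters outside an assumed whitelist
    line = re.sub(r"\s+", " ", line).strip()            # collapse extra whitespace
    return line.lower()

print(clean_line("Aktia  nostaa prime-korkoaan 4,25 prosenttiyksiköstä!"))
# -> "aktia nostaa prime-korkoaan 2 prosenttiyksiköstä"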

We wanted to give the system some simple rules that would cover some of the structural differences between the two languages. Example of a rule handling the partitive case using the fdg tags:

Haluan kahvia. -> Jag vill ha kaffe.

[1, N, SG, PTV] -> [1, N, SG, ACC]

Example of a genitive construction:

[1, N, SG, GEN] ["kanssa", 2, PSP] -> ["med", 2, PRE] [1, N, SG, NOM]
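As a small illustration of how such tag rewrite rules could be applied, the Python sketch below matches a source tag pattern and produces the target pattern. The rule data structure and the apply_rule helper are hypothetical, not the format the system itself uses:

# Hypothetical sketch: a rule maps a sequence of (lemma pattern, tag set) pairs on the
# Finnish side to a target sequence that refers back to source tokens by index.
PARTITIVE_RULE = {
    "match":  [("*", {"N", "SG", "PTV"})],   # [1, N, SG, PTV]
    "output": [(1,   {"N", "SG", "ACC"})],   # -> [1, N, SG, ACC]
}

def apply_rule(rule, tokens):
    # tokens: list of (lemma, tag set) pairs; returns rewritten tokens or None if no match
    if len(tokens) != len(rule["match"]):
        return None
    for (lemma_pat, tags), (lemma, token_tags) in zip(rule["match"], tokens):
        if lemma_pat not in ("*", lemma) or not tags <= token_tags:
            return None
    # integers on the output side copy the lemma of the referenced source token
    return [(tokens[ref - 1][0] if isinstance(ref, int) else ref, tags)
            for ref, tags in rule["output"]]

print(apply_rule(PARTITIVE_RULE, [("kahvi", {"N", "SG", "PTV"})]))
# -> [("kahvi", {"N", "SG", "ACC"})]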

The lowest part of the figure shows our separate tags-only model, built from the aligned data, which learns the order of tags in the foreign (here Swedish) language. This will (hopefully) give some extra boost to the features explained earlier.
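One way to think of such a tags-only model is as an n-gram model over tag sequences on the Swedish side. The snippet below is only our own toy illustration (an unsmoothed bigram count model), not the component shown in the figure:

from collections import defaultdict

def train_tag_bigrams(tag_sentences):
    # tag_sentences: list of tag sequences, e.g. [["N-SG-NOM", "V-PAST", "PREP"], ...]
    counts = defaultdict(lambda: defaultdict(int))
    for tags in tag_sentences:
        for prev, cur in zip(["<s>"] + tags, tags + ["</s>"]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

counts = train_tag_bigrams([["N-SG-NOM", "V-PAST", "PREP", "NUM-NOM"],
                            ["N-SG-NOM", "V-PRES"]])
print(bigram_prob(counts, "N-SG-NOM", "V-PAST"))   # 0.5 in this toy example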

We also did "some" experiments with MTTK. MTTK is "a collection of software tools for the alignment of parallel text for use in Statistical Machine Translation", and it seems to be a pretty decent toolkit (after you get it up and running). For more info please check http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/.

The backup plan (which works and was used for our submission)

Unfortunately it took a long time before we accepted that this task is not so easy. Therefore we did not have much time to play around with the backup plan. But luckily we got one!

The backup plan and the system used to generate our submission are as follows:

Backup_plan.png

Data Set Used

We used the already partially cleaned data set containing 122754 lines of aligned Finnish and Swedish text. The Finnish text contained 1821798 words, the Swedish one 2402781 words.

Here are some example lines from the Finnish and the Swedish sets:

<s> pörssitiedote </s>
<s> vapaasti julkaistavissa </s>
<s> <num> klo <num> </s>
<s> aktia nostaa prime-korkoaan <num> prosenttiyksiköstä <num> prosenttiyksikköön </s>

<s> börsmeddelande </s>
<s> fritt för publicering </s>
<s> <num> kl </s>
<s> aktia höjer sin prime-ränta från <num> till <num> procentenheter </s>

How to fdg

Use fi-fdg and sv-fdg. NOTE!!! Use corpus3.csc.fi or you may hit your head against the wall as I did. Sample output is as follows:

Finnish

Item Word Base form Output
1 Bensiinin bensiini attr:>2 &A> N SG GEN
2 myynnin myynti attr:>3 &A> N SG GEN
3 kasvu kasvu subj:>4 &NH N SG NOM
4 oli olla main:>0 &+MV V ACT IND PAST SG3
5 2,5 2,5 qn:>6 &QN> NUM CARD
6 % % &NH N
7 . .

Swedish

Item Word Base form Output
1 Bensinförsäljningen bensin#försäljning subj:>2 %NH N SG NOM
2 ökade öka main:>0 %MV V PAST
3 med med advl:>2 %AH PREP
4 2,5 2,5 attr:>5 %>N NUM NOM
5 % % pcomp:>
6 . .

Preprocessor

The original cleaned data was aligned by sentences (one line, one sentence), so the Finnish data set contained the same number of lines as the Swedish one. However, after extracting the word tags using fdg and reconstructing the sentences, the line counts of the two files no longer matched. Some of the Finnish and Swedish sentences got combined, some were split into two or more, and some were dropped completely. This fdg behaviour effectively destroyed the sentence alignment: when it happens at the beginning of the file, all following sentences are misaligned. Additionally, the Finnish and Swedish sentences were changed in different ways.

In order to get good results, the input data has to be as good as possible. Even if the data is not perfect, we found it unacceptable that a tool we used would significantly decrease the quality of our data set. Therefore we decided to spend enough time to clean the data properly so that the sentence alignment is preserved.
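A cheap sanity check after every processing step is to confirm that both sides of the corpus still have the same number of lines. The helper below is a hypothetical sketch (the file names follow the clean.fi / clean.sv naming used later), not the actual preprocessor:

def check_alignment(fi_path, sv_path):
    # One line = one sentence, so both sides must have the same number of lines,
    # otherwise the sentence alignment has been broken somewhere upstream.
    with open(fi_path, encoding="utf-8") as fi, open(sv_path, encoding="utf-8") as sv:
        fi_lines = sum(1 for _ in fi)
        sv_lines = sum(1 for _ in sv)
    if fi_lines != sv_lines:
        raise ValueError("alignment broken: %d Finnish vs %d Swedish lines" % (fi_lines, sv_lines))
    return fi_lines

# e.g. check_alignment("clean.fi", "clean.sv") after the fdg + reconstruction step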

Filter

We created a filter which replaces every word by its base form and morphological word tag(s).

hex N helsinki N-SG-GEN pörssi N-SG-NOM tiedotus#väline N-PL-NOM
akti N-SG-PTV säästö#pankki N-SG-NOM oyj N-SG-NOM
pörssi#tiedote N-SG-NOM
hex N-NOM helsingfors N-SG-NOM böra PRES informations#medium N-PL-NOM
aktia N-SG-NOM spar#bank N-SG-NOM abp N-SG-NOM
börs#meddelande NDE
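The filter output above can be produced from the fdg output roughly as in the sketch below. The column layout we assume (index, word, base form, then syntax and morphology fields on one whitespace-separated line) is a simplification of the real fdg format:

def filter_token(fdg_line):
    # Turn one fdg token line into "baseform TAG-TAG-...".
    fields = fdg_line.split()
    base_form = fields[2]
    # keep only morphological tags such as N, SG, GEN; drop dependency fields like "attr:>2" or "&A>"
    tags = [f for f in fields[3:] if f.isalnum() and f.isupper()]
    return base_form + " " + "-".join(tags) if tags else base_form

print(filter_token("1 Bensiinin bensiini attr:>2 &A> N SG GEN"))
# -> "bensiini N-SG-GEN"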

Tools used

The whole training process with the given data took about 3.5-5 hours for one model (FiT-SvT or SvT-Sv), and translating the 1000 test sentences took another 1.2 hours per model (run twice), so the whole process took about 12-13 hours.

Create GIZA-compatible input, converting words into numbers and creating a number-word file:

plain2snt.out fi sv

Create the co-occurrence file:

snt2cooc.out fi.vcb sv.vcb fi_sv.snt > fi-sv.cooc

Create the Swedish language model:

ngram-count -text clean.sv -lm sv_model.lm -interpolate -kndiscount

Run the Moses script, which uses mkcls, GIZA++ and other tools:

train-factored-phrase-model.perl --root-dir . --corpus clean --f fi --e sv --scripts-root-dir /work/mt/bin/scripts --first-step 1 --last-step 9 --lm "0:16:sv_model.lm" --decoding-steps "t0,g0" --translation-factors "0-0" --reordering-factors "0-0" --generation-factors "0-0" --parallel >> moses.log

SRILM

We used the SRI Language Modeling Toolkit to generate our Swedish language model using the ngram-count command.

mkcls

We used mkcls v2, a tool to train word classes using a maximum-likelihood criterion; it can be found at http://www.fjoch.com/mkcls.html. Mkcls is used from within the Moses script train-factored-phrase-model.perl.

GIZA++

GIZA++ is used from within the Moses script train-factored-phrase-model.perl.

Improvements

There is a lot of room for improvement. Given the restricted time left for our backup plan, we unfortunately were not able to train models with many different parameters in order to improve the translation quality.

Some possible improvements are:

  • We would have liked to split the data set into a training set and a validation set, and to use the validation set for finding the best possible parameters.
  • Add human-generated rules for those tags for which a corresponding rule exists (see the sketch after this list). So, this
BF TAGS -> BF TAGS
would be changed to this
BF TAGS RULE -> BF TAGS RULE
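As a rough illustration of this improvement, the sketch below appends a rule identifier to those tokens whose tag pattern matches a hand-written rule. The rule table and the rule names are hypothetical:

# Hypothetical rule table: tag pattern (as produced by the filter) -> rule identifier
RULES = {
    "N-SG-PTV": "RULE_PTV_OBJ",    # partitive object rule, cf. the "Haluan kahvia" example
    "N-SG-GEN": "RULE_GEN_KANSSA", # genitive + postposition rule
}

def add_rule_features(line):
    # Turn "BF TAGS BF TAGS ..." into "BF TAGS RULE BF TAGS RULE ..." where a rule applies.
    tokens = line.split()
    out = []
    for bf, tags in zip(tokens[0::2], tokens[1::2]):
        out.extend([bf, tags])
        if tags in RULES:
            out.append(RULES[tags])   # tokens without a matching rule keep just "BF TAGS"
    return " ".join(out)

print(add_rule_features("haluta V-ACT kahvi N-SG-PTV"))
# -> "haluta V-ACT kahvi N-SG-PTV RULE_PTV_OBJ"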

Milestones

 

Questions and communication

 
  • hmm, system clock check (22:03:22) -- JanneArgillander - 09 Jan 2007 - 22:03
  • Even though our original idea would have been a state-of-the-art solution, we must forget it and do something else ASAP so that we can submit something... -- JanneArgillander - 06 Jan 2007 - 18:19
  • Ok, soon we are going back to the business... We just have been too busy. "No Worries(?), mate!" -- JanneArgillander - 29 Nov 2006 - 15:39



Topic attachments
Backup_plan.png (6.7 K, 2007-01-09) - The backup plan: architecture
Big_picture.png (8.8 K, 2007-01-06) - The architecture of our original plan, which we implemented up to some phase