OMorFi: a SFST-style interface implemented with OpenFst

This page describes how OpenFst was used to create a SFST-style interface.

Differences between functions in SFST and OpenFst interfaces

OpenFst functions often modify their input arguments.

The character replacing method in OpenFst is made through composition at both ends of the argument transducer as deleting arcs in a transducer is difficult. Character replacing is needed when an agreement variable (presented by a single symbol) must be replaced with one of the values of that agreement variable (i.e. one of the paths in the transducer defining the agreement variable). Replacing symbol A with transducer B in transducer T is done as follows (pseudocode):

replace_char(char A, Transducer B, Transducer T) == ( [^A] | {_B}:A )* || T || ( [^A] | A:{^B} )*

The leftmost transducer maps any other symbol than A to itself or the lower level of transducer B to A. The rightmost transducer maps any other symbol than A to itself or A to upper level of transducer B.

As a value of an assignment variable B can be longer than one symbol, {_B}:A might have input epsilons and A:{^B} output epsilons that make the resulting transducer asynchronous.

In SFST the symbol tables are stored with the transducers. In OpenFst the symbol table is stored in binary form in file 'symbol_table' every time a transducer is written in file. It is the programmer's responsibility to make sure that all transducers use the same symbol table by defining the characters same way in each file.

Differences in performance between SFST and OpenFst with OMorFi rules

Compiling OMorFi rules at ruuvi.helsinki.fi takes less than 10 minutes with SFST and almost an hour with OpenFst. At corpus.csc.fi the times are 15 minutes and one and half hour. The HFST code must clearly be optimised. It seems that the resulting transducers are equivalent. The HFST transducer takes about twice as much memory as SFST transducer, maybe because the zero weights are stored, too.

The slowest and most memory-consuming part of compiling the rules is composition of the lexicon and the rules. The rules are applied in cascade, so the result gets bigger every time it is composed with another rule. As a result the computation of composition gets slower. Computing all cascade compositions by intersection will probably be faster. The intersection composition is already implemented with SFST functions by Miikka Silfverberg. It must be implemented with OpenFst, too.

Analysing 1000 words takes about 5 seconds with OpenFst and 43 seconds with fst-infl (a SFST program). Both programs use the principle of composing the word to be analysed with the rules transducer, extracting the surface level of the resulting composition and printing all strings generated by this automaton. It seems that composition is done more efficiently in OpenFst than in SFST. However, using fst-infl2 (a more efficient SFST program for analysing) it takes only about 1 second to analyse 1000 words. All tests were done at corpus.csc.fi.

HFST CVS Repository

The HfstCvsRepositoryOLD.

-- ErikAxelson - 17 Mar 2008

Topic revision: r7 - 2009-09-30 - ErikAxelson
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback