HFST: Notes on the compilation of two-level rules

Alternative way to compile rules with several context parts

There are two well known methods to compile right arrow two-level rules with more than one context. Maybe there is yet another method which is relevant in the framework of intersecting composition. The method would reduce the multiple contexts into single context rules with auxiliary variants of the surface character.

Schematically the compilation goes as follows:

DirectedGraphPlugin_1.png diagram

An example

An example test lexicon:

LEXICON Root
raaka:raaKan #;
liuku:liuKun #;

Example rules as written according to the normal two-level rule formalism:

Alphabet
 a i k l n u K:%' ;
Rules
"Orig" K:%' => a _ a ;
        u _ u ; 

This rule set could be preprocessed into an equivalent form using two auxiliary symbols '1 and '2:

Alphabet
 a i k l n u K:%'1 K:'2 ;
Rules
"Mod1" K:%'1 => a _ a ;
"Mod2" K:%'2 => u _ u ;

In the general case, the auxiliary symbols have to be taken care of in all places where the original symbols occur. In any context, the original symbol must be replaced with the set of it and all its auxiliary versions. In all left-arrow rules where the original symbol is the right hand side of the pair, one must allow there it and all its auxiliary versions.

The modified rule set should be first composed with the lexicon and the result of this composition should be composed with a transduction which converts '1 and '2 into plain '. The following transducer given in text format would suit the example:

0
0       0       a       a
0       0       i       i
0       0       k       k
0       0       l       l
0       0       n       n
0       0       r       r
0       0       u       u
0       0       '1      '
0       0       '2      '

A makefile like the following would do the thing:

comptest.fst : comptest-lexc.fst comptest-twolc.fst fix-fst.fst Makefile
        hfst-compose-intersect -l comptest-lexc.fst comptest-twolc.fst |\
 hfst-compose -o comptest.fst -2 fix-fst.fst

comptest-lexc.fst : comptest.lexc Makefile
        hfst-lexc -o $@  $<

comptest-twolc.fst : comptest.twolc Makefile
        hfst-twolc -v -N -o $@  $<

fix-fst.fst : fix-fst.txt Makefile
        hfst-txt2fst -e @0@ -o $@ $<

Notes

  1. The compilation of => rules with multiple contexts requires somewhat complex formulas and appears to become computationally complex when the number of context parts increases. It is not unusual that the size of the compiled rule increases exponentially in respect with the number of contexts. Complex behaviour occurs when the contexts are relatively independent of each other, which is usually the case.
  2. The context resolution scheme is normally implemented as a pre-processing stage where the conflicts are detected and thereafter resolved by copying contexts to other rules. This may create rules with many contexts.
  3. There appears to be no obvious reason why the proposed scheme would be more complex than the present one. It could be substantially more efficient in demanding cases.
  4. It remains to be proven whether the construction is really equivalent with the current methods. The proposed scheme may be defective by not allowing
    • free overlaps of contexts of successive occurrences (but the compilation formula for single context rules does allows for repeated occurrences where contexts overlap, e.g. aaKaaKaa:aa'aa'aa),
    • inclusion of the centers in the contexts of other occurences (but replacing the occurrences of the symbol by a set including its variants should prevent this problem),
    • occurrences of the centers in mixed contexts etc.
  5. It appears that the proposed compilation method is in a direct relation to the method of R. Kaplan and M. Kay. The triple consisting of an indexed brackets and the center character corresponds to the variant. The difference lies in that in the framework of Kaplan and Kay all rule contexts had to be compiled into a single transducer (which is computationally heavy) whereas in the current proposal all contexts can be compiled separately (which is fast). The overhead of fixing the variants is expected to be much less than the gain in the compilation.

-- KimmoKoskenniemi - 2010-03-31

Topic revision: r5 - 2017-11-09 - ErikAxelson
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback