HFST: Utilities for Compiling a Northern Sami Lexical Transducer


We tested the performance of HfstLexc2Fst and HfstTwolC by compiling a lexical transducer for Northern Sámi. The open-source lexc-lexicon and twol-grammar for Northern Sámi have been developed in the Sámi language technology project and are available to everyone through anonymous SVN. They are distributed under the GPL license.

The lexicon contained 61295 entries (i.e. stems and different kinds of affixes). The stems represented the word-classes verbs, common nouns, proper nouns, adjectives, adverbs, pronouns, numerals, particles, conjunctions, subjunctions and interjections (+ some minor classes like abbreviations).

The twol-grammar consisted of 105 rules.

The hfst-lexc and hfst-twolc input-file formats differ from the corresponding Xerox formats. Hence lexicon and rule files that have been developed for the Xerox compilers need to be modified for use with the hfst tools. The necessary changes are explained below.

All files were compiled on the CSC server corpus, which has an Intel(R) Xeon(R) 3.00GHz processor and 3652568 kB of memory. The lexicon and grammar were combined into a lexical transducer using HfstComposeIntersect.


Modifying the Northern Sámi lexicon

Only one modification is needed to make the Xerox lexc-files compile under hfst-lexc.

Xerox lexc by default adds a word-boundary # at the beginning and end of all words. hfst-lexc does not do this, so we need to add it explicitly. All lexical forms given by the Northern Sámi lexicon end in the lexicon ENDLEX:

 @D.NeedNoun.ON@%# ;

We just add word-boundaries

 @D.NeedNoun.ON@%# # ; ! This line differs. The first # is the word-boundary symbol.
                       ! The second one signifies that there are no continuation classes
                       ! for this lexicon.
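The textual change above can be sketched in Python. This is a minimal illustration, not part of the HFST tools: the helper name `add_word_boundary` is hypothetical, and it assumes the entry line contains no escaped `;` before the terminating one.

```python
def add_word_boundary(entry_line):
    """Insert ' #' before the terminating ';' of a lexc entry line.

    Assumption: the first ';' on the line is the entry terminator
    (no escaped '%;' appears before it).
    """
    body, sep, rest = entry_line.partition(";")
    if not sep:
        raise ValueError("not a lexc entry line (no ';' found)")
    return f"{body.rstrip()} # ;{rest}"


# The ENDLEX entry as it appears in the Xerox version of the lexicon.
print(add_word_boundary(" @D.NeedNoun.ON@%# ;"))
```

Only the single entry in ENDLEX needs this change, so editing the file by hand works just as well.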

Compiling the lexicon using hfst-lexc

We used the command

time ~/SFST/src/hlexc/src/hfst-lexc --output=sami.hlexc sme-lex.txt abbr-sme-lex.txt acro-sme-lex.txt adj-sme-lex.txt adv-sme-lex.txt conjunction-sme-lex.txt interjection-sme-lex.txt noun-sme-lex.txt numeral-sme-lex.txt particle-sme-lex.txt pp-sme-lex.txt pronoun-sme-lex.txt punct-sme-lex.txt subjunction-sme-lex.txt verb-sme-lex.txt
The size of the binary produced was about 2.0 MB. The timing output was
real    2m16.015s
user    1m51.960s
sys     0m8.727s


Modifying the rule-file

The two-level rules need more modifications than the lexicon.

The alphabet of the original two-level grammar contains all of the symbols used, but only a few of the symbol pairs:

 a b c d e f g h i j k l m n o p q   ! small
 r s t u v w x y z      %-

 A B C D E F G H I J K L M N O P Q   ! capital
 R S T U V W X Y Z     

 e7:e e9:e i7:i o7:o o9:o u7:u 7:  ! Morphophonemes

 č đ ŋ  ŧ           ! Sámi letters.
 Č Đ Ŋ  Ŧ           !

 #:0 %^:0

 ':0 %/ ¤:0 '7:'
                            ! ' is for CnsGrad of the lg:lgg and l'l:ll type
                            ! ¤:0 prevents ConsGrad in certain words
                            ! '7 is the real apostroph
                            ! # is used to mark both lexicalised and
                            ! derived compounds

 h7:h h8:h g8:g m8:m n8:n   ! the x8 ones are consonants that alternate in
 H7:H H8:H G8:G M8:M N8:N   ! stem-final positions.

 j9:j b9:b d9:d g9:g h9:h k9:k m9:m n9:n p9:p s9:s t9:t z9:z 9: r9:r
 J9:J B9:B D9:D G9:G H9:H K9:K M9:M N9:N P9:P S9:S T9:T Z9:Z 9: R9:R

                            ! The x9 ones are consonants that never alternate.
                            ! The capital J9 etc. do not work.
 X1:0 X2:0 X3:0 X4:0 X5:0 X6:0 X7:0 X8:0 X9:0   ! diacritics
 Q1:0 Q2:0 Q3:0 Q4:0 Q5:0 Q6:0 Q7:0 Q8:0 Q9:0   ! They trigger morphophono-
 Y1:0 Y2:0 Y3:0 Y4:0 Y5:0 Y6:0 Y7:0 Y8:0 Y9:0   ! logical rules
 W1:0 W2:0 W3:0 W4:0 W5:0 W6:0 W7:0 W8:0 W9:0

 %>:0 %>7:%> :0        ! stem-suffix border mark, unvisible for normal use

In an hfst-twolc grammar-file, every pair used in the rules must be declared in the grammar. Hence we need to add the following pairs to the alphabet.
      ! Miikka added, because they are needed 
      ! by hfst-twolc, which has to know all pairs.

        j:i # k:0 t:0 h:0 d:0 ':t ':n ':l 0:s ':j ':v d:n ':k k:v d:t t:d ':d i:e u:o
        i: a:i :0 u:0 e:0 o:0 a:0 i:0 :0 n:0 z:s m:n h:t g:t b:t h:t : :0
        i: s:0 j:0 k:g p:0 f:0 l:0 ':b ':g ':r ':m b:m b:p ':s a:u p:b r:0 t:đ
        o:u e:i :č a: a:o a:e i:o i:u g8:0 n8:0 m8:0 p:t g:0 m:0 :0 đ:0 ŋ:0 v:0 c:0
        ':f ':p ':z ':c ': ':z g:ŋ g:k z:c č: c:z :a e7: g8:t h8:t m8:n ŧ:0 
        č:0 ':č ': h8:0 G8:0 M8:0 N8:0 H8:0
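One way to find such missing pairs is to compare the declared alphabet with the pairs that the rules actually use. The following is a hedged sketch only: the function names are hypothetical, the list of used pairs is assumed to have been collected separately (for example by hand), and comment stripping assumes that `!` never occurs as an escaped symbol.

```python
def declared_symbols(alphabet_text):
    """Collect whitespace-separated alphabet entries, ignoring '!' comments."""
    tokens = []
    for line in alphabet_text.splitlines():
        code = line.split("!", 1)[0]  # assumption: '!' always starts a comment
        tokens.extend(code.split())
    return set(tokens)


def missing_pairs(alphabet_text, used_pairs):
    """Return the used pairs that the alphabet does not declare, sorted."""
    declared = declared_symbols(alphabet_text)
    return sorted(pair for pair in set(used_pairs) if pair not in declared)


alphabet = """
 a b c d e   ! small
 e7:e #:0    ! a few pairs only
"""
# Hypothetical list of pairs occurring in the rules.
print(missing_pairs(alphabet, ["j:i", "#:0", "k:0"]))
```

The pairs reported as missing are the ones that would have to be appended to the alphabet.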

Xerox twolc recognizes flag-diacritics from the lexicon and does not interfere with them in any way. There is no similar automated mechanism in hfst-twolc (at least not yet). Hence flag-diacritics need to be declared as regular diacritics in the grammar. We add a Diacritics section to the grammar:


%@C.NeedNoun%@ %@D.NeedNoun.ON%@ %@P.NeedNoun.ON%@ %@U.Cap.Obl%@ %@U.Cap.Opt%@
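The entries for the Diacritics section could be collected from the lexicon files with a short script. This is a sketch under assumptions: the function name is hypothetical, flag-diacritics are assumed to have the form `@X.Feature@` or `@X.Feature.Value@` with `X` one of P, N, D, R, C, U, and the escaping simply follows the `%@` pattern seen in the line above.

```python
import re

# Assumed shape of a flag-diacritic: @OP.Feature@ or @OP.Feature.Value@
FLAG_RE = re.compile(r"@[PNDRCU]\.[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)?@")


def twolc_diacritics(lexc_text):
    """Return the flag-diacritics found in lexc text, escaped as %@...%@."""
    flags = sorted(set(FLAG_RE.findall(lexc_text)))
    return [flag.replace("@", "%@") for flag in flags]


sample = " @D.NeedNoun.ON@%# # ;\n @C.NeedNoun@ Next ;\n"
print(" ".join(twolc_diacritics(sample)))
```

The `Next` continuation class in the sample is made up for illustration; only the flag-diacritics matter here.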

For now, we also need to add a Rule-variables section, which lists every rule-variable used in the grammar. Omitting it does not cause anything to be compiled incorrectly, but hfst-twolc then emits numerous warnings about undeclared symbols. We might fix this in future versions of hfst-twolc. We add the following between the Diacritics and Sets sections:

Cx Cy Cz Vx Vy Vz ;

Compiling the rules using hfst-twolc

We used the command

time htwolc --input sami.twol --output sami.twolc --resolve
The size of the binary produced was about 7.7 MB. The timing output was
real    3m10.985s
user    2m56.345s
sys     0m12.346s


To combine the lexicon and the rules into a lexical transducer, we used the command

cat sami.twolc | time hfst-compose-intersect --lexicon sami.hlexc
The size of the binary produced was about 2.7 MB. The timing output was
real 276.56s    (i.e. ~ 4min36s)
user 249.37s    (i.e. ~  4min9s)
sys 24.45s 
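
The wall-clock ("real") times of the three steps reported on this page can be added up with a few lines of Python. The sketch below only does arithmetic on the figures quoted above; the helper name `to_seconds` is hypothetical.

```python
import re


def to_seconds(timing):
    """Convert a timing such as '2m16.015s' or '276.56s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", timing.strip())
    if match is None:
        raise ValueError(f"unrecognised timing: {timing!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# "real" times as reported for each compilation step.
REAL_TIMES = {
    "hfst-lexc": "2m16.015s",
    "hfst-twolc": "3m10.985s",
    "hfst-compose-intersect": "276.56s",
}

total = sum(to_seconds(t) for t in REAL_TIMES.values())
print(f"total real time: {total:.2f} s ({total / 60:.1f} min)")
```

Taken together, the three steps took roughly ten minutes of wall-clock time on the test machine.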


The Sami transducer can be fetched from the Sourceforge download page.

-- MiikkaSilfverberg - 05 Oct 2008
Topic revision: r13 - 2010-12-31 - TommiPirinen