HFST: Utilities for Compiling a Northern Sami Lexical Transducer

Tip: Current versions of the giellatekno projects support HFST compilation out of the box. Just run make GTLANG=sme hfst to build Northern Sámi with the HFST tools. To understand the details under the hood, you can still read this documentation.

Introduction

We've tested the performance of HfstLexcWrapper and HfstTwolC (of the version 2 branch) by compiling a lexical transducer for Northern Sámi. The open-source lexc-lexicon and twol-grammar for Northern Sámi have been developed in the Sámi language technology project and are available to everyone through anonymous SVN. They are distributed under the GPL license.

The lexicon contained 61295 entries (i.e. stems and different kinds of affixes). The stems represented the word classes verbs, common nouns, proper nouns, adjectives, adverbs, pronouns, numerals, particles, conjunctions, subjunctions and interjections (plus some minor classes such as abbreviations).

The twol-grammar consisted of 105 rules.

The hfst-lexc and hfst-twolc input-file formats differ from the corresponding Xerox formats. Hence lexicon and rule files that have been developed for the Xerox compilers need to be modified for use with the hfst-tools. The necessary changes are explained below.

All files were compiled on the CSC server corpus, which has an Intel(R) Xeon(R) 3.00GHz processor and 3652568 kB of memory. The lexicon and grammar were combined into a lexical transducer using HfstComposeIntersect.

hfst-lexc

Modifying the Northern Sámi lexicon

Only one modification is needed to make Xerox lexc-files compile under hfst-lexc.

Xerox lexc by default adds a word-boundary # at the beginning and end of all words. This was not done by hfst-lexc at that time, so we needed to do it explicitly. All lexical forms given by the Northern Sámi lexicon end in the lexicon ENDLEX:

LEXICON ENDLEX
 @D.NeedNoun.ON@%# ;

We just added the word-boundaries

LEXICON ENDLEX
 @D.NeedNoun.ON@%# # ; ! This line differs. The first # is the word-boundary symbol.
                       ! The second one signifies that there are no continuation-classes for this
                       ! lexicon.

Compiling the lexicon using hfst-lexc

We used the command

time hfst-lexc --output=sami.hlexc sme-lex.txt abbr-sme-lex.txt acro-sme-lex.txt adj-sme-lex.txt adv-sme-lex.txt conjunction-sme-lex.txt interjection-sme-lex.txt noun-sme-lex.txt numeral-sme-lex.txt particle-sme-lex.txt pp-sme-lex.txt pronoun-sme-lex.txt punct-sme-lex.txt subjunction-sme-lex.txt verb-sme-lex.txt
The size of the binary produced was about 2.0 MB. The timing was:
real    2m16.015s
user    1m51.960s
sys     0m8.727s

hfst-twolc

Modifying the rule-file

More modifications are needed for the two-level rules than for the lexicon.

The alphabet of the original two-level grammar contains all of the symbols used, but only a few of the pairs:

Alphabet
 a b c d e f g h i j k l m n o p q   ! small
 r s t u v w x y z      %-
                           

 A B C D E F G H I J K L M N O P Q   ! capital
 R S T U V W X Y Z     
                       

 e7:e e9:e i7:i o7:o o9:o u7:u 7:  ! Morphophonemes

 č đ ŋ  ŧ           ! Sámi letters.
 Č Đ Ŋ  Ŧ           !

 #:0 %^:0


 ':0 %/ ¤:0 '7:'
                            ! ' is for CnsGrad of the lg:lgg and l'l:ll type
                            ! ¤:0 prevents ConsGrad in certain words
                            ! '7 is the real apostroph
                            ! # is used to mark both lexicalised and
                            ! derived compounds



 h7:h h8:h g8:g m8:m n8:n   ! the x8 ones are consonants that alternate in
 H7:H H8:H G8:G M8:M N8:N   ! stem-final positions.

 j9:j b9:b d9:d g9:g h9:h k9:k m9:m n9:n p9:p s9:s t9:t z9:z 9: r9:r
 J9:J B9:B D9:D G9:G H9:H K9:K M9:M N9:N P9:P S9:S T9:T Z9:Z 9: R9:R

                            ! The x9 ones are consonants that never alternate.
                            ! The capital J9 etc. do not work.
 X1:0 X2:0 X3:0 X4:0 X5:0 X6:0 X7:0 X8:0 X9:0   ! diacritics
 Q1:0 Q2:0 Q3:0 Q4:0 Q5:0 Q6:0 Q7:0 Q8:0 Q9:0   ! They trigger morphophono-
 Y1:0 Y2:0 Y3:0 Y4:0 Y5:0 Y6:0 Y7:0 Y8:0 Y9:0   ! logical rules
 W1:0 W2:0 W3:0 W4:0 W5:0 W6:0 W7:0 W8:0 W9:0

 %>:0 %>7:%> :0        ! stem-suffix border mark, unvisible for normal use

;
In an hfst-twolc grammar-file, every pair used in the grammar needs to be declared. Hence we need to add the following pairs to the alphabet:
 
      ! Miikka added, because they are needed 
      ! by hfst-twolc, which has to know all pairs.

        j:i # k:0 t:0 h:0 d:0 ':t ':n ':l 0:s ':j ':v d:n ':k k:v d:t t:d ':d i:e u:o
        i: a:i :0 u:0 e:0 o:0 a:0 i:0 :0 n:0 z:s m:n h:t g:t b:t h:t : :0
        i: s:0 j:0 k:g p:0 f:0 l:0 ':b ':g ':r ':m b:m b:p ':s a:u p:b r:0 t:đ
        o:u e:i :č a: a:o a:e i:o i:u g8:0 n8:0 m8:0 p:t g:0 m:0 :0 đ:0 ŋ:0 v:0 c:0
        ':f ':p ':z ':c ': ':z g:ŋ g:k z:c č: c:z :a e7: g8:t h8:t m8:n ŧ:0 
        č:0 ':č ': h8:0 G8:0 M8:0 N8:0 H8:0
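
Checking which pairs are still undeclared can be done mechanically. The following Python sketch is not part of HFST. It assumes the body of the Alphabet section is available as text, that symbols are separated by whitespace, and that everything after a ! on a line is a comment (an escaped %! would be mishandled). The list of needed pairs is assumed to come from elsewhere, for example collected by hand from the rules. The function names are hypothetical.

```python
def declared_symbols(alphabet_text):
    """Collect the whitespace-separated symbols of an Alphabet section.

    Simplification: everything after '!' on a line is treated as a comment.
    """
    symbols = set()
    for line in alphabet_text.splitlines():
        code = line.split("!", 1)[0]
        for token in code.split():
            # Skip the section keyword and the terminating semicolon.
            if token not in ("Alphabet", ";"):
                symbols.add(token)
    return symbols


def missing_pairs(alphabet_text, needed):
    """Return the needed pairs that are not declared, in their given order."""
    declared = declared_symbols(alphabet_text)
    return [pair for pair in needed if pair not in declared]


if __name__ == "__main__":
    alphabet = """Alphabet
     a b k          ! small
     #:0 %^:0
    ;"""
    # Pairs that still have to be added to the alphabet by hand.
    print(missing_pairs(alphabet, ["k:0", "#:0", "j:i"]))
```

Note that a plain symbol such as k is compared literally here, so it does not count as a declaration of k:0.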

Xerox twolc recognizes flag-diacritics from the lexicon and does not interfere with them in any way. There is no similar automated mechanism in hfst-twolc (at least not yet). Hence flag-diacritics need to be declared as regular diacritics in the grammar. We add a Diacritics section to the grammar:

Diacritics

%@C.NeedNoun%@ %@D.NeedNoun.ON%@ %@P.NeedNoun.ON%@ %@U.Cap.Obl%@ %@U.Cap.Opt%@

;
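
The list of flag-diacritics can be collected from the lexicon files instead of being typed by hand. The following Python sketch is not part of HFST. It assumes that flag-diacritics have the shape @X.Feature@ or @X.Feature.Value@ with X one of P, N, D, R, C and U, and that everything after a ! on a line is a comment. The function names are hypothetical. The output escapes each @ as %@, as in the section above.

```python
import re

# Assumed shape of a flag-diacritic: @OP.Feature@ or @OP.Feature.Value@.
FLAG_RE = re.compile(r"@[PNDRCU]\.[^@.\s]+(?:\.[^@.\s]+)?@")


def collect_flags(lexc_text):
    """Return the distinct flag-diacritics found in lexc source, sorted."""
    flags = set()
    for line in lexc_text.splitlines():
        code = line.split("!", 1)[0]  # simplification: drop comments
        flags.update(FLAG_RE.findall(code))
    return sorted(flags)


def diacritics_section(flags):
    """Format the flags as a twolc Diacritics section with @ escaped as %@."""
    escaped = " ".join(flag.replace("@", "%@") for flag in flags)
    return "Diacritics\n\n" + escaped + "\n\n;"


if __name__ == "__main__":
    sample = """LEXICON ENDLEX
     @D.NeedNoun.ON@%# # ; ! end of word
    """
    print(diacritics_section(collect_flags(sample)))
```

In practice the text of all lexicon files would be concatenated and passed to collect_flags, and the result compared against the flags that the grammar really needs.
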
For now, we also need to add a Rule-variables section, which contains every rule-variable used in the grammar. Omitting it does not cause anything to be compiled incorrectly, but it produces numerous irritating warnings about symbols that haven't been declared. We might fix this in future versions of hfst-twolc. We add the following between the Diacritics and Sets sections.
Rule-variables

Cx Cy Cz Vx Vy Vz ;

Compiling the rules using hfst-twolc

We used the command

time htwolc --input sami.twol --output sami.twolc --resolve
The size of the binary produced was about 7.7 MB. The timing was:
real    3m10.985s
user    2m56.345s
sys     0m12.346s

hfst-compose-intersect

We used the command

cat sami.twolc | time hfst-compose-intersect --lexicon sami.hlexc
The size of the binary produced was about 2.7 MB. The timing was:
real 276.56s    (i.e. ~ 4min36s)
user 249.37s    (i.e. ~  4min9s)
sys 24.45s 
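
The three timings are reported in two different formats, so adding them up takes a small conversion. The following Python sketch uses only the real times quoted in this document. The function and variable names are hypothetical. It converts each time stamp to seconds and sums them, which gives about 604 seconds, i.e. roughly ten minutes of wall-clock time for the whole build.

```python
import re

# Real (wall-clock) times as reported in this document.
REAL_TIMES = {
    "hfst-lexc": "2m16.015s",
    "hfst-twolc": "3m10.985s",
    "hfst-compose-intersect": "276.56s",
}


def to_seconds(stamp):
    """Convert a time stamp such as '2m16.015s' or '276.56s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError("unrecognised time stamp: " + stamp)
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def total_real_seconds(times=REAL_TIMES):
    """Sum the wall-clock times of all build stages, in seconds."""
    return sum(to_seconds(stamp) for stamp in times.values())


if __name__ == "__main__":
    for stage, stamp in REAL_TIMES.items():
        print(f"{stage}: {to_seconds(stamp):.3f} s")
    print(f"total: {total_real_seconds():.2f} s")
```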

Downloading

The Sami transducer can be fetched from the SourceForge download page or, for more up-to-date versions, directly from the giellatekno SVN.


-- MiikkaSilfverberg
Topic revision: r15 - 2014-02-10 - ErikAxelson
 