HFST: Tutorial hfst-lexc and hfst-twolc

Introduction

The programs hfst-lexc and hfst-twolc presented in this tutorial are used for constructing a lexical transducer, i.e. a transducer, which associates a morphological analysis with a phonological representation corresponding to it.

The morphotax is handled using the lexicon compiler hfst-lexc. It associates the morphological analysis of a word-form with one or more morpho-phonological representations for the form. The mapping from morpho-phonological forms to phonological forms is accomplished by a twolc rule-set, which gives a set of two-level correspondences holding between all valid pairs of morpho-phonological and phonological strings.

Both the lexc lexicon and the twolc rule-set are compiled into transducers, which are combined into a single transducer using the tool hfst-compose-intersect. The lexical transducer thus created may be used both for analysis of word-forms and generation of word-forms out of analyses.

The programs hfst-lexc and hfst-twolc are open-source and have been modelled upon the corresponding proprietary Xerox programs lexc and twolc. Old lexicon-files and rule-sets written for the Xerox tools, should be usable with our tools as well, but may require some editing to compile properly. We try to cover the main differences between the Xerox tools and our programs in this tutorial.

The programs are demonstrated using a small example-lexicon and rule-set for finnish nouns. The nouns in the example exhibit gradation of the stem-final plosive when inflected. E.g. the form singular genetive case for the word poika (boy) is pojan, which examplifies two alternations, namely i:j between two vowels and the gradation k:0.

The lexicon and rulesets demonstrated here can be found from the attachments.

Lexc (for more info visit HfstLexc)

Hfst-lexc specifies a mapping between morphological and morph-phonological forms. This is accomplished by constructing a tree of lexicons, where the children of each lexicon describe possible continuations (i.e. inflections) for some of its word-forms. One way to group word-entries is to combine words in the same word-class together. Each word-entry is then given a continuation-class according to its declination. The continuation-classes giving the inflection-patterns for the different declination may then have continuation-classes of their own, e.g. classes of clitics.

An hfst-lexc lexicon has two compulsory parts: the list of multicharacter-symbols and the root-lexicon. It may additionally contain other lexicons, a meta-data section and an alphabet with all known symbols. The declaration of the alphabet speeds up the compilation-process.

All symbols longer than one unicode-character have to be declared. This is done in the multicharacter-symbol list. In our example-grammar the multicharacter-symbol list looks like this

  Multichar_Symbols
          ^T ^K ^P
          ^A
          +SG +NOM +GEN +PAR +TRA
          ##

The symbols ^K, ^P and ^T are morpho-phonemes participating in the gradation-phenomenon. The realization of the arch-phoneme ^A is governed by vowel-harmony. The symbol +SG marks singular and +NOM, +GEN, +PAR and +TRA are case-markers.

In addition to grammatical symbols there is the multicharacter-symbol ## signifying a word-boundary. In Xerox lexc word-boundaries are present in the lexicon by default, but you have to declare them in hfst-lexc. The choice of the symbol ## for word-boundary is arbitrary. The only thing that matters is that the same symbol is used in hfst-twolc, when referring to word-boundaries.

After the list of multicharacter-symbols, one or more lexicons follow. The first one of these is the root-lexicon and it is compulsory. In our example the root-lexicon looks like

  LEXICON Root
  0:##  Noun    ;

It states, that all valid pairs of morphological and morpho-phonological forms begin with a word-boundary 0:## and continue with something specified in the continuation-class Noun. If there would be any other word-classes besides nouns, e.g. adjectives, we'd have more entries in the root-lexicon, but now we only have one.

All lexicons have the same appearance. They begin with a declaration consisting of the key-word LEXICON followed by the name of the lexicon, which always begins with a capital letter. E.g.

  LEXICON LexName

The declaration is followed by one or more entries, which consist of a morphological and morpo-phonological form separated by a colon, followed by the name of a continuation-class and a semi-colon, which ends the entry.

The principle is made clear in the following excerpt from the continuation-class Noun, which is itself a lexicon

  LEXICON Noun
  tikka:tik^Ka  Cases   ;
  loppu:lop^Pu  Cases   ;
  katto:kat^To  Cases   ;
  ...

The first entry consists of the morphological form tikka (woodpecker), a morpho-phonological form tik^Ka and the continuation class Cases. Here the last plosive of tikka is subject to gradation, which we mark with ^K in the morpho-phonological form of the word.

The continuation-class Cases in turn looks like

  LEXICON Cases
  +SG+NOM:              Final     ;
  +SG+GEN:n             Final     ;
  +SG+PAR:^A            Final     ;
  +SG+TRA:ksi           Final     ;

and finally Final is defined

  LEXICON Final
  0:##    #       ;

The continuation-class # is a special one, the empty class. It signifies that forms end here. Note also how we take care not to use reserved word, such as End here, as the name of the lexicon.

A valid correspondance is one, which consists of parts from Root, Noun, Cases and Final in this order. An example of a valid correspondance is

  0 t i k k a +SG +TRA 0 0 : ## t i k ^K a k s i ##

Here 0 signifies the empty string. In the transducer compiled from the lexicon-file this correspondence is represented by the path

  0:## t i k k:^K a +SG:k +TRA:s 0:i 0:##

Hfst-lexc combines the morpholgical form and morpho-phonological form into a string of pairs. This is done separately for each lexicon. The principle for combining the strings is easy. Symbols at the same positions are paired together and if the strings have unequal length, the shorter one is padded with zeros at the end. E.g.

  xyx:XYZ       becomes x:X y:Y z:Z
  xy:XYZ        becomes x:X y:Y 0:Z
  :##           becomes 0:##

The lexicon is compiled using the command

  hfst-lexc -o grad.hfst grad.lexc

Here grad.hfst is the name of the transducer-file, that is produced as output.

Twolc (for more info visit HfstTwolC)

Two level rule formalism is based on concept, that there is a set of possible character pairs or correspondences in languages, and rules specify contexts in which these correspondences appear. When we start from the forementioned situation that in lexicon we have created a kind of a morphophonemic level, that defines words forms as patterns constructed from regular letters and possibly morphophonemes, which may vary on context rules we specify here. For example the lexicon might have defined word form poika+sg+nom, whose morphophonemic level was poika, we have alphabet that defines plausible pairs p:p o:o i:i k:k a:a, and no others and no rules, so the result will be poika. From this we can say that basic alphabet contains identity pairs of letters of language which we know that do not undergo any variation. Thus we write the full alphabet of a language to our source file, for our simple Finnish example we can cope with lower case alphabet a through z, , , , and , which can be considered the alphabet one mainly uses writing Finnish.


  Alphabet a b c d e f g h i j k l m n o p q r s t u v w x y z     

The part of alphabet that undergoes variation is second part to take in consideration and define. In Finnish gradation example we actually use two kinds of variations; the gradation concerns all stops, and it is complex, and does not really have a trivially definable context to tell us exactly which stops undergo gradation, so we have encoded this information on lexicon side by using special characters ^K, ^P and ^T, which always gradate, leaving other k’s, p’s and t’s non-gradating. This leaves us to define the actual variants gradation may cause: For strong grade k is k, p is p and t is t, which we mark as %^K:k and so forth (here’s one catch, many of the characters in ASCII set field apart from alphabets bear special meaning to twolc, so you need to always remember to prepend a % to them). For the weak grade we have more variants; k can become g, j or v, or ’ or even disappear, for which we use digit 0 and mark thusly %^K:’ %^K:v %^K:0 %^K:g %^K:j. Performing same listing for other stops we can add gradation to our set of feasible pairs:

  Alphabet a b c d e f g h i j k l m n o p q r s t u v w x y z     
  %^K:' %^K:v %^K:0 %^K:g %^K:j
  %^T:l %^T:r %^T:n %^T:0 %^T:d

Now final step is to recall, that we made symbol ## in our lexicon to signify word boundary, to aid us in rule writing. Of course this is not at all the kind of character we want to see in final result, so we make only feasible pair concerning this character to be disappearance, it is ##:0, which ensures that it will be deleted anywhere in final forms. This kind of auxiliary symbols, in most of the practical examples may be quite common.

Now with this definition alone we can create all the possible forms of Finnish language, but it also allows impossible variants like strong grades where weak should be, or wrong variant of weak grade. So what is left here is to write the rules, which restrict the variants to right contexts.

In most rules we need to refer concepts like any consonant, any vowel, and so forth. For this purpose twolc lets us define named variables called sets, which used thereafter apply for any of the member in named set. For Finnish morphophonology we now write all phonologically motivated sets of characters we might use in rules::

  Sets
        Cons = b c d f g h j k l m n p q r s t v w x z ' %^K %^P %^T  ;
        Vowel = a e i o u y   %^A %^O %^U ;
        GradationMP = %^K %^P %^T ;
        VowelHarmonyMP = %^A %^O %^U ;
        VclessStop = k p t ;
        Liquid = l r ;
        HighLabial = u y ;
        BackVowel = a o u ;

For all the gradation rules, the right side context determines whether weak or strong grade will be used. Because the context is same for all variants and all grades, and we want to avoid rewriting this potentially complex context for every rule, twolc provides way to define macros. The basic rule is that weak grade appears in in closed syllabi, so we sketch a definition of closed syllable macros named ClosedOffset and ClosedCoda like so:

  Definitions
        ClosedOffset = Cons: [ Cons: | #:0 ] ;
        ClosedCoda = Vowel ClosedOffset ;

The main part for selecting variant of gradation lies now in the left context, so next thing to do is plain write the contexts. In simplest cases we have context of one leftwards, that is when weak grade directly assimilates to a nasal. Combining this left context and right context of closed syllable for right context we get

k:g <=> n _ ClosedCoda
p:m <=> m _ ClosedCoda
and so forth, which we can further abbreviate, because of shared contexts:
  "Gradation after nasals"
  Cx:Cy <=> Cz _ ClosedCoda ; where Cx in GradationMP
                                  Cy in ( g m n )
                                  Cz in ( n m n )
                            matched ;

   ...

Filling in the gaps, we get complete ruleset which covers the variation needed for these examples. Only potential thing to do is to declare the variables used in rules, such as Cx, Cy and Cz. This is done in beginning of file in rule-variables declaration, that looks like this:

  Rule-variables
        Cx Cy Cz Vx Vy ;

When rule-variables are declared separately, the compiler can handle them specially and print more useful and sane debugging output.

The complete rule file can be then saved into a file. A suggested practice is to use recognisable suffix, such as .twol, and descriptive filename: gradation-rules.twol. Now we can process the rule file with hfst-twolc to produce transducer binary that can be combined lexicon to produce a lexical transducer. Hfst-twolc has calling conventions like any average GNU style command line tool, to name input file you use --input switch, to name output file --output is used, and for rest of the options you may call --help. Most typically you will always need only to call hfst-twolc --input rules.twol --output rules.hfst. For this Finnish gradation case it is:

  hfst-twolc --input gradation-rules.twol --output gradation-rules.hfst --resolve

Even in this relatively simple case hfst-twolc will produce quite a lot of output, as it lists all sets, definitions and rules as it reads them. Hfst-twolc also prints out rules that contain conflicting contexts in this part.

Conflicting contexts per se are not an error, in fact for non-trivial grammars they are likely to appear. It is however useful to check through these conflicts to ensure that what results is expected. For conflicts the --resolve option is meaningful, though. If it is not given, hfst-twolc does not modify rules to resolve conflicts, which will usually lead to unwanted results.

Intersecting Compose

Finally, to combine the ruleset with lexicon, an intersecting composition tool hfst-compose-intersect may be used. Use of this tool is relatively trivial as the calling conventions use same GNU or unix style standards as other our tools. Lexicon is given with parameter -1 and rulesets with -2. So to make our previous example with one ruleset and lexicon, a call hfst-compose-intersect -1 gradation-lexicon.hfst -2 gradation-rules.hfst .

After informative printout of compilation process, the result will be an inverted lexical transducer or a morphological generator. To test it you may use hfst-invert and then hfst-lookup, or to see contents, use hfst-fst2strings. All of these tools follow the same calling conventions, detailed in HfstCommandLineTools.


-- KristerLinden - 09 Sep 2008

Topic attachments
I Attachment Action Size Date Who Comment
Unknown file formatlexc grad.lexc manage 0.4 K 2010-12-31 - 13:14 UnknownUser Example lexc source file
Topic revision: r15 - 2014-02-19 - ErikAxelson
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback