Difference: HfstApplicationTutorial (1 vs. 9)

Revision 9 (2017-02-20) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 451 to 451
 The pattern matching application hfst-pmatch can be used to build a trivial tokenizer by itself:
Added:
>
>
set need-separators off
define nonword Whitespace | Punct;
define word LC([nonword | #]) [\nonword]+ RC([nonword | #]);
define token word | Punct;
Line: 474 to 475
 :
Changed:
<
<
To directly produce the tokens in some serialized format, there's hfst-proc2 (to be renamed in the near future). By default it outputs one token per line, and nothing else (meaning text not identified as being part of a token is suppressed). There are various output options to support piping the stream of tokens for further processing.
>
>
To directly produce the tokens in some serialized format, there's hfst-tokenize. By default it outputs one token per line, and nothing else (meaning text not identified as being part of a token is suppressed). There are various output options to support piping the stream of tokens for further processing. It also allows you to skip writing the tokenization rules by directly running it on an HFST transducer, in which case it produces a simple default tokenizer based on the strings accepted by the transducer, together with typical punctuation handling.
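A minimal sketch of the whole pipeline (the file names are hypothetical, and option details may vary between versions):

# compile the pmatch ruleset above into a tokenizer
hfst-pmatch2fst < tokenize.pmscript > tokenize.pmhfst
# feed it running text; by default each recognised token comes out on its own line
echo "He took his vorpal sword in hand." | hfst-tokenize tokenize.pmhfst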
  Of course, tokenization isn't quite as simple as that. In English, for example, applications typically expect contractions (like isn't) comprising multiple words to produce multiple tokens (is and n't). We might redefine our tokenizer to know about the finite number of English contractions:
Added:
>
>
set need-separators off
define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;
define nonword Whitespace | Punct;
define word LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
Line: 491 to 493
Besides refinements like this, or allowing hyphens inside words, or phenomena like compounding in other languages, we may want to consider some multiword units to be single tokens. For example, for most purposes it would make sense to consider New York or car park to be single tokens. Issues like this may become complex and require a considerable amount of linguistic data (not to mention decision-making). If we have a satisfactory finite-state dictionary (or morphology) that does what we want, we can largely piggy-back our tokenizing on top of it:
Added:
>
>
set need-separators off
 define morphology @bin"/path/to/morphology.hfst";

! If the morphology has morphological tags in the output, we might
! want to get rid of them with eg.
! define morphology_input [morphology].i;

Revision 8 (2015-09-02) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 289 to 289
 (Commentary to come)
Changed:
<
<
Define bnc @bin"bnc.hfst";
Define adjectives bnc .o. [?* { } "AJ0"|"AJS"];
Define positive_baseforms [[bnc .o. [?* { } "AJ0"] ] .o. [ { } "AJ0"] -> ""].l;
Define superlative_baseforms [[bnc .o. [?* { } "AJS"] ] .o. [ { } "AJS"] -> ""].l;
Define intersection positive_baseforms & superlative_baseforms;
Define inverted_bnc [bnc].i;
Define pos_to_super [bnc .o. [intersection { } "AJ0":"AJS"]] .o. inverted_bnc;
Define TOP pos_to_super;
>
>
define bnc @bin"bnc.hfst";
define adjectives bnc .o. [?* { } "AJ0"|"AJS"];
define positive_baseforms [[bnc .o. [?* { } "AJ0"] ] .o. [ { } "AJ0"] -> ""].l;
define superlative_baseforms [[bnc .o. [?* { } "AJS"] ] .o. [ { } "AJS"] -> ""].l;
define intersection positive_baseforms & superlative_baseforms;
define inverted_bnc [bnc].i;
define pos_to_super [bnc .o. [intersection { } "AJ0":"AJS"]] .o. inverted_bnc;
regex pos_to_super;
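One way to run this script (a sketch; the script file name is hypothetical) is to pipe it to hfst-xfst, appending a save stack command to write out the transducer that regex leaves on the stack:

( cat pos_to_super.xfst ; echo "save stack pos_to_super.hfst" ) | hfst-xfst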
 

With Python via the HFST API

Line: 451 to 451
 The pattern matching application hfst-pmatch can be used to build a trivial tokenizer by itself:
Changed:
<
<
Define nonword Whitespace | Punct;
Define word LC([nonword | #]) [\nonword]+ RC([nonword | #]);
Define token word | Punct;
Define TOP token EndTag(token) 0:"\n";
>
>
define nonword Whitespace | Punct;
define word LC([nonword | #]) [\nonword]+ RC([nonword | #]);
define token word | Punct;
regex token EndTag(token) 0:"\n";
 

We require strings of characters other than whitespace or punctuation to be surrounded by whitespace or punctuation, or an input boundary, and consider them to be tokens. Punctuation characters are tokens of their own. Finally we wrap each token in tags and emit a newline.

Line: 479 to 479
 Of course, tokenization isn't quite as simple as that. In English, for example, applications typically expect contractions (like isn't) comprising multiple words to produce multiple tokens (is and n't). We might redefine our tokenizer to know about the finite number of English contractions:
Changed:
<
<
Define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;
Define nonword Whitespace | Punct;
Define word LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
Define token word | Punct | contraction;
Define TOP token;
>
>
define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;
define nonword Whitespace | Punct;
define word LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
define token word | Punct | contraction;
regex token;
 

Now we allow the right boundary of a token to be a contraction as well, and list contractions as tokens.

Line: 491 to 491
Besides refinements like this, or allowing hyphens inside words, or phenomena like compounding in other languages, we may want to consider some multiword units to be single tokens. For example, for most purposes it would make sense to consider New York or car park to be single tokens. Issues like this may become complex and require a considerable amount of linguistic data (not to mention decision-making). If we have a satisfactory finite-state dictionary (or morphology) that does what we want, we can largely piggy-back our tokenizing on top of it:
Changed:
<
<
Define morphology @bin"/path/to/morphology.hfst";
>
>
define morphology @bin"/path/to/morphology.hfst";
! If the morphology has morphological tags in the output, we might
! want to get rid of them with eg.
Changed:
<
<
! Define morphology_input [morphology].i;
>
>
! define morphology_input [morphology].i;
 
Changed:
<
<
Define nonword Whitespace | Punct;
Define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;
>
>
define nonword Whitespace | Punct;
define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;
! We might have a more specific idea of punctuation, contractions
! or some other orthographic phenomenon from the morphology,
! which could be extracted using tags at this point
Changed:
<
<
Define ruleword LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
Define morphoword morphology RC([[nonword - "'"] | # | contraction]);
>
>
define ruleword LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
define morphoword morphology RC([[nonword - "'"] | # | contraction]);
  ! We consider contractions here, but they may be implicitly handled by the morphology too
Changed:
<
<
Define token morphoword | ruleword | Punct;
Define TOP token;
>
>
define token morphoword | ruleword | Punct;
regex token;
 

Revision 7 (2015-08-19) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 448 to 448
 

Building a tokenizer

Changed:
<
<
The pattern matching application hfst-pmatch can be used to build a reasonable tokenizer by itself:
>
>
The pattern matching application hfst-pmatch can be used to build a trivial tokenizer by itself:
 
Define nonword Whitespace | Punct;
Line: 457 to 457
 Define TOP token EndTag(token) 0:"\n";
Added:
>
>
We require strings of characters other than whitespace or punctuation to be surrounded by whitespace or punctuation, or an input boundary, and consider them to be tokens. Punctuation characters are tokens of their own. Finally we wrap each token in tags and emit a newline.
This ruleset, compiled with hfst-pmatch2fst and run with hfst-pmatch --extract, produces only the tokens without whitespace:
Line: 472 to 474
 :
Changed:
<
<
Besides refinements like allowing hyphens inside words, we may want to consider some multiword units to be single tokens. For example, for most purposes it would make sense to consider "New York" or "car park" to be single tokens.
>
>
To directly produce the tokens in some serialized format, there's hfst-proc2 (to be renamed in the near future). By default it outputs one token per line, and nothing else (meaning text not identified as being part of a token is suppressed). There are various output options to support piping the stream of tokens for further processing.

Of course, tokenization isn't quite as simple as that. In English, for example, applications typically expect contractions (like isn't) comprising multiple words to produce multiple tokens (is and n't). We might redefine our tokenizer to know about the finite number of English contractions:

 
Added:
>
>
Define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;
Define nonword Whitespace | Punct;
Define word LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
Define token word | Punct | contraction;
Define TOP token;

Now we allow the right boundary of a token to be a contraction as well, and list contractions as tokens.
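For instance, these rules should split He isn't here. into the five tokens He, is, n't, here and the final full stop. As a sketch (the file names are hypothetical; to delimit the tokens visibly in the output, add an EndTag or a 0:"\n" to the token definition as in the first ruleset):

hfst-pmatch2fst < contractions.pmscript > contractions.pmhfst
echo "He isn't here." | hfst-pmatch --extract contractions.pmhfst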

Besides refinements like this, or allowing hyphens inside words, or phenomena like compounding in other languages, we may want to consider some multiword units to be single tokens. For example, for most purposes it would make sense to consider New York or car park to be single tokens. Issues like this may become complex and require a considerable amount of linguistic data (not to mention decision-making). If we have a satisfactory finite-state dictionary (or morphology) that does what we want, we can largely piggy-back our tokenizing on top of it:

Define morphology @bin"/path/to/morphology.hfst";

! If the morphology has morphological tags in the output, we might
! want to get rid of them with eg.
! Define morphology_input [morphology].i;

Define nonword Whitespace | Punct;
Define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve} ;

! We might have a more specific idea of punctuation, contractions
! or some other orthographic phenomenon from the morphology,
! which could be extracted using tags at this point

Define ruleword LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
Define morphoword morphology RC([[nonword - "'"] | # | contraction]);

! We consider contractions here, but they may be implicitly handled by
! the morphology too

Define token morphoword | ruleword | Punct;
Define TOP token;
 

Revision 6 (2015-02-23) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 120 to 120
  return True

def process_surface(surface, baseform, tag):

Changed:
<
<
if tag != "NP0":
>
>
if tag != "NP0" and surface != "I":
    surface = surface.lower()
return surface.replace(":", "\:")

Revision 5 (2015-02-16) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 354 to 354
 One way to describe all the mistakes we want to be able to correct is with so-called replace rules:
Changed:
<
<
"a" -> ["b" | "c" | "d" | ...]::1.0 ,,
"b" -> ["a" | "c" | ... ]::1.0 ,,
>
>
"a" -> ["" | "a" | "b" | "c" | "d" | ...]::1.0 ,,
"b" -> ["" | "b" | "a" | "c" | ... ]::1.0 ,,
 etc.
Changed:
<
<
Here the left-hand side of each rule represents the error, and the right-hand side represents corrections. These may be given individual weights (perhaps calculated from an error corpus); in this case each rule simply has a weight of 1.0. The two commas at the end, ,,, indicate that the rules operate in parallel, ie. not one after the other, but all together.
>
>
Here the left-hand side of each rule represents the (possible) error, and the right-hand side represents (possible) corrections. These may be given individual weights (perhaps calculated from an error corpus); in this case each rule simply has a weight of 1.0. The two commas at the end, ,,, indicate that the rules operate in parallel, ie. not one after the other, but all together.
 
Changed:
<
<
This is sufficient to obtain a ranking of results, but since the rules can operate anywhere, there will be a very large number of results. The ruleset may be restricted to only consider some number of changes by adding a special marker to the correction rule, eg. "<CORR>", and composing the ruleset with a filter that allows only a certain number of corrections to pass. In addition (or alternatively), hfst-ospell may be directed to restrict its search by weight.
>
>
This is sufficient to obtain a ranking of results, but since the rules can operate anywhere, there will be a very large number of results. The ruleset may be restricted to only consider some number of changes by adding a special marker to the correction rule, eg. "<CORR>", and composing the ruleset with a filter that allows only a certain number of corrections to pass. In addition (or alternatively), hfst-ospell may be directed to restrict its search by weight. It's convenient to generate the rules with a script. A specialised script for generating error models is distributed with hfst-ospell, but here we use replace rules and the SWIG interface (we could as well have given the rules from the script to hfst-regexp2fst).

import libhfst

# our alphabet
alphabet = "abcdefghijklmnopqrstuvwxyz"
# we could also have used a structure with corrections enumerated along with
# weights, and iterated through that

# a special symbol used for restricting the number of corrections
corr = ' "<CORR>" '

# a function that produces the replace rules from an alphabet
def replace_rules(alphabet):
    corrections = ""
    weight = 1.0
    # first, the empty string may become the empty string anywhere
    corrections += '"" -> \t[ "" |\n'
    for a in alphabet:
    # insertions
        corrections += '\t[ "' + a + '" ' + corr + ' ]::' + str(weight) + ' |\n'
    # trim the extra left by the last pass
    corrections = corrections[:-3]
    corrections += ' ] ,,\n'
    for a in alphabet:
    # identity, which gets no weight
        corrections += '"' + a + '" -> \t[ "' + a + '" |\n'
    # deletion
        corrections += '\t[ ""' + corr + ']::' + str(weight) + ' |\n'
        for b in alphabet:
        #substitutions
            if a == b:
                # we already handled identity
                continue
            corrections += '\t[ "' + b + '"' + corr + ']::' + str(weight) + ' |\n'
        corrections = corrections[:-3]
        corrections += ' ] ,,\n'
    corrections = corrections[:-3]
    return corrections

corrections = replace_rules(alphabet)
# The following rule admits up to two corrections (each place where <CORR> is
# replaced by the empty string counts as one), each surrounded by anything.
corr_counter = '[ [? - "<CORR>"]* ( "<CORR>":0 ) [? - "<CORR>"]* ]^2'

regex = libhfst.XreCompiler()
corrector = regex.compile(corrections)
correction_filter = regex.compile(corr_counter)
corrector.compose(correction_filter)
outstream = libhfst.HfstOutputStream("errmodel.hfst",
                                     libhfst.TROPICAL_OPENFST_TYPE)
outstream.redirect(corrector)

With the alphabet reduced to just "ab", the rules look like this:

"" ->    [ "" |
   [ "a"  "<CORR>"  ]::1.0 |
   [ "b"  "<CORR>"  ]::1.0 ] ,,
"a" ->    [ "a" |
   [ "" "<CORR>" ]::1.0 |
   [ "b" "<CORR>" ]::1.0 ] ,,
"b" ->    [ "b" |
   [ "" "<CORR>" ]::1.0 |
   [ "a" "<CORR>" ]::1.0 ] 

As a validation of the basic idea, we can compose a misspelled word with the error model, and compose the result of that with the dictionary:

$ echo "appple" | hfst-strings2fst | hfst-compose -1 - -2 errmodel.hfst | hfst-compose -1 - -2 /srv/data/bnc/bnc.hfst | hfst-fst2strings -w

appple:nipple NN1   15.5273
appple:ripple NN1   15.0107
appple:ripple VVI   17.8066
...
appple:apple NN1   11.5068

Here we get information about tags, which we don't really need, and not necessarily relevant weights (here we might want to recalculate them using frequencies of the surface strings only).

Ideally, we'd like to have more control over the results, sort them by weight and get them more quickly. That can be achieved with hfst-ospell.
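For instance, an invocation along these lines should produce ranked corrections (a sketch: here the error model is given first and the dictionary second, but check your version's documentation for the exact argument order and options):

echo "appple" | hfst-ospell errmodel.hfst bnc.hfst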

 

Building a tokenizer

Revision 4 (2015-02-09) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 227 to 227
  As you can see, there's a lot of garbage and very uncommon adjectives. In any corpus, uncommon words comprise most of the vocabulary.
Changed:
<
<
If we repeat the previous for superlatives (AJS), we can extract the shared subset, ie. the intersection, of base forms that have both a positive and superlative reading. This requires composing the collections of positive and superlative adjectives with a rule that removes the tag:
>
>
If we repeat the previous for superlatives (AJS), we can extract the shared subset, ie. the intersection, of base forms that have both a positive and superlative reading. This requires composing the collections of comparative and superlative adjectives with a rule that removes the tag:
 
Changed:
<
<
echo '[ { } "AJ0"] -> ""' | hfst-regexp2fst | hfst-compose -1 positive_adjectives.hfst -2 - | hfst-project -p output > positive_baseforms.hfst
>
>
echo '[ { } "AJC"] -> ""' | hfst-regexp2fst | hfst-compose -1 comparative_adjectives.hfst -2 - | hfst-project -p output > comparative_baseforms.hfst
 
Changed:
<
<
Here we pipe the output of hfst-regexp2fst directly to hfst-compose, and with -2 - tell it to take the second argument from the pipe. The { } in the expression is for the space that goes between the base form and the tag. hfst-project removes the surface form and leaves just the analysis side (in this case, the lemma). positive_baseforms.hfst has strings like
>
>
Here we pipe the output of hfst-regexp2fst directly to hfst-compose, and with -2 - tell it to take the second argument from the pipe. The { } in the expression is for the space that goes between the base form and the tag. hfst-project removes the surface form and leaves just the analysis side (in this case, the lemma). comparative_baseforms.hfst has strings like
 
Changed:
<
<
scudéry-influenced
chicago-born
crankish
>
>
furry
juicy
homely
zwing
good
hygromete
 

Having done the same with superlatives, we can take the intersection:

Line: 250 to 253
 It contains more typical adjectives:
Changed:
<
<
grand
healthy
vile
urgent
murky
fruity
fussy
loathsome
dingy
itchy
>
>
profound
bald
merry
handsome
juicy
vast
jolly
 
Changed:
<
<
We need one more resource for our positive-to-superlative converter: a generating morphology that will produce the desired surface forms from a given base form and tag. That's simply the inverse of our original morphology:
>
>
We need one more resource for our comparative-to-superlative converter: a generating morphology that will produce the desired surface forms from a given base form and tag. That's simply the inverse of our original morphology:
 
hfst-invert bnc.hfst > bnc-generate.hfst
Line: 271 to 271
 Now we can add to the end of our intersection a space and a rule that changes positive tags to superlative ones, and compose that with the morphology on the left side and with its inverse on the right:
Changed:
<
<
echo '{ } "AJ0":"AJS"' | hfst-regexp2fst | hfst-concatenate -1 intersection.hfst -2 - | hfst-compose -1 bnc.hfst -2 - | hfst-compose -1 - -2 bnc-generate.hfst > positive_to_superlative.hfst
>
>
echo '{ } "AJC":"AJS"' | hfst-regexp2fst | hfst-concatenate -1 intersection.hfst -2 - | hfst-compose -1 bnc.hfst -2 - | hfst-compose -1 - -2 bnc-generate.hfst > comparative_to_superlative.hfst
 

Our result has pairs like this:

Changed:
<
<
blind:blindest
vile:vilest
classy:classiest
thirsty:thirstiest
husky:huskiest
gaudy:gaudiest
jammy:jammiest
pithy:pithiest
>
>
older:oldest
younger:youngest
farther:furthest
newer:newest
better:best
 

With hfst-xfst and hfst-pmatch

Revision 3 (2015-02-03) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 43 to 43
 
Changed:
<
<
python parse_bnc.py | hfst-stringst2fst --disjunct-strings --multichar-symbols="c5_tags.txt" | hfst-minimize > bnc.hfst
>
>
python parse_bnc.py | hfst-strings2fst --disjunct-strings --multichar-symbols="c5_tags.txt" | hfst-minimize > bnc.hfst
 
Line: 156 to 156
 # the frequency is used, resulting in a so-called log weight.

for (surface, baseform, tag) in analysis_count.keys():

Changed:
<
<
weight = -1*log(float(surface_count[surface])/total_tokens)
>
>
if surface_count[surface] > 1: # we may want to only count words that occur at least twice
    weight = -1*log(float(analysis_count[(surface, baseform, tag)])/total_tokens)
  print surface.encode("utf-8") + ":" + baseform.encode("utf-8") + " " + tag + "\t" + str(weight)
Line: 183 to 185
Consider adjectives. The BNC tags adjectives according to degree of comparison: they can be positive (good, AJ0), comparative (better, AJC) or superlative (best, AJS). For example the entry for "better" in our morphology is (discarding ambiguous tags):
Changed:
<
<
better   better NN1   7.916992
better   better VVB   7.916992
better   better VVI   7.916992
better   good AJC   7.916992
better   well AV0   7.916992
>
>
better   good AJC   8.430664
better   well AV0   8.845703
better   better VVB   13.671875
better   better NN1   14.198242
better   better VVI   14.480469
 
Changed:
<
<
The first reading represents an alternative spelling of "bettor", the second a finite verb ("to better oneself"), the third is the comparative adjective and the fourth is an adverb. Weights are all equal because we used surface strings for weighting.
>
>
The first reading is the comparative form of the adjective "good", the second an adverb, the third a finite verb, the fourth an alternative spelling of "bettor" and the fifth an infinitive verb ("to better oneself"). The weights represent the joint probability of the surface form, the base form and the tag occurring together.
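For example, the weight 8.430664 on the better:good AJC reading corresponds to a relative frequency of e^-8.43, roughly 2.2 × 10^-4, ie. about 22 occurrences of that particular analysis per 100,000 tokens.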
 
Changed:
<
<
Since the BNC contains both positive and superlative readings for many adjectives, it should be possible to build a transducer that converts between those cases. First we'll build a filter that will let through analysis strings that contain anything followed by either of the adjective tags. The simplest way to do that is with hfst-regexp2fst. The regular expression may be saved to a file and given as an argument to hfst-regexp2fst, or directly echoed to it via the command line:
>
>
Since the BNC contains both comparative and superlative readings for many adjectives, it should be possible to build a transducer that converts between those cases. First we'll build a filter that will let through analysis strings that contain anything followed by either of the adjective tags. The simplest way to do that is with hfst-regexp2fst. The regular expression may be saved to a file and given as an argument to hfst-regexp2fst, or directly echoed to it via the command line:
 
Changed:
<
<
echo '?* "AJ0"' | hfst-regexp2fst > positive_filter.hfst
>
>
echo '?* "AJC"' | hfst-regexp2fst > comparative_filter.hfst
 
Changed:
<
<
The expression ?* "AJ0" accepts anything, followed by the tag "AJ0". If we save the result to positive_filter.hfst, we may now compose the morphology with that to get our result:
>
>
The expression ?* "AJC" accepts anything, followed by the tag "AJC". If we save the result to comparative_filter.hfst, we may now compose the morphology with that to get our result:
 
Changed:
<
<
hfst-compose -1 bnc.hfst -2 positive_filter.hfst > positive_adjectives.hfst
>
>
hfst-compose -1 bnc.hfst -2 comparative_filter.hfst > comparative_adjectives.hfst
 

Let's see if the result looks right. hfst-fst2strings can be used to extract a random sample from the transducer like this:

Changed:
<
<
hfst-fst2strings -r 10 positive_adjectives.hfst
>
>
hfst-fst2strings -r 10 comparative_adjectives.hfst
 

One run looked like this:

Changed:
<
<
gsi:gsi AJ0
ij:ij AJ0
dx2-based:dx2-based AJ0
f-86:F-86 AJ0
trans-pennine:Trans-Pennine AJ0
nkwain:nkwain AJ0
nv-a850:nv-a850 AJ0
yzordderrexian:yzordderrexian AJ0
whig-liberal:Whig-Liberal AJ0
wüppertal:Wüppertal AJ0
>
>
sparser:sparse AJC
dossier:dossy AJC
politer:polite AJC
grazier:grazy AJC
oder:od AJC
mer:m AJC
voider:void AJC
ureter:urete AJC
 

As you can see, there's a lot of garbage and very uncommon adjectives. In any corpus, uncommon words comprise most of the vocabulary.

Revision 2 (2015-02-02) - SamHardwick

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Line: 348 to 348
 

Renaming tags

Changed:
<
<

Using the morphology to generate a spell checker

>
>

Using a morphology to generate a spell checker

A misspelled word can be changed into a correctly spelled one in any number of combinations of atomic operations, like inserting a character somewhere, deleting one or substituting a character for another. Of course, it can also be changed into a correctly spelled word that wasn't intended. A simple approach to finding the most likely correct spelling is to consider a larger number of mistakes to be less likely, and to consider more common words to be more likely. If we were to compose a transducer that corrects mistakes weighted by number of changes with a morphology weighted by surface form, we'd get something that corrects spelling mistakes ranked by probability. Unfortunately the result would tend to be very large, since it would contain all the conceivable misspellings we'd want to consider.

hfst-ospell is a tool that performs composition during runtime between the misspelled word, the error model and the morphology. It has been packaged for use with word processors and web browsers, but in this example we'll just use it on the command line.

One way to describe all the mistakes we want to be able to correct is with so-called replace rules:

"a" -> ["b" | "c" | "d" | ...]::1.0 ,,
"b" -> ["a" | "c" | ... ]::1.0 ,,
etc.

Here the left-hand side of each rule represents the error, and the right-hand side represents corrections. These may be given individual weights (perhaps calculated from an error corpus); in this case each rule simply has a weight of 1.0. The two commas at the end, ,,, indicate that the rules operate in parallel, ie. not one after the other, but all together.

This is sufficient to obtain a ranking of results, but since the rules can operate anywhere, there will be a very large number of results. The ruleset may be restricted to only consider some number of changes by adding a special marker to the correction rule, eg. "<CORR>", and composing the ruleset with a filter that allows only a certain number of corrections to pass. In addition (or alternatively), hfst-ospell may be directed to restrict its search by weight.

Building a tokenizer

The pattern matching application hfst-pmatch can be used to build a reasonable tokenizer by itself:

Define nonword Whitespace | Punct;
Define word LC([nonword | #]) [\nonword]+ RC([nonword | #]);
Define token word | Punct;
Define TOP token EndTag(token) 0:"\n";

This ruleset, compiled with hfst-pmatch2fst and run with hfst-pmatch --extract, produces only the tokens without whitespace:

He took his vorpal sword in hand:

<token>He</token>
<token>took</token>
<token>his</token>
<token>vorpal</token>
<token>sword</token>
<token>in</token>
<token>hand</token>
<token>:</token>

Besides refinements like allowing hyphens inside words, we may want to consider some multiword units to be single tokens. For example, for most purposes it would make sense to consider "New York" or "car park" to be single tokens.

 

Revision 1 (2015-01-15) - SamHardwick

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="HfstHome"

HFST Application tutorial

Introduction

This tutorial has two aims: to demonstrate different approaches to constructing and utilizing a morphology (corpus-based, dictionary-based, paradigm-based, etc.) for different languages (English, Finnish, Swedish, Sámi, etc.), and in the process to acquaint the reader with HFST tools and APIs. The end result should be a flexible base for building more advanced resources, of which we will give some simple demonstrations.

Building an English morphology from a tagged corpus

In this section we will use a large, tagged corpus of English to generate a weighted morphology. The corpus is processed with conventional scripting; we extract surface forms, base forms, tags, and calculate frequencies. hfst-strings2fst is used to generate the morphology itself.

This approach doesn't generally require a high standard of linguistic knowledge (or even much knowledge of the language in question), but it does require some proficiency with programming. The quality of the end result obviously depends on the quality of the corpus, and the various choices made therein will be adopted by the morphology. With a large corpus the coverage will generally be very good, with the exception that some paradigms will be incomplete (eg. plausible derivations which don't happen to appear in the corpus will be absent from the morphology). With some effort it is possible to detect and remedy these absences.

As an example we will use the British National Corpus (BNC). This is a very large corpus that has recently become free of charge to use. It is formatted in XML, which is a common choice. We don't provide a general approach to corpus processing here, but dealing with the BNC brings up many issues and should provide a decent guide. See http://www.natcorp.ox.ac.uk/ for information about the BNC.

Corpus preprocessing

Element and tag selection

The BNC is a rather richly categorised and analysed corpus. Texts come with metadata, sub-elements and identification of multi-word units. Much of that information will be thrown away at this stage, but there are a few important decisions to be made.

Firstly, there are two types of text: that collected from written sources, and that transcribed from spoken language (about 10% of the total). The fundamental unit of both is the word, represented by the XML tag <w>. If we decide to incorporate both, it is sufficient to simply include every <w> tag in our process. If we want to exclude the spoken sources, we must inspect the XML a little more closely. The corpus is one element which contains a sequence of elements corresponding to the .xml files in the BNC XML distribution. Each of these contains either a <wtext> (written text) element or an <stext> (spoken text) element. So on a text-by-text basis, your preprocessing script may discard the <stext> elements as desired.

Secondly, there are two types of tagging: coarse part-of-speech and a 57-tag (plus 4 punctuation tags) system known variously as C5, the BNC Basic Tagset and CLAWS4 (which is in fact the name of a tagger used to tag the corpus). The coarse tags are under the attribute "pos" for each word element, and the C5 tags under the attribute "c5".

Thirdly, there are ambiguous tags and punctuation tags. In the corpus there appear an additional 30 tags of the form "AJ0-AV0". This indicates that the correct tag is probably either AJ0 (adjective) or AV0 (adverb), but that the tagging software was not certain enough to make a choice. The tag appearing in the first position in such cases is considered to be the more likely one. For the morphology builder, there are several reasonable choices:

  1. discard the entire word (hopefully the corpus is large enough without these cases)
  2. count the word as representing both tags (for the sake of maximum coverage, though this will distort weights, especially surface weights)
  3. count the word as representing the more likely case
  4. use the ambiguity tags as they are (this only makes sense if you essentially intend to clone the corpus in another form)

Punctuation appears in <c> elements rather than <w> elements, so there is a straightforward choice to either include them or not in your material.

Fourthly, there are multiword elements. These are tagged <mw> and contain several <w> elements. Handling these properly has relevance to some applications, eg. tokenization.

Scripting

In general the reader needs some experience with programming, but we provide some code snippets as examples. The following example is in Python. It reads the BNC XML files and prints to standard output a format readable by hfst-strings2fst, namely the surface string, followed by a colon, followed by the analysis string, followed by a tab character, followed by the weight as a floating point number. The output may be directed to a file or piped directly to hfst-strings2fst, as in:


python parse_bnc.py | hfst-stringst2fst --disjunct-strings --multichar-symbols="c5_tags.txt" | hfst-minimize > bnc.hfst

The meaning of --disjunct-strings is to produce a single transducer which is the disjunction of all the lines produced by our script. Without that option, hfst-strings2fst produces an archive containing one transducer per line. --multichar-symbols is a way to tell hfst-strings2fst to consider the tags in the BNC to be single symbols. It takes one symbol per line. This is not necessary, but it does simplify further processing of the corpus. It guarantees that operations not explicitly involving the full tag name won't affect them, eg. the first character in AJ0 is A, which may be a subject of a rule we want to write, but AJ0 probably won't be confused with anything else. If your tags are likely to be confused with something else (like N for noun), it may be a good idea to wrap them in some kind of markup, like <N>.
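For concreteness, here is a sketch of both files (the excerpts are hypothetical, but the tags are real C5 tags and the lines follow the surface:analysis, tab, weight format described above). A few lines of c5_tags.txt, one multicharacter symbol per line:

AJ0
AJC
AJS
NN1
NP0
AV0

And a couple of lines of the kind the parsing script prints:

better:good AJC	8.430664
better:well AV0	8.845703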

hfst-minimize minimizes the transducer. This is an operation that takes some time, but makes the result smaller and faster to operate on in the future.

Here's the parsing script (the full corpus takes quite a while to process):

import os # for manipulating the file system
from math import log # for calculating weights
from lxml import etree # for xml processing

bnc_dir = "/path/to/bnc"

latin_alphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
bad_tags = set(["ZZ0", "UNC"]) # Uncategorised tokens and alphabetic symbols

# We collect all the xml filenames
corpusfiles = []
for (dirname, dirs, files) in os.walk(os.path.abspath(bnc_dir)):
    # make sure we only get xml files
    for filename in filter(lambda x: x.endswith(".xml"), files):
       corpusfiles.append(os.path.join(dirname, filename))

total_tokens = 0

# various dicts to collect frequency information
tag_count = {} # counts of the tags
surface_count = {} # counts of surface forms
surfacetag_count = {} # counts of the conjunction of surface and tag
analysis_count = {} # counts of surface, baseform and tag together

# iterating through the corpus
for filename in corpusfiles:
    tree = etree.parse(filename)
    text = tree.find("wtext")
    if text == None:
        # There was no wtext element, so it's a spoken text element.
        # This is where you may decide to skip processing this element.
        text = tree.find("stext")
        # if we still have nothing, something's wrong
        assert text != None
    # Walk through the text element in document order.
    # To include punctuation, go through the "c" elements too.
    for word in text.getiterator("w"):
        tag = word.get("c5")
        try:
            surface = word.text.strip() # omit whitespace between tokens
            baseform = word.get("hw").strip()
        except AttributeError:
            # The corpus may contain blank tokens, these can't be stripped
            # and raise an AttributeError. We discard them.
            continue
        # This is a good time to inspect the word for suitability and do
        # some processing. For example, omit numerical tokens (CRD, ORD),
        # alphabetical symbols (ZZ0), uncategorised and probably non-lexical
        # entries (UNC). The characters present may also be restricted to
        # those present in English. Surface forms may be lower-cased unless the
        # tag is for a proper noun (NP0). We also need to escape :-characters
        # so hfst-strings2fst isn't confused by them.

        def suitable(surface, baseform, tag):
            if tag in bad_tags:
                return False
            if tag != "NP0":
                # There are some dubious cases due to tokenization issues;
                # to avoid these, we demand words start with latin letters
                # unless they're proper nouns. This also deals with the many
                # non-spelled numbers in the corpus.
                if surface[0] not in latin_alphabet:
                    return False
            return True

        def process_surface(surface, baseform, tag):
            if tag != "NP0":
                surface = surface.lower()
            return surface.replace(":", "\:")

        def process_base(surface, baseform, tag):
            return baseform.replace(":", "\:")
        
        def process_tag(surface, baseform, tag):
            # deal with ambiguous tags
            if "-" in tag:
                # take the more likely reading
                return tag[:tag.index("-")]
            return tag
            
        if not suitable(surface, baseform, tag):
            continue
        surface = process_surface(surface, baseform, tag)
        baseform = process_base(surface, baseform, tag)
        tag = process_tag(surface, baseform, tag)
        total_tokens += 1
        tag_count[tag] = tag_count.setdefault(tag, 0) + 1
        surface_count[surface] = surface_count.setdefault(surface, 0) + 1
        surfacetag_count[(surface, tag)] = \
            surfacetag_count.setdefault((surface, tag), 0) + 1
        analysis_count[(surface, baseform, tag)] = \
            analysis_count.setdefault((surface, baseform, tag), 0) + 1

# Now we calculate weights and print each item we wish to include in the
# morphology. Weights can be calculated by various principles - the frequency
# of the surface form, the frequency of the surface form with that particular
# analysis, or that particular tag, or the frequency of the base form.
# These choices have obvious implications for applications. Here we use the
# surface form alone. In all cases the negative of the natural logarithm of
# the frequency is used, resulting in a so-called log weight.

for (surface, baseform, tag) in analysis_count.keys():
    weight = -1*log(float(surface_count[surface])/total_tokens)
    print surface.encode("utf-8") + ":" + baseform.encode("utf-8") + " " + tag + "\t" + str(weight)

Weight selection and calculation

For some applications, such as spell checking, the main question is "how likely is this surface string in this corpus?" The answer is given by the ratio of the value in surface_count to the total number of tokens. For ranking different readings by base form, tag, or both, values from the other collections may be used.

Our finite-state operations don't deal with true probabilities (for various reasons), but with non-negative weights which represent dimensionless penalties for each reading. The higher the weight, the less likely the reading. In this case we used the negative value of the natural logarithm function to convert frequencies to weights:


weight = -1*log(float(surface_count[surface])/total_tokens)

Frequencies are between 0 and 1, so the values from the log function will be negative. To get positive weights, we multiply by negative one.
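For example, a surface form seen 5,000 times in a corpus of 100 million tokens has the relative frequency 5000/100000000 = 5 × 10^-5, which gives the weight -ln(5 × 10^-5) ≈ 9.9.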

Using tags to operate on the morphology

With command-line tools

Now that we have our morphology, let's explore it a bit with some command-line tools.

Consider adjectives. The BNC tags adjectives according to degree of comparison: they can be positive (good, AJ0), comparative (better, AJC) or superlative (best, AJS). For example the entry for "better" in our morphology is (discarding ambiguous tags):

better   better NN1   7.916992
better   better VVB   7.916992
better   better VVI   7.916992
better   good AJC   7.916992
better   well AV0   7.916992

The first reading represents an alternative spelling of "bettor", the second a finite verb ("to better oneself"), the third is the comparative adjective and the fourth is an adverb. Weights are all equal because we used surface strings for weighting.

Since the BNC contains both positive and superlative readings for many adjectives, it should be possible to build a transducer that converts between those cases. First we'll build a filter that will let through analysis strings that contain anything followed by either of the adjective tags. The simplest way to do that is with hfst-regexp2fst. The regular expression may be saved to a file and given as an argument to hfst-regexp2fst, or directly echoed to it via the command line:

echo '?* "AJ0"' | hfst-regexp2fst > positive_filter.hfst

The expression ?* "AJ0" accepts anything, followed by the tag "AJ0". If we save the result to positive_filter.hfst, we may now compose the morphology with that to get our result:

hfst-compose -1 bnc.hfst -2 positive_filter.hfst > positive_adjectives.hfst

Let's see if the result looks right. hfst-fst2strings can be used to extract a random sample from the transducer like this:

hfst-fst2strings -r 10 positive_adjectives.hfst

One run looked like this:

  gsi:gsi AJ0
  ij:ij AJ0
  dx2-based:dx2-based AJ0
  f-86:F-86 AJ0
  trans-pennine:Trans-Pennine AJ0
  nkwain:nkwain AJ0
  nv-a850:nv-a850 AJ0
  yzordderrexian:yzordderrexian AJ0
  whig-liberal:Whig-Liberal AJ0
  wüppertal:Wüppertal AJ0

As you can see, there's a lot of garbage and very uncommon adjectives. In any corpus, uncommon words comprise most of the vocabulary.

If we repeat the previous for superlatives (AJS), we can extract the shared subset, ie. the intersection, of base forms that have both a positive and superlative reading. This requires composing the collections of positive and superlative adjectives with a rule that removes the tag:

echo '[ { } "AJ0"] -> ""' | hfst-regexp2fst | hfst-compose -1 positive_adjectives.hfst -2 - | hfst-project -p output > positive_baseforms.hfst

Here we pipe the output of hfst-regexp2fst directly to hfst-compose, and with -2 - tell it to take the second argument from the pipe. The { } in the expression is for the space that goes between the base form and the tag. hfst-project removes the surface form and leaves just the analysis side (in this case, the lemma). positive_baseforms.hfst has strings like

  scudéry-influenced
  chicago-born
  crankish

Having done the same with superlatives, we can take the intersection:

hfst-intersect -1 positive_baseforms.hfst -2 superlative_baseforms.hfst > intersection.hfst

It contains more typical adjectives:

  grand
  healthy
  vile
  urgent
  murky
  fruity
  fussy
  loathsome
  dingy
  itchy

We need one more resource for our positive-to-superlative converter: a generating morphology that will produce the desired surface forms from a given base form and tag. That's simply the inverse of our original morphology:

hfst-invert bnc.hfst > bnc-generate.hfst

Now we can add to the end of our intersection a space and a rule that changes positive tags to superlative ones, and compose that with the morphology on the left side and with its inverse on the right:

echo '{ } "AJ0":"AJS"' | hfst-regexp2fst | hfst-concatenate -1 intersection.hfst -2 - | hfst-compose -1 bnc.hfst -2 - | hfst-compose -1 - -2 bnc-generate.hfst > positive_to_superlative.hfst

Our result has pairs like this:

  blind:blindest
  vile:vilest
  classy:classiest
  thirsty:thirstiest
  husky:huskiest
  gaudy:gaudiest
  jammy:jammiest
  pithy:pithiest

With hfst-xfst and hfst-pmatch

(Commentary to come)

Define bnc @bin"bnc.hfst";
Define adjectives bnc .o. [?* { } "AJ0"|"AJS"];
Define positive_baseforms [[bnc .o. [?* { } "AJ0"] ] .o. [ { } "AJ0"] -> ""].l;
Define superlative_baseforms [[bnc .o. [?* { } "AJS"] ] .o. [ { } "AJS"] -> ""].l;
Define intersection positive_baseforms & superlative_baseforms;
Define inverted_bnc [bnc].i;
Define pos_to_super [bnc .o. [intersection { } "AJ0":"AJS"]] .o. inverted_bnc;
Define TOP pos_to_super;

With Python via the HFST API

import libhfst

def load_transducer(filename):
    return libhfst.HfstTransducer(libhfst.HfstInputStream(filename))

def lookup(transducer, istr):
    retval = []
    for pair in libhfst.vectorize(transducer.lookup(istr)):
        output = ''.join(pair[1])
        weight = pair[0]
        retval.append(output + " " + str(weight))
    return retval

def filter_by_tag(t, tag):
    t.compose(regex.compile('?* { } "' + tag + '"'))
    return t

def remove_tag(t, tag):
    t.compose(regex.compile('[{ } "' + tag + '"] -> ""'))
    t.output_project()
    return t

bnc = load_transducer("bnc.hfst")
regex = libhfst.XreCompiler()
adjectives = bnc.compose(regex.compile('?* { } ["AJ0"|"AJS"]'))
positive_baseforms = remove_tag(filter_by_tag(
    libhfst.HfstTransducer(adjectives), "AJ0"), "AJ0")
superlative_baseforms = remove_tag(filter_by_tag(
    libhfst.HfstTransducer(adjectives), "AJS"), "AJS")
intersection = positive_baseforms.intersect(superlative_baseforms)
inverted_adjectives = libhfst.HfstTransducer(adjectives)
inverted_adjectives.invert()
tag_super_to_pos = libhfst.HfstTransducer(intersection)
tag_super_to_pos.concatenate(regex.compile('{ } "AJS":"AJ0"'))
super_to_pos = libhfst.HfstTransducer(adjectives)
super_to_pos.compose(tag_super_to_pos)
super_to_pos.compose(inverted_adjectives)
print lookup(super_to_pos, "best")
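
# To save the result for later reuse, we could write it out with the
# stream API, eg. (a sketch):
# outstream = libhfst.HfstOutputStream("super_to_pos.hfst",
#                                      libhfst.TROPICAL_OPENFST_TYPE)
# outstream.redirect(super_to_pos)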

Renaming tags

Using the morphology to generate a spell checker

 