hfst-train-tagger
The tagger programs are part of the HFST (Helsinki Finite-State Transducer Technology) toolkit distribution. hfst-train-tagger builds a weighted finite-state part-of-speech tagger from tagged training data. This tool is licensed under GNU GPL version 3 (other licences may be available on request). The licence can be found in the file COPYING.
Installation
Configure hfst using --enable-tagger.
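For example, a typical autotools build from the HFST source directory might look like this:

./configure --enable-tagger
make
make install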
Usage
Note: currently hfst-train-tagger is called hfst-train-tagger.bat on Windows.
Usage: hfst-train-tagger [OPTIONS...] [INFILE]
Compile training data file into an hfst part-of-speech tagger.
Common options:
-h, --help Print help message
-V, --version Print version info
-v, --verbose Print verbosely while processing
-q, --quiet Only print fatal errors and requested output
-s, --silent Alias of --quiet
Input/Output options:
-i, --input=INFILE Read input transducer from INFILE
-o, --output=OUTFILE Write output transducer to OUTFILE
Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>
hfst-train-tagger uses a training data file and a tagger configuration file to train a tagger. The taggers resemble HMM models. The probabilities of the possible tags of a word are decided on the basis of the surrounding tag and word context.
Each tagger consists of submodels whose combined effect determines the probabilities of the tags. The submodels are specified in a configuration file.
Tagging is accomplished using hfst-tag.
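A minimal training run might look like the following sketch (the file names are hypothetical, and only the documented -i and -o options are used):

hfst-train-tagger -i training_data.txt -o my_tagger

The resulting model is then applied to new text with hfst-tag; see hfst-tag --help for its exact invocation.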
Specifying a submodel
Hfst taggers consist of submodels, each of which assigns some weight to every possible tag of a word in a sentence. The most likely tagging of the words in a sentence is the one which maximizes the combined probabilities of the tags of the individual words in the sentence.
Both surrounding words and tags can be used when determining the probabilities. E.g. a basic second-order HMM tagger consists of three submodels (a sketch of how their estimates can be combined follows the list):
- Tag given the two preceding tags
- Tag given the preceding tag
- Tag regardless of context
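As an illustration only, using the weights from the configuration example below, the combined estimate for a tag in context could be read as the linear interpolation

0.75 * P(T_i | T_i-2, T_i-1) + 0.15 * P(T_i | T_i-1) + 0.1 * P(T_i)

(the exact way the submodel scores are combined is determined by the weights given in the configuration file).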
In an hfst tagger, a conditional probability such as

P(T_i-1, T_i | T_i-1)

is specified as the quotient of two counts

count(T_i-1, T_i)

and

count(T_i-1)

computed from a training corpus. NONE is an abstract symbol which implies that the word form or the tag in a given position is disregarded.
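For instance, with hypothetical counts: if the tag DT occurs 1000 times in the training corpus and the tag bigram DT NN occurs 600 times, the submodel P(T_i-1, T_i | T_i-1) assigns

count(DT, NN) / count(DT) = 600 / 1000 = 0.6

to the tag NN when the preceding tag is DT.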
Training data format
The training data consists of lines with a word (or other token, such as a comma) and a tag separated by a tab. Sentences are separated by separator lines consisting of two || symbols separated by a tab. If your model is of order n, you need n separator lines between sentences; e.g. a second-order HMM model needs two separator lines.
An example of training data from the Penn Treebank:
. .
|| ||
|| ||
Shorter JJR
maturities NNS
are VBP
considered VBN
a DT
sign NN
of IN
rising VBG
rates NNS
because IN
portfolio NN
managers NNS
can MD
capture VB
higher JJR
rates NNS
sooner RB
. .
|| ||
|| ||
The DT
average JJ
Configuration file format
Configuration files are always named hfst_tagger_config. When hfst-train-tagger is run in a given directory, that directory is searched for a configuration file.
Configuration files consist of lines with four fields:
- The name of the model.
- The numerator simplifier.
- The denominator simplifier.
- A weight for the submodel.
P(T_i-2, T_i-1, T_i | T_i-2, T_i-1) NONE TAG NONE TAG NONE TAG NONE TAG NONE TAG NONE NONE 0.75
P(T_i-1, T_i | T_i-1) NONE TAG NONE TAG NONE TAG NONE NONE 0.15
P(T_i) NONE TAG NONE NONE 0.1
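For example, one way to read the second line above: it defines the submodel P(T_i-1, T_i | T_i-1). The numerator simplifier NONE TAG NONE TAG keeps the tags (but not the word forms) at positions i-1 and i, the denominator simplifier NONE TAG NONE NONE additionally disregards the tag at position i, and the submodel receives weight 0.15. The resulting estimate is count(T_i-1, T_i) / count(T_i-1).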
--
MiikkaSilfverberg - 2012-08-28