hfst-train-tagger

The tagger programs are part of the HFST (Helsinki Finite-State Technology) finite-state toolkit distribution; hfst-train-tagger is a tool that compiles a weighted finite-state part-of-speech tagger from annotated training data. This tool is licensed under the GNU GPL version 3 (other licenses may be available on request). The license can be found in the file COPYING.

Installation

Configure hfst using --enable-tagger.
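
For example, assuming a standard autotools source build of HFST (installation prefix and other configure options may vary):

  ./configure --enable-tagger
  make
  make install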

Usage

Note: currently hfst-train-tagger is called hfst-train-tagger.bat on Windows.

Usage: hfst-train-tagger [OPTIONS...] [INFILE]
Compile training data file into an hfst part-of-speech tagger.

Common options:
  -h, --help             Print help message
  -V, --version          Print version info
  -v, --verbose          Print verbosely while processing
  -q, --quiet            Only print fatal errors and requested output
  -s, --silent           Alias of --quiet
Input/Output options:
  -i, --input=INFILE     Read input transducer from INFILE
  -o, --output=OUTFILE   Write output transducer to OUTFILE


Report bugs to <hfst-bugs@helsinki.fi> or directly to our bug tracker at:
<https://sourceforge.net/tracker/?atid=1061990&group_id=224521&func=browse>
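
A minimal invocation reads the training data from a file and writes the trained tagger to an output file (the file names here are hypothetical):

  hfst-train-tagger -i penn.train -o penn.tagger

The directory where the command is run must contain a tagger configuration file (see Configuration file format below).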

hfst-train-tagger uses a training data file and a tagger configuration file to train a tagger. The taggers resemble hidden Markov models (HMMs): the probabilities of the possible tags of a word are decided on the basis of the surrounding tag and word context.

Each tagger consists of submodels whose combined effect determines the probability of each tag. The submodels are specified in a configuration file.

Tagging is accomplished using hfst-tag.

Specifying a submodel

HFST taggers consist of submodels, each of which assigns a weight to every possible tag of a word in a sentence. The most likely tagging of a sentence is the one that maximizes the combined probabilities of the tags of the individual words in the sentence.

Both surrounding words and tags can be used when determining the probabilities. E.g. a basic second-order HMM tagger consists of three submodels:

  1. Tag given the two preceding tags $P(t_2 | t_1,\ t_0)$
  2. Tag given the preceding tag $P(t_2 | t_1)$
  3. Tag regardless of context $P(t_2)$
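
As a sketch of how such submodels can work together (assuming weighted linear interpolation, a common smoothing scheme, with weights $\lambda_1,\ \lambda_2,\ \lambda_3$ corresponding to the fourth field of the configuration file described below; the exact combination used by the implementation may differ), the combined probability of a tag would be

$P_{combined}(t_2 | t_1,\ t_0) \approx \lambda_1\ P(t_2 | t_1,\ t_0) + \lambda_2\ P(t_2 | t_1) + \lambda_3\ P(t_2)$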

In an hfst tagger, the conditional probability $P(t_2 | t_1,\ t_0)$ is specified as the quotient of two counts, $C(NONE,\ t_0,\ NONE,\ t_1,\ NONE,\ t_2)$ and $C(NONE,\ t_0,\ NONE,\ t_1,\ NONE,\ NONE)$, computed from a training corpus. $NONE$ is an abstract symbol which signifies that the word form or the tag in a given position is disregarded.
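
Written out, this is the relative-frequency estimate

$P(t_2 | t_1,\ t_0) = \dfrac{C(NONE,\ t_0,\ NONE,\ t_1,\ NONE,\ t_2)}{C(NONE,\ t_0,\ NONE,\ t_1,\ NONE,\ NONE)}$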

Training data format

The training data consists of lines with a word (or other token, such as a comma) and a tag separated by a tab. Sentences are separated by separator lines consisting of two || symbols separated by a tab. If your model is of order n, you need n separator lines between sentences; e.g. a second-order HMM model needs two separator lines between sentences.

An example of training data from the Penn Treebank:

.       .
||      ||
||      ||
Shorter JJR
maturities      NNS
are     VBP
considered      VBN
a       DT
sign    NN
of      IN
rising  VBG
rates   NNS
because IN
portfolio       NN
managers        NNS
can     MD
capture VB
higher  JJR
rates   NNS
sooner  RB
.       .
||      ||
||      ||
The     DT
average JJ
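
As a sketch, a corpus in word/TAG format (one sentence per line; the input format and the file names are assumptions for illustration) can be converted into this training data format for a second-order model with awk:

  # print each word/TAG token as a tab-separated line and emit two
  # separator lines after every sentence (for a second-order model)
  awk '{ for (i = 1; i <= NF; ++i) { split($i, a, "/"); print a[1] "\t" a[2] }
         print "||\t||"; print "||\t||" }' corpus.wtag > corpus.train

Tokens whose word form itself contains a / character would need special handling.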

Configuration file format

Configuration files are always named hfst_tagger_config. When hfst-train-tagger is run in a given directory, that directory is searched for a configuration file.

Configuration files consist of lines with four fields:

  1. The name of the model.
  2. The numerator simplifier.
  3. The denominator simplifier.
  4. A weight for the submodel.

For example, the three submodels of the second-order HMM tagger described above can be specified as follows:

P(T_i-2, T_i-1, T_i | T_i-2, T_i-1)     NONE TAG NONE TAG NONE TAG      NONE TAG NONE TAG NONE NONE     0.75
P(T_i-1, T_i | T_i-1)                   NONE TAG NONE TAG               NONE TAG NONE NONE              0.15
P(T_i)                                  NONE TAG                        NONE NONE                       0.1
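
The weights in the last column govern the relative contributions of the submodels; in this example they sum to one. With the configuration file in place, a complete training run (file names hypothetical) looks like:

  # hfst_tagger_config must be in the directory where the command is run
  hfst-train-tagger --verbose --input=penn.train --output=penn.tagger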

-- MiikkaSilfverberg - 2012-08-28