HFST: Using weights

Weights

From end-user perspective, weight tells how probable a word or its analysis is. The weight can be thought as a penalty, i.e. words/analyses with a bigger weight are less probable. Accordingly, when there are several analyses for a word, they are printed in ascending order so that the most probable ones come first. It is possible to weight lexemes or grammatical rules, making it easier to disambiguate among several possible analyses for a given word. Of course weights can also be used in generating word forms.

For weights, we use the tropical semiring. When there are several paths or transitions that only differ by their weight, the tropical semiring chooses the one with the lowest weight. All HFST command line tools by default support weights. If weights are not specified anywhere, the tools just operate with zero weights. There are three back-end implementation format available for almost all HFST tools: sfst, openfst-tropical and foma, openfst-tropical being the weighted one and used by default.

Using weights in hfst-regexp2fst and hfst-xfst

The mechanism for adding weights in hfst-regexp2fst and hfst-xfst is the :: operator which can be used for assigning weights to individual transitions or to any regular expression in brackets, i.e.

a::weight
a:b::weight
[ any regular expression ]::weight

The weights are most often from the tropical semiring. The tropical weight is represented as a float, i.e. one or more digits that may be preceded by a minus or plus sign and followed by a comma followed by at least one digit. For example the regular expression

[ a b:c::0.5 d::0.3 ]::0.2

will produce a transducer that maps abd to acd with weight 0.5 + 0.3 + 0.2 = 1.0. In this example, we basically have a transition a:a with no weight followed by a transition b:c with weight 0.5 followed by transition d:d with weight 0.3 leading to a final state with weight 0.2. However, it is possible that operations that are called afterwards, e.g. minimization, modify the exact positions of weights in the transducer.

A more complex expression

[ [ foo:bar::-1.15 ]::+0.15 baz::0.5 ]::0.7

will yield a transducer that maps foobaz to barbaz with weight -1.15 + 0.15 + 0.5 + 0.7 = 0.2.

Note that using weights is possible only when using the implementation openfst-tropical (and basically openfst-log which is not very well supported). Inserting weights with unweighted implementations, i.e. sfst or foma, has no effect.

Using weights in hfst-lexc

Weights can be defined for individual entries and they can also used in regular expressions. See hfst-lexc.

Using weights in hfst-sfstpl2fst aka hfst-calculate

Currently weights are not supported.

Using weights in hfst-twolc

It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation. At this time it is only possible to compile weighted rules with zero weights. See hfst-twolc.

Using weights in hfst-strings2fst

The weight of a string can be given after the string separated by a tabulator. See hfst-strings2fst.

Using weights in hfst-txt2fst

In AT&T format, weights for transitions and final states can be given after the transition or final state line separated by a tabulator. In prolog format, weights can be given as last argument of compounds arc and final. See hfst-txt2fst.

Shortcomings and caveats

Negation

Negation is based on subtraction so it does not preserve weights. Usually this is not a problem but must be taken into consideration when using double negation. With unweighted transducers double negation produces an equivalent transducer, but with weighted ones the weights are lost. For example the expression

~[~[a:b::2.5]]

yields a transducer equivalent to a:b::0, not a:b::2.5.

Rules

Weights are not yet fully defined for all rule types. See hfst-xfst for examples of using weights in rules.

Containment

The containment operators $.[t] (contains t once) and $?[t] (contains t once or does not contain t) support weights. The operator $[t] (contains one or more ts) ignores weights, if you want to weight each occurrence of t with a weight of w, you can use the operator $::w[t]. Note that the containment operators use shortest matching and allow overlapping when matching the expressions. E.g.

$.[a a]

will not accept "aaa" since it contains two "aa" that overlap. Similarly, the weighted expression

$::1[a a]

will accept "aaa" but with weight 2 because there are two "aa" each with weight 1.

The expression

$.[a::3|[a a]::1]

will not accept "aa" because it contains two "a" according to shortest matching (that does not consider weights).

Determinization and minimization

All weighted transducers are not determinizable (and thus not minimizable), so weights must be encoded when performing determinization or minimization. Encoding means that weights are regarded as a part of transition symbol pairs and cannot be freely moved. In some cases this results in minimized transducers that are not fully minimal.

For example, the determinization algorithm does not halt if given input such as

0 1 a a 1
0 2 a a 2
1 1 b b 3
1 3 c c 5
2 2 b b 4
2 3 d d 6
3 5 e e 0
3 4 e e 0
4 0 
5 0

and will eventually run out of memory (example modified from Mohri, Efficient Algorithms for Testing the Twins Property, page 8). If weights are encoded, the algorithm produces the result

0 1 a a 1
0 2 a a 2
1 1 b b 3
1 3 c c 5
2 2 b b 4
2 3 d d 6
3 4 e e 0
4 0

where weights have effectively been ignored during determinization and determinization has only been applied to the parts of the transducer that do not have weights.

You can set weight encoding on in hfst-xfst with

set encode-weights ON

and in hfst-regexp2fst, hfst-determinize and hfst-minimize with --encode-weights.

Projection

When taking projection, the weights are copied as such. If an expression is divided into its input and output projections and they are combined with cross product, the weights will be duplicated. For example the expression

[a:b::5].u .x. [a:b::5].l

will yield a transducer equivalent to [a:b::10], not [a:b::5].

Another issue with projection is that it can produce weighted epsilon cycles. For example the expression

[[a:0::-1]*].l

will produce an epsilon cycle with a weight of minus one. When the epsilons are removed as part of determinization or minimization, it will result in infinite negative weights. (Positive weights are not a problem in tropical semiring because they are regarded as penalties and discarded in the first place.) See section below for more information:

Epsilon cycles with negative weights

Consider the transducer

0 0 ε ε -1
0 1 a a  0
1 0

Performing epsilon removal or determinization on it will theoretically produce the result

0 1 a a -INF
1 0

where INF is the negative infinity. Similarly, for minimization we get:

0 1 a a 0
1 -INF

In practice, INF will be the smallest float value that the environment where HFST is running supports. This is basically the correct result given the finite values of a float, but very slow to calculate. The above transducer will minimize in a couple of seconds, but adding more states will considerably slow down the algorithm. Epsilon cycles with negative weights are relatively rare, but can e.g. occur when creating rules and taking projection on the false side.

Nbest

hfst-fst2strings has the option

--n-best=N
which does not work if the transducer has loops with negative weights.

Equivalence

hfst-compare works poorly with weighted transducers due to precision issues.


-- ErikAxelson - 2013-10-14