HFST: HFST Transducer Formats

General

This page discusses differences between different backend transducer formats and things that must be taken into consideration when converting between formats. Many differences relate to the symbols and alphabets of transducers. Conversion is always carried out through HFST's own transducer format, HfstBasicTransducer. In this way, there is a need for 2 x (N - 1) different conversion functions instead of N x (N - 1), where N is the number of different formats.

Some terminology: A transition of a transducer consists of an input and an output symbol. Internally, a symbol is a number, for instance 1 or 124. The symbol table of a transducer relates a number with a string and vice versa. A string is a way to represent the symbol, for instance "<epsilon>" or "a". The alphabet of a transducer is the set of all symbols that are known to the transducer. Different FST libraries use these terms differently and often they have only one data structure that functions as a symbol table and an alphabet at the same time.

OpenFst

An OpenFst transducer can have an input SymbolTable and an output SymbolTable, both of which can also be undefined. A SymbolTable is a set of string-number pairs and it serves as an alphabet and a symbol table at the same time. In OpenFst, it is possible to use a same number to mean different symbols in the input and output side of a transition. For example the internal transition "1:1" could mean "a:A". It is also possible to use just numbers in transducers without interpreting them as strings if the symbol tables have been left undefined. The number 0 is always reserved for the epsilon symbol.

SFST

An SFST transducer has one Alphabet, a data structure that serves as a symbol table and an alphabet and also contains a set all symbol pairs that occur in the transducer. The number 0 and corresponing string "<>" are always reserved for the epsilon symbol.

foma

A foma tranducer has one sigma, a data structure that serves as a symbol table and an alphabet. The following numbers and their corresponding strings are always reserved for use and included in the alphabet of a transducer:

number string explanation
0 "@_EPSILON_SYMBOL_@" The epsilon.
1 "@_UNKNOWN_SYMBOL_@" The unknown symbol, any symbol not known to the alphabet of the transducer.
2 "@_IDENTITY_SYMBOL_@" The identity symbol, an unknown symbol that must be the same on both sides of a transition.

HfstBasicTransducer

An HfstBasicTransducer has a set of strings that functions as the alphabet of the transducer. All HfstBasicTransducers on the same session share a common data structure that maps numbers to strings, a global symbol table. The same numbers and strings are reserved and always known as in a foma transducer, i.e. the epsilon, unknown and identity symbols.

Hfst optimized lookup transducer

...

HfstTransducer

An HfstTransducer can have six different implementation formats:

format name backend transducer
SFST_TYPE SFST
FOMA_TYPE foma
TROPICAL_OFST_TYPE OpenFst with tropical weights
LOG_OFST_TYPE OpenFst with logarithmic weights
HFST_OL HFST optimized lookup unweighted
HFST_OLW HFST optimized lookup weighted

A binary HfstTransducer consists of an HFST header and the backend implementation transducer.

All HfstTransducers of type "TROPICAL_OFST_TYPE", "LOG_OFST_TYPE" and "FOMA_TYPE" always have the same special symbols reserved in their symbol tables and alphabets as in a foma transducer:

number string explanation
0 "@_EPSILON_SYMBOL_@" The epsilon.
1 "@_UNKNOWN_SYMBOL_@" The unknown symbol, any symbol not known to the alphabet of the transducer.
2 "@_IDENTITY_SYMBOL_@" The identity symbol, an unknown symbol that must be the same on both sides of a transition.

An HfstTransducer of type "SFST_TYPE" has the following special symbols:

number string explanation
0 "<>" The epsilon.
1 "@_UNKNOWN_SYMBOL_@" The unknown symbol, any symbol not known to the alphabet of the transducer.
2 "@_IDENTITY_SYMBOL_@" The identity symbol, an unknown symbol that must be the same on both sides of a transition.

Nevertheless, an SFST_TYPE HfstTransducer hides this exception with the epsilon symbol under its interface which works as if it used the "@_EPSILON_SYMBOL_@" as a string for the epsilon symbol. This was considered easier than interfering with the SFST source code that has already epsilon string defined as "<>".

A TROPICAL_OFST_TYPE or LOG_OFST_TYPE HfstTransducer has only the input SymbolTable defined and all symbols on the input and output side use it. An alternative solution would have been to use both an input and an output SymbolTable, but this decision was already done a long time ago, so we stick to it.

Conversions between HfstTransducer and backend formats

When we want to write a binary HfstTransducer without the HFST header, i.e. in its plain backend format, some things must be taken into consideration before writing:

type write as note
TROPICAL_OFST_TYPE OpenFst Take a copy of the input SymbolTable and give it as the value of the output SymbolTable.
LOG_OFST_TYPE OpenFst The same as above.
FOMA_TYPE foma nothing
SFST_TYPE SFST nothing

Since OpenFst assumes that a transducer has an input and an output symbol table (unless it has neither), we take a copy of the input SymbolTable and give it as the value of the output SymbolTable before writing. This will make sure that all OpenFst functionalities work correctly.

When we want to read a backend transducer, i.e. a transducer with no HFST header, some things must also be done:

type read as note
OpenFst TROPICAL_OFST_TYPE Convert OpenFst transducer into an HfstBasicTransducer and then into TROPICAL_OFST_TYPE.
OpenFst LOG_OFST_TYPE See above.
foma FOMA_TYPE nothing
SFST SFST_TYPE Convert SFST transducer into an HfstBasicTransducer and then into SFST_TYPE.

When we are reading an OpenFst transducer with no HFST header, it is possible that it has separate input and output SymbolTables or that it has no input or output SymbolTable or that epsilon is coded differently than "@_EPSILON_SYMBOL_@". It is also possible that numbers 1 and 2 are reserved for other use than "@_UNKNOWN_SYMBOL_@" and "@_IDENTITY_SYMBOL_@", which can also happen when we are reading an SFST transducer with no HFST header. Since conversion into an HfstBasicTransducer is carried out using the symbol strings, not numbers, we can be sure that all special symbols and numbers are used in the right way.

One open question still remains. When we write a transducer in its plain backend format, special symbols unknown and identity are included in the symbol table and alphabet. However, in SFST and OpenFST they have no specail meaning. It is not sure if they sould be removed trom the symbol table and alphabet, unless they appear in the transitions of the transducer.

An issue with foma

Foma writes its binary transducers in gzipped format using the gz tools. However, we experienced problems when trying to write to standard output or read from standard in with gz tools (foma tools do not write to or read from standard streams). So we choose to write, and accordingly read, foma transducers unzipped when writing or reading binary HfstTransducers of FOMA_TYPE. As a result, when we write an HfstTransducer of FOMA_TYPE in its plain backend format, the user must gzip it themselves before it can be used by foma tools. Similarily, a foma transducer must be gunzipped before it can be read by HFST tools.

Suppose we have a foma transducer transducer.foma. The name of the file must be appended a .gz extension so that the program 'gunzip' knows it is a zipped file. The commands

mv transducer.foma transducer.foma.gz
gunzip transducer.foma.gz

overwrite the original file transducer.foma with an unzipped version of the same file. Now the file can be used by HFST.

Suppose we have an FOMA_TYPE HFST transducer transducer.hfst. The command

gzip transducer.hfst

will create a file transducer.hfst.gz that can be used by foma.


-- ErikAxelson - 2011-02-03
Topic revision: r1 - 2011-02-03 - ErikAxelson
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback