HFST: Runtime Transducer Binary Format

The HFST Runtime API is separate from the HFST Tool Development APIs. The Runtime API is intended for using ready-made transducers.

Runtime License

The HFST Runtime API has been fully coded in the HFST project and contains no external code. All rights belong to the University of Helsinki. The HFST Runtime API is licensed under the Apache license. Other licenses are negotiable.

Functions

See HfstOptimizedLookupFormat for function definitions.

Runtime Transducer Binary Format

Type definitions

typedef int32 byte_order_marker
typedef int32 version_number
typedef int16 symbol_number
typedef int16 input_symbol_number
typedef int16 symbol_pair_number
typedef int32 transition_index_number
typedef int32 transition_number
typedef int32 truth_value
typedef float (this is 4 bytes) weight

Byte-level description

The binary file has a header which states:

  • if the transducer has been created on a platform with the same endianness as the current platform (byte_order_marker).
  • the RuntimeTransducer binary format version (version_number)
  • if the transducer is deterministic (truth_value).
  • if the transducer is minimal (truth_value).
  • if the transducer is cyclic (truth_value).
  • if the transducer is weighted (truth_value).
  • the number of symbols used in the transducer (symbol_number).
  • the number of input symbols (input_symbol_number).
  • the number of symbol pairs (symbol_pair_number).
  • the size of the transition index table (transition_index_number).
  • the size of the transition table (transition_number).

Following the header, there is a table of symbols referred to in the transducer. All values of the type symbol_number refer to this symbol table.

The symbol table is followed by a table of the input symbols of the transducer. Each entry in the input symbol table is of the type symbol_number. All values of the type input_symbol_number refer to the input symbol table.

After the input symbols, a table of symbol pairs follows. Each pair consist of two values of the type symbol_number. All values of type symbol_pair_number refer to the symbol pair table.

The states and transitions of the transducer are encoded by two tables, namely the transition index table and the transition table.

Given a particular state in the transducer, there is a position in the transition index table, where the transition indexes of the state begin. Each entry in the transition index table is a pair consisting of an input symbol (input_symbol_number) and an index number (transition_index_number). The index number refers to a position in the transition table.

A state, whose transition indexes begin at position M in the transition index table, has transitions with a particular input symbol n (input_symbol_number), iff the input symbol at position M+n+1 is equal to n. The transitions for input symbol n are found in the transition table beginning at the position given by the index number (transition_index_number).

The transition table consists of pairs, where the first member of a pair is a symbol pair (symbol_pair_number) and the second member of the pair is the transition target (transition_number). The transition target refers to a position in the transition index table.

RUNTIME_TRANSDUCER: HEADER SYMBOL_TABLE INPUT_SYMBOL_NUMBER_TABLE SYMBOL_PAIR_TABLE 
                    TRANSITION_INDEX_TABLE TRANSITION_TABLE 

HEADER: BYTE_ORDER_MARK VERSION_NUMBER DETERMINISTIC MINIMAL CYCLIC WEIGHTED 
        NUMBER_OF_ALL_SYMBOLS  NUMBER_OF_INPUT_SYMBOLS NUMBER_OF_SYMBOL_PAIRS 
        SIZE_OF_TRANSITION_INDEX_TABLE SIZE_OF_TRANSITION_TABLE

SYMBOL_TABLE: SYMBOL[NUMBER_OF_ALL_SYMBOLS]

INPUT_SYMBOL_NUMBER_TABLE: SYMBOL_NUMBER[NUMBER_OF_INPUT_SYMBOLS]

SYMBOL_PAIR_TABLE: SYMBOL_PAIR[NUMBER_OF_SYMBOL_PAIRS]

TRANSITION_INDEX_TABLE: TRANSITION_INDEX[SIZE_OF_TRANSITION_INDEX_TABLE]

TRANSITION_TABLE: TRANSITION[SIZE_OF_TRANSITION_TABLE]

TRANSITION_INDEX: INPUT_SYMBOL_NUMBER TRANSITION_NUMBER

TRANSITION: SYMBOL_PAIR_NUMBER TRANSITION_INDEX_NUMBER (WEIGHT)

BYTE_ORDER_MARK: byte_order_marker
VERSION: version_number
DETERMINISTIC, MINIMAL, CYCLIC, WEIGHTED: truth_value

TRANSITION_INDEX_NUMBER, SIZE_OF_TRANSITION_INDEX_TABLE: transition_index_number
TRANSITION_NUMBER, SIZE_OF_TRANSITION_TABLE: transition_number

SYMBOL: symbol_number
SYMBOL_PAIR: symbol_number symbol_number

SYMBOL_NUMBER, NUMBER_OF_ALL_SYMBOLS: symbol_number
INPUT_SYMBOL_NUMBER, NUMBER_OF_INPUT_SYMBOLS: input_symbol_number
SYMBOL_PAIR_NUMBER, NUMBER_OF_SYMBOL_PAIRS: symbol_pair_number

Explanation of the unweighted runtime binary format

The unweighted runtime format transducer consists of a header and five tables. The header contains information about the HFST transducer and the lengths of the five tables, so that reading the remaining portion of the transducer into memory only requires one input-operation.

The first three tables of the runtime transducer, the symbol table, the input symbol number table and the symbol pair table encode symbols, input symbol numbers and symbol pairs as numbers. These numbers are used in the two final tables, which encode the states and transitions of the transducer.

Each state in a runtime transducer is represented by an array, which contains as many places as there are input symbols in the alphabet of the transducer. Each input symbol is encoded by a input symbol number and the state array may be indexed using these numbers. The array is constructed so that one can find out whether there are transitions with a given input symbol by indexing the array using the input symbol number. If there are transitions with the input symbol, one can obtain a position in the final table of the runtime transducer, the transition table. The transitions of a state for a given input symbol are encoded in the transition table as a list beginning at the indicated position.

Each filled entry in a state array indicates which input symbol it corresponds to. Since the state arrays are not likely to be very densely populated, they are compacted in the fourth table of the runtime transducer, the transition index table, by placing arrays on top of each other so that no position is occupied by more than one filled array entry of any particular state.

The last table in the runtime transducer, the transition table, represents a list of transitions in which the transitions with a particular input symbol follow each other as consecutive entries. Hence all transitions with a particular input symbol in a state may be found, if the index of the first transition with the input symbol is known. The transitions consist of the symbol pair number of the transition and the target state of the transition in the transition index table.

An example of an unweighted transducer

In the example below, the list of symbols refers to symbol names listed in an external file demonstrating how to use any proprietary multi-character symbol encoding that the user might see fit.

We give a more thorough explanation of the runtime format and an annotated hexdump of the runtime transducer small_test.fst. The file small_test.fst was produced using the command sequence:

cat small_test.txt | hfst-txt2fst -A -a small_test.symbols | hfst-write-runtime-transducer small_test.symbols > small_test.fst

The transducer in text format

The text format transducer small_test.txt has two states: state 0 and state 1.

0       1       a       a
0
1       0       b       c

State 0 is the start state. It is a final state. State 1 is non-final.

There are two transitions in the transducer. One from state 0 to state 1 with pair a:a and another from state 1 to state 0 with pair b:c.

The symbol file of the transducer

The symbol numbering small_test.symbols of the transducer

0 <>
1 b
2 c
3 a

This is solution demonstrates how multi-character symbols in a transducer can be encoded with an external representation. Alternatively, one can make the external symbol table into a separate transducer (aka a tokenizer) and compose it with the original transducer to create a transducer with e.g. unicode input, in which case the symbol table of the transducer contains the four byte integers of the predefined unicode code points.

HEADER

The header contains information about the transducer and its binary representation. A runtime transducer binary file is essentially composed of five tables (the symbol table, the input symbol table, the symbol pair table the transition index table and the transition table). When the sizes of these tables are known, the transducer may be read in one large chunk.

The header contains four kinds of information:

  • The runtime transducer file specifies:
    • whether it was written on a platform with the same byte order as the platform which you are using.
    • which version of the runtime format it represents (currently this is 1).
  • The topology of the transducer indicates:
    • whether the transducer is deterministic
    • whether it is minimal
    • whether it is cyclic
    • whether it is weighted
  • The alphabet of the transducer specifies:
    • how many symbols there are in the alphabet. This is the size of the symbol table.
    • how many input symbols there are in the alphabet. This is the size of the input symbol table.
    • how many symbol pairs there are in the alphabet. This is the size of the symbol pair table.
  • The size of the remaining part of the transducer stated as:
    • the length of transition index table and
    • the length of the transition table.

The header of the binary small_test.fst:

0001 0000 - Byte order is the same as on this computer.
0001 0000 - Runtime transducer version 1.
0001 0000 - This transducer is deterministic.
0001 0000 - This transducer is minimal.
0001 0000 - This transducer is cyclic.
0000 0000 - This transducer is not weighted.

0004 - This transducer has four symbols <>, b c and a.
0003 - The transducer has three input symbols.
0002 - There are two symbol pairs in the transducer.

0008 0000 - The size of the transition index table is 8
0002 0000 - The size of the transition table is 2

SYMBOL TABLE (indexing begins at 0)

The symbol table is meant to be a table of unsigned integer representations of the symbols used in the transducer transitions. The size of the table NUMBER_OF_ALL_SYMBOLS is given in the header.

In this particular case, the symbol table contains the numbers from 0 to NUMBER_OF_ALL_SYMBOLS - 1. The symbols themselves are given in a separate file (small_test.symbols).

the symbol table of the binary small_test.fst:

0. 0000 0000 - The symbol, whose number is 0 is <> (as given in small_test.symbols).
1. 0001 0000 - The symbol, whose number is 1 is b (as given in small_test.symbols).
2. 0002 0000 - The symbol, whose number is 2 is c (as given in small_test.symbols).
3. 0003 0000 - The symbol, whose number is 0 is a (as given in small_test.symbols).

INPUT SYMBOL NUMBER TABLE (indexing begins at 0)

The input symbols in a runtime binary file need to be densely numbered to save space. At the same time, not every symbol in a transducer is necessarily an input symbol. Hence the input symbol number table gives a new dense numbering of symbols, which only takes into account input symbols.

The table consists of symbol numbers. The symbol number at position I in this table is the Ith input symbol of the transducer. The position in the input symbol number table is used for accessing the transition index table.

The input symbol 0 always corresponds to symbol number 0, regardless of whether there actually are transitions whose input symbol number is 0. This is because the symbol for epsilon must always correspond to the symbol number 0 as it needs to be handled in a special manner.

the input symbol number table of the binary small_test.fst:

0. 0000 - The input symbol number 0 is symbol number 0 (<>).
1. 0001 - The input symbol number 1 is symbol number 1 (b).
2. 0003 - The input symbol number 2 is symbol number 3 (a).

SYMBOL PAIR TABLE (indexing begins at 1)

The pairs of symbols labeling the transitions of a transducer are coded into numbers, which are used in the transition table of the runtime transducer.

The indexing of pairs begins at 1. This allows us to use the index 0 as a terminator for a list of transitions in the transition table. A terminator is required to prevent a run-on when the same pair is the last transition of one state and the first transition of the next state in the transition table.

Each entry in the symbol pair table consists of two symbol numbers. These are positions in the symbol table.

the symbol pair table of the binary small_test.fst:

1. 0001 0002 - Symbol pair number 1: <symbol number 1, symbol number 2> i.e. b:c.
2. 0003 0003 - Symbol pair number 2: <symbol number 3, symbol number 3> i.e. a:a.

TRANSITION INDEX TABLE (indexing begins at 0)

The states and transitions of a runtime transducer are simultaneously represented by two tables. The first table is the transition index table. The second is the transition table, which is described in the following section.

For every state in a transducer, the transition index table contains an array which indicates whether the state has transitions whose input symbol is a particular symbol.

The transition index table is composed of pairs of the form

INPUT_SYMBOL_NUMBER TRANSITION_NUMBER

The INPUT_SYMBOL_NUMBER is a position in the input symbol number table and TRANSITION_NUMBER is a position in the transition table.

The array corresponding to state S may be indexed with an input symbol number I. If the transition index table has the INPUT_SYMBOL_NUMBER I in position S+I+1, the transitions from state S with input symbol number I begin at position TRANSITION_NUMBER in the transition table.

The entry at position 0 in the array corresponding to state S indicates whether the state is a final state. If the state is final, the entry is

MAX_SHORT  1

If the state is non-final, the entry is

MAX_SHORT  0

the transition index table of the binary small_test.fst:

0. ffff 0001 0000 - The transition index table of the start state 0 begins here.
                    State 0 is a final state.
1. 0000 0000 0000 - No transitions from state 0 with input symbol number 0 (i.e. <>).

2. ffff 0000 0000 - The transition index table of state 1 begins here.
                    State 1 is not a final state.
3. 0002 0001 0000 - Transitions for state 0 with input symbol number 2 (i.e. symbol number 3,
                    i.e. symbol a) start at index 1 in the transition table.

4. 0001 0002 0000 - Transitions for state 1 with input symbol number 1 (i.e. symbol number 1,
                    i.e. symbol b) start at index 2 in the transition table.
5. 0000 0000 0000 - There are no transitions from state 1 with input symbol number 0.

6. 0000 0000 0000 - There are often some empty transition indeces at the end of the
                    transition index table.
7. 0000 0000 0000 -   

TRANSITION TABLE (indexing begins at 1).

The transition table consists of entries of the form

SYMBOL_PAIR_NUMBER TRANSITION_INDEX_NUMBER

where the SYMBOL_PAIR_NUMBER is the position of a symbol pair in the symbol pair table, and the TRANSITION_INDEX_NUMBER is a position in the transition index table.

The transitions of a state S are represented as a dense list in the transition table using consecutive entries. All of the transitions whose input symbol number is the same are guaranteed to follow one another. Given a state S, whose first transition with input symbol I occurs at index N in the transition table, there are M transitions with input symbol I, iff the input symbols of the transitions at the positions

N, N+1, N+2, ..., N+(M-1)

in the transition table are I and

  • the input symbol of the transition at position N+M is not I,
  • or the transition table ends at N+M,
  • or the pair number of the transition at position N+M is 0.

Note that 0 does not correspond to any pair as the indexing of the symbol pair table begins at 1.

the transition table of the binary small_test.fst:

1. 0002 0002 0000 - This is a transition with symbol pair number 2 
                    (i.e. <symbol number 3, symbol number 3>, i.e. a:a ) to the state 
                    whose transition indexes begin at transition index number 2 in the 
                    transition index table (i.e. state 1).

2. 0001 0000 0000 - This is a transition with symbol pair number 1 
                    (i.e. <symbol number 1, symbol number 2>, i.e. b:c ) to the state 
                    whose transition indexes begin at transition index number 0 in the
                    transition index table.

Explanation of the weighted runtime binary format

The weighted runtime binary format differs in two ways from the unweighted one as it has weights on transitions and on final states.

Weights on transitions

Each transition in the transition table has a weight. The entries in the transition table have the form

SYMBOL_PAIR_NUMBER TRANSITION_INDEX_NUMBER WEIGHT
A weight is encoded with float. For now, we use weights from the tropical semiring.

Weights on final states

The first entry in the array representing a state in the transition index table corresponding to a final state has the form

MAX_SHORT TRANSITION_NUMBER

The entry points to a transition of the form

0 0 WEIGHT
where WEIGHT is the final weight of the state. Note that a transition of this form cannot be confused with an actual transition as the indexing of symbol pairs begins at 1 -- not 0.

Note. The first entry in the array corresponding to a non-final state in the transition index table does not differ from the unweighted case.

An example of a weighted transducer

We do not give a very detailed description of the weighted runtime binary format as it is very similar to the unweighted format.

We give an annotated hexdump of a transducer small_test_weighted.fst. It was generated using the command sequence:

cat small_test_weighted.txt | hfst-txt2fst -w -A -a small_test_weighted.symbol_file | hwfst-write-runtime-transducer small_test.weighted.symbols > small_test_weighted.fst

The symbol file of the transducer

The symbol numbering of the runtime binary transducer small_test_weighted.fst.
0 <>
1 b
2 c
3 d
4 a
For more information, see the unweighted binary format explanation.

The transducer in text format

The text format of the transducer small_test.weighted.txt has two states 0 and 1.

0       1       <>      a       0
1       2.0
1       0       b       c       0.5
1       0       d       b       2.0
State 0 is the start state. It is non-final. State 1 is a final state with final weight 2.0.

There is one transition from state 0 to state 1 with symbol pair <>:a and weight 0. Ther are two transitions from state 1 to state 0. One with pair b:c and weight 0.5 and the another with pair d:b and weight 2.0.

The weighted runtime binary of the transducer

The places, which differ between the unweighted and weighted binary format are

  • in the header. The weighted truth_value is 1 instead of 0.
  • in the transition index table. The final-states have a finality transition instead of a two-valued final/non-final marker.
  • in the transition table. Each transition has a weight. There are finality transitions.
If something else seems unclear, you may wish to consult the unweighted binary format explanation.

annotated hexdump of file small_test_weighted.fst

0001 0000 - Byte order is the same as on this computer. 
0001 0000 - Weighted runtime binary format version 1.
0001 0000 - The transducer is determinisitic.
0001 0000 - The transducer is minimal.
0001 0000 - The transducer is cyclic.
0001 0000 - The transducer is weighted.

0005 - There are five symbols.
0003 - There are three input symbols.
0003 - There are three symbol pairs.

000a 0000 - Length of the transition index table is ten.
0004 0000 - Length of the transition table is four.

0. 0000 0000 - Symbol number 0 (corresponds to symbol <> as given in 
               small_example_weighted.symbols).

1. 0001 0000 - Symbol number 1 (corresponds to symbol b as given in 
               small_example_weighted.symbols).

2. 0002 0000 - Symbol number 2 (corresponds to symbol c as given in 
               small_example_weighted.symbols).

3. 0003 0000 - Symbol number 3 (corresponds to symbol d as given in 
               small_example_weighted.symbols).

4. 0004 0000 - Symbol number 4 (corresponds to symbol a as given in 
               small_example_weighted.symbols).

0. 0000 - Input symbol number 0 corresponds to symbol number 0 (<>).
1. 0001 - Input symbol number 1 corresponds to symbol number 1 (b).
2. 0003 - Input symbol number 2 corresponds to symbol number 3 (d).

1. 0000 0004 - Symbol pair number 1: (symbol number 0, symbol number 4), i.e. <>:a.
2. 0001 0002 - Symbol pair number 2: (symbol number 1, symbol number 2), i.e. b:c. 
3. 0003 0001 - Symbol pair number 3: (symbol number 3, symbol number 1), i.e. d:b.

0. ffff 0000 0000 - State 0 begins. A non-final state

1. 0000 0001 0000 - State 0 has a transition with input symbol number 0, i.e. symbol number 0,
                    i.e. symbol <>. Transitions with input symbol <> begin at position 1 
                    in the transition table.

2. ffff 0002 0000 - State 1 begins. A final state. The finality transition may be found at
                    position 2 in the transition table. 

3. 0000 0000 0000 - No transitions from state 1 with input symbol number 0 or from state 0 with
                    input symbol number 2.

4. 0001 0003 0000 - State 1 has a transition with input symbol number 1, i.e. symbol number 1,
                    i.e. symbol b. Transitions with symbol b begin at position 3 in the transition table.

5. 0002 0004 0000 - State 1 has a transition with input symbol number 2, i.e. symbol number 3,
                    i.e. symbol d. Transitions with symbol d begin at position 4 in the transition table.

6. 0000 0000 0000 - Empty transition index.

7. 0000 0000 0000 - Empty transition index.

8. 0000 0000 0000 - Empty transition index.

9. 0000 0000 0000 - Empty transition index.

1. 0001 0002 0000 0000 0000 - Transition with symbol pair number 1, i.e. <>:a, to transition index table
                              index 2. The transition weight is 0.0.

2. 0000 0000 0000 0000 4000 - Finality transition with weight 2.0.

3. 0002 0000 0000 0000 3f00 - Transition with symbol pair number 2, i.e. b:c, to transition index table
                              index 0. The transition weight is 0.5.

4. 0003 0000 0000 0000 4000 - Transition with symbol pair number 3, i.e. d:b, to transition index table
                              index 0. The transition weight is 2.0.

Binary Examples

The HfstBinaryFormatUnweightedExamples and HfstBinaryFormatWeightedExamples contain some examples of actual binary files.


-- MiikkaSilfverberg & KristerLinden - 09 Sep 2008
Topic revision: r36 - 2010-03-23 - KimmoKoskenniemi
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback