HFST: API Proposal

The big picture of symbols, pairs and their sets:

The following table summarizes crucial ideas:

  1. flexibility in transition label packing
  2. caching in symbol storing
  3. shared number space for cache hanles i.e. symbol, set, pair and pair set identifiers
  4. user-definable numbers for equivalence are not keys of caches

229..230-1            
implemented

HFST 3.1


HFST 2.1
SFST

HFST 1.0
SFST (unit sets)
OpenFST, XFST, Flex, ...
HFST 2.0

XFST (lists)
HFST 3.0
SFST

HFST 1.0


HFST 1.0
domain
names
PairClassId PairId Pair EqClassId
(multiple interpretations
possible)
SymClassId SymId Sym
caches
shared
by all
in core
  PairId↔Pair
"pair cache"
  SymId↔Sym
"symbol cache"
PairClassId→2PairId
"set cache"
  SymClassId↔2SymId
"set cache"
 
α-bet
-dependent
interpretation
  α-bet → 2EqClassId
the eqclassids in use
i.e. set EqClassId'
α-bet → 2SymId
symids whose eqclass ≠ -1
 
  α-bet → ( EqClassId'SymClassId )  
  α-bet → ( SymIdEqClassId' )  
  α-bet → ( SymEqClassId', i.e. "OpenFST symbol table" )
the unabridged on-disk representation
transducer
specific
2PairClassId   2Pair
"labelset in SFST"
 
value
type
ids id pairs ids strings
range or
examples
SymId SymId (a,a) where
a is a symid
N/A SymId 230..231-1 "\p<>"
"<>"
"\\"
"\x20"
etc.
SymClassId SymClassId { (a,a) | a ∈ A }
where A is a symclassid
N/A 227..228-1    
EqClassId EqClassId { (a,a) | a ∈ A }
where A is
a symclassid
or an eqclassid
0..227-1
fixed in each α-bet
     
PairId 228..229-1
A x B where A
and B are
eq/symclassids
       

Notes on motivation:

  1. Flexibility in transition label packing. A set of transitions from state q to state r on a set of symbols can be represented in different ways:
    1. storing a transition per a symbol pair
    2. storing a transition per a class of equivalent symbols
    3. storing transitions whose labels are sets covering the set of symbols
All these representations are best for some purposes and worse for the other; All these representations are used in iIndustrial systems. By placing all these modes into a seamless model we obtain additional flexibility by allowing different representations in different algorithms and FST implementations.
  1. Caching of symbols. Symbols can be large objects, e.g. bit vectors or multi-character symbols. It is not wise to store them separately to every alphabet. Dynamic memory allocation for such objects can be taken over by caches that store a unique copy of each object and return an identifier that is used to refer to it.
  2. Shared number space for cache hanles i.e. symbol, set, pair and pair set identifiers. It is possible to allocate ranges of identifiers for different kinds of objects in the caches. The current proposal is one of the possible implementations. It maintains the lowest numbers to equivalence classes in OpenFST symbol tables/alphabets. Large portion of the number space is still unused. (Yet OpenFST suggests the number space of 64-bits, while SFST is using, by default only 16-bits.)
  3. User-definable numbers for equivalence are not keys of caches. OpenFST symbol tables/alphabets have user-definable numbers for equivalence classes of symbols. The numbering can vary from an alphabet to an alphabet. The current account would support them while it would separate them from the globally defined identifiers of to the same sets of symbols.

Complexity of Basic Operations on Transition Labels of Different Complexities

If the generalizations are not exploited in algorithms, the only essential cost is the cost of using cached pairs during composition through the pair cache (2-3 memory fetches) instead of locally stored pairs (1 memory fetch).

The following table summarizes the complexity of operations for each type when the operands are in the cache but their combinations are not. If combinations are already seen, cache can return the answer in O (1) time.

cross-product O (1) O (1) O ( (|a| x |b|) log (|a| x |b|) ) O ( (|a| x |b|) log (|a| x |b|) ) N/A
operation   time complexity
  Sym Pair EqClass SymClass PairClass
insertion O (1) O (1) O (|a| log |a|) O (|a| log |a|) O (|a| log |a|)
deletion (useless) O (1) O (1) O (|a|) O (|a|) O (|a|)
lookup O (1) O (1) O (|a| log |a|) O (|a| log |a|) O (|a| log |a|)
equivalence O (1) O (1) O (1) O (1) O (1)
membership
projection
O (1) O (1) O (|a|) or O (1) O (|a|) or O (1) O (|a|) or O (1)
intersection
union
difference
composition
O (1) O (1) O (1) O ( (|a| + |b|) log (|a| + |b|) ) O ( (|a| + |b|) log (|a| + |b|) )

In fact, vectorized finite-state automata are an efficient representation for a POSET ordered calculus of symbol sets as finite-state labels (Kornai 1996). To support such a calculus, we should not assume that a∩b where a,b ∈ Sym is a≡b. The intersection of labels should be definable in the symbol cache, whose responsibility is to provide a function for intersecting two symbols (that can be bit vectors rather than strings as we normally assume).

In SFST, pairs are stored explicitly to the transition labels, and transition lists are not kept in order. OpenFST should be reviewed. If PairClassIds are used as transition labels, it is possible to pack sets of transitions locally even if they are not forming an equivalence class in the alphabet. For some types of automata, this can reduce the size of transition lists, which speeds up computations. In terms of big-O, the complexity is the same in the following cases (1) transition lists contain their members in a sorted order (presumably the OpenFST approach), (2) sorted lists are stored in cache (the proposed direction) and merged when necessary.

It is very important that the complexity of intersection, union and difference of two lists of sets is linear to the total number of members in these sets if the sets are sorted. The indicated complexity of composition of two sets of pairs requires that they are sorted according to both dimensions: upper and lower syms, i.e. they must be stored twice as sorted or sorted on the fly. Cross-product is always expensive since it produces large sets of pairs. Mnemonization of operations would give O (1) time complexity for all aperations when computed again.

Example of combinations when computing intersection of sets denoted by two ids. The types are in disjunctive order from the most specific to the least specific. The cell with the best available complexity applies. Some cells contain two complexities. The first refers to an implementation with lists and the second with a hash. &empty refers to an empty symbol set.

PairClass b ∈ a ? b : ∅
O (|a|) or O (1)
b ∈ a ? b : ∅
O (|a|) or O (1)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
a ∩ b b
Sym Pair EqClass
HFST 2.0
SymClass
HFST 3.0
PairClass
HFST 3.1
a Sym a ≡ b ? a : ∅
O (1)

O (1)
a ∈ b ? a : ∅
O (1)
a ∈ b ? a : ∅
O (|b|) or O (1)
a ∈ b ? a : ∅
O (|b|) or O (1)
Pair
O (1)
a ≡ b ? a : ∅
O (1)

O (1)
a ∈ b ? a : ∅
O (|b|) or O (1)
a ∈ b ? a : ∅
O (|b|) or O (1)
EqClass b ∈ a ? b : ∅
O (1)

O (1)
a ≡ b ? a : ∅
O (1)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
SymClass b ∈ a ? b : ∅
O (|a|) or O (1)
b ∈ a ? b : ∅
O (|a|) or O (1)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)
{ c : c ∈ a ∧ c ∈ b }
O (|a| + |b|)

Discussion

NOTE 2 by KL: As noted above, equivalence classes may look like a considerable space saver, but the equivalence classes in a weighted transducer will in general be relatively fine-grained, which means that there will in general be no major difference between the number of equivalence classes and symbol handles, i.e. having the same info twice is mostly a waste of space.

NOTE 2.1 by AY: Some space is really lost to symbol handles when there is only one set of equivalence classes (alphabet) in the memory. However, if there are several transducers that do not share exactly the same alphabet, this also pays back because the common symbols (the name strings) in the alphabets are stored only once. In the legacy implementation (SFST::Alphabet), the symbols (the name strings) are stored in every copy of the same alphabet.

NOTE 2.2 by AY: When transducers have been constructed using a user-given alphabets, they often share at leat one of the alphabets. When two transducers then composed with tapes that use the same alphabets (including the keys for the equivalence classes), there is no need to harmonize them or even to read the alphabets.

Prior Discussion

There is one global SymbolSet, Σ, i.e. the symbol table, into which symbols can be inserted at given position numbers or at the first free position number. The epsilon symbol is the constant epsilon_symbol located at position number 0. Print names for epsilon must be specified by the user. If no print name is specified for a symbol, the position number is used as a print name. An alphabet of symbol pairs is a SymbolPairSet, Π. Only basic operations are provided on sets of symbol pairs. For more complex manipulations, see Transducer operations.

Objections (1/9/2008 by AY):
  1. All FST operations (such as generalized restriction, complementation etc.) can either assume it as an operand or use open alphabets if such are supported.
  2. When we move beyond character alphabets to parsing and disambiguation with FSTs, we need an alphabet with an unknown symbol.
  3. The global symbol set is the matter of the application such as a rule compiler.
  4. There are several alphabets. For example, joiners and other auxiliary symbols are not normally included to the total external alphabet. This makes the definitions of Σ and Π unclear.
Secondments (1/9/2008 by AY):
  1. We assume that 0 is special number as a symbol identifier (SymId) or equivalence class identifier (EqClassId). However, identifiers should not be compared with 0 by the users. Leave it to alphabet.

We may, at our own peril, assume that internally any two stored transducers refer to the same symbol table and use the symbol numbers directly. The shortcut using symbol numbers directly is no doubt more speedy than a symbol table look-up if we are running a cascade of transducers. In order to guarantee compatibility between two stored transducers at runtime, we may need to harmonize their symbol tables during compile-time.

Secondment (1/9/2008 by AY):
  1. This is now supported by the "big picture" through EqClassIds.
  2. In HFST 3.0, each equivalence class in the symbol table would be represented in the memory through an equivalence to a cached symbol set. This is part of harmonization, but happens inside the symbol table, with no effects to the labels in transducers.
  3. Further harmonization may take place when an EqClassId combined under an operation that is not closed under the set of equivalence classes of the alphabet.
Remark (1/9/2008 by AY):
  1. OpenFST pipes allow not only an input and output alphabets, but a lot of intermediate alphabets. In Xerox, there are published methods that optimize intermediate alphabets. Thus, an assumption about a global alphabet is quite strong, although it often happens that alphabets are comparable or equal.
  2. XFST assumes that each network knows an network-specific subset of all alphabets. However, the input and output alphabets are assumed to be the same.
  3. We may have networks with incomprable alphabets whose numbers are not in harmony. We need to cope with them while
    1. being fast when harmonization is not necessary
    2. being able to cache things when necessary

There is a division of labour (proposed 1/9/2008 by AY):

  1. There are caches (or pools) for various kinds of symbols and combinations, but caches are not stored on disk. Caches store all used symbols and combinations of symbols only once in the memory.
  2. I/O functions decode and encode symbols to and form a symbol cache. The for purposes of OpenFST compatibility, symbol cache stores e.g. strings as symbols in a canonized form with the following conventions:
    1. the canonized forms could be used in OpenFST symbol table files without problems (explicit display format and valid file format)
    2. \ is an escape character, the following escape sequences are supported: \0 \r \n \a \b \v \f \d \s (non-standard quotation for ASCII 32), \x (hex char) \e (ascii escape) \p (HFST protection)
    3. \p-prefix is used to protect some internal symbols (e.g. <>) from their equivalent looking counterparts that are available to the users
    4. all ASCII control characters (0-31) are quoted
    5. the quotes are decoded first in order to obtain byte strings; the decoded byte string can be a UTF-8 string or any other type of byte string.
  3. An alphabet (or *symbol table) contains classes of symbols.
    1. These classes are extended in HFST 2.0 to sets that are not allowed to overlap inside an alphabet.
    2. The design and purpose of alphabets is similar to that of OpenFST symbol tables, but possibly more decomposed in the memory.
    3. The alphabet must remember all symbols mentioned in the file. There is no way to tell in OpenFST that a certain class is the class for unknown symbols. Therefore, it is better that the alphabet maps unknown symbols to equivalence class -1. This class is then treated later in the transducer either by throwing it away or by collapsing it with an existing equivalence class.
  4. Each transducer has an input and an output alphabet that can be unknown
    1. If the alphabets are unknown then the same set of equivalence classes is assumed when tapes from two transducers are combined under an operation.
    2. If the alphabet is not integrated to the same file, it must be possible to separately load an alphabet.
    3. When an alphabet is known, then:
      1. It implements an equivalence relation over the symbols in the symbol cache; the equivalence relation is specific to each alphabet
      2. It implements a mapping from the symbols stored to the symbol cache to the equivalence class identifier used by the transducer
      3. it helps to relate the equivalence classes in the transducer to symbol sets in other transducers through the members of the sets
    4. if we need to reduce the number of alphabets in some situations, we have alternative solutions:
      1. concatenate the alphabets by renumbering the other alphabet (this can break the injection from symbols to classes!)
      2. use same symbol classes and positions according to priority union
      3. like OpenFST, add the missing symbols from the second alphabet to the first one with their original symbol class number (this assumes that symbols classes are shared, but just incompletely specified)
      4. assume that the block cannot become bigger and create a new harmonized symbol table with all symbols in refined blocks
    5. it is not valid in general to assume that the input and the output alphabets are unifiable if they are not given as identical
  5. The attributes of every transducer specify which symbols have a special use in the transducer.
    1. If there is an equivalence class of that is to be unified with equivalence class -1 (unknown symbols), this is specified by an attribute. Some transducer formats may contain preset valuest for this attribute (e.g. Karttunen 1990 assumes SymId denotes all unknown symbols).
    2. The failure interpretation of a symbol class is optional.
  6. In compilers such as LEXC and SFST, there is a transducer whose role is to present the universal language over the used symbol pairs, and the corresponding pair alphabet and its projections.

Some observations with regard to the current approach:

  1. The conceptual difference between a Symbol and its position is that a symbol is a combination of a set of print names with a common position in the symbol table.
AY: Instead of postulating parallel print names, we would perhaps like to talk about an equivalence class of symbols.
  1. FST libraries such as OpenFST allow different alphabets for the input and the output side.
  2. The symbol pairs in use in a transducer or in an application such as the LEXC or SFST compiler would form SymbolPairSets that have iterators for SymbolPairs, but it would also be nice to have separate iterators for input and output symbols like in the current alphabet.h.

NOTE. Version 2.1

  1. The other symbol is the constant other_symbol, which represents any symbol not mentioned in the symbol table. The treatment of the other_symbol (cf. Xerox other symbol) will be further specified.

NOTE. Version 3.0

  1. Different kinds of complex objects (strings, bit vectors, sequences, regular expression, ...) can be inserted as Symbols provided that the symbol table of the implementation supports such object types. However, after they have been inserted in the symbol table, HFST treats such complex symbols like atomic symbols, i.e. it does not interpret the internal structure of such objects.
  2. We need definitions for attributes of transducers. See Karttunen (Binary encoding format for finite-state networks, SSL-90-17) for some standard attributes.
Topic revision: r3 - 2009-09-30 - ErikAxelson
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback