The following table summarizes crucial ideas:
2^{29}..2^{30}1  
implemented  HFST 3.1 
HFST 2.1 
SFST HFST 1.0 
SFST (unit sets) OpenFST, XFST, Flex, ... HFST 2.0 
XFST (lists) HFST 3.0 
SFST HFST 1.0 
HFST 1.0 

domain names 
PairClassId  PairId  Pair  EqClassId (multiple interpretations possible) 
SymClassId  SymId  Sym  
caches shared by all in core 
PairId↔Pair "pair cache" 
SymId↔Sym "symbol cache" 

PairClassId→2^{PairId} "set cache" 
SymClassId↔2^{SymId} "set cache" 

αbet dependent interpretation 
αbet → 2^{EqClassId} the eqclassids in use i.e. set EqClassId' 
αbet → 2^{SymId} symids whose eqclass ≠ 1 

αbet → ( EqClassId' → SymClassId )  
αbet → ( SymId → EqClassId' )  
αbet → ( Sym → EqClassId', i.e. "OpenFST symbol table" ) the unabridged ondisk representation 

transducer specific 
2^{PairClassId}  2^{Pair} "labelset in SFST" 

value type 
ids  id pairs  ids  strings  
range or examples 
SymId  SymId  (a,a) where a is a symid 
N/A  SymId  2^{30}..2^{31}1  "\p<>" "<>" "\\" "\x20" etc. 

SymClassId  SymClassId  { (a,a)  a ∈ A } where A is a symclassid 
N/A  2^{27}..2^{28}1  
EqClassId  EqClassId  { (a,a)  a ∈ A } where A is a symclassid or an eqclassid 
0..2^{27}1 fixed in each αbet 

PairId  2^{28}..2^{29}1 
A x B where A and B are eq/symclassids 
Notes on motivation:
All these representations are best for some purposes and worse for the other; All these representations are used in iIndustrial systems. By placing all these modes into a seamless model we obtain additional flexibility by allowing different representations in different algorithms and FST implementations.
If the generalizations are not exploited in algorithms, the only essential cost is the cost of using cached pairs during composition through the pair cache (23 memory fetches) instead of locally stored pairs (1 memory fetch).
The following table summarizes the complexity of operations for each type when the operands are in the cache but their combinations are not. If combinations are already seen, cache can return the answer in O (1) time.
crossproduct  O (1)  O (1)  O ( (a x b) log (a x b) )  O ( (a x b) log (a x b) )  N/A  
operation  time complexity  
Sym  Pair  EqClass  SymClass  PairClass  

insertion  O (1)  O (1)  O (a log a)  O (a log a)  O (a log a)  
deletion (useless)  O (1)  O (1)  O (a)  O (a)  O (a)  
lookup  O (1)  O (1)  O (a log a)  O (a log a)  O (a log a)  
equivalence  O (1)  O (1)  O (1)  O (1)  O (1)  
membership projection 
O (1)  O (1)  O (a) or O (1)  O (a) or O (1)  O (a) or O (1)  
intersection union difference composition 
O (1)  O (1)  O (1)  O ( (a + b) log (a + b) )  O ( (a + b) log (a + b) ) 
In fact, vectorized finitestate automata are an efficient representation for a POSET ordered calculus of symbol sets as finitestate labels (Kornai 1996). To support such a calculus, we should not assume that a∩b where a,b ∈ Sym is a≡b. The intersection of labels should be definable in the symbol cache, whose responsibility is to provide a function for intersecting two symbols (that can be bit vectors rather than strings as we normally assume).
In SFST, pairs are stored explicitly to the transition labels, and transition lists are not kept in order. OpenFST should be reviewed. If PairClassIds are used as transition labels, it is possible to pack sets of transitions locally even if they are not forming an equivalence class in the alphabet. For some types of automata, this can reduce the size of transition lists, which speeds up computations. In terms of bigO, the complexity is the same in the following cases (1) transition lists contain their members in a sorted order (presumably the OpenFST approach), (2) sorted lists are stored in cache (the proposed direction) and merged when necessary.
It is very important that the complexity of intersection, union and difference of two lists of sets is linear to the total number of members in these sets if the sets are sorted. The indicated complexity of composition of two sets of pairs requires that they are sorted according to both dimensions: upper and lower syms, i.e. they must be stored twice as sorted or sorted on the fly. Crossproduct is always expensive since it produces large sets of pairs. Mnemonization of operations would give O (1) time complexity for all aperations when computed again.
Example of combinations when computing intersection of sets denoted by two ids. The types are in disjunctive order from the most specific to the least specific. The cell with the best available complexity applies. Some cells contain two complexities. The first refers to an implementation with lists and the second with a hash. &empty refers to an empty symbol set.
PairClass  b ∈ a ? b : ∅ O (a) or O (1) 
b ∈ a ? b : ∅ O (a) or O (1) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 


a ∩ b  b  
Sym  Pair  EqClass HFST 2.0 
SymClass HFST 3.0 
PairClass HFST 3.1 

a  Sym  a ≡ b ? a : ∅ O (1) 
∅ O (1) 
a ∈ b ? a : ∅ O (1) 
a ∈ b ? a : ∅ O (b) or O (1) 
a ∈ b ? a : ∅ O (b) or O (1) 
Pair  ∅ O (1) 
a ≡ b ? a : ∅ O (1) 
∅ O (1) 
a ∈ b ? a : ∅ O (b) or O (1) 
a ∈ b ? a : ∅ O (b) or O (1) 

EqClass  b ∈ a ? b : ∅ O (1) 
∅ O (1) 
a ≡ b ? a : ∅ O (1) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 

SymClass  b ∈ a ? b : ∅ O (a) or O (1) 
b ∈ a ? b : ∅ O (a) or O (1) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 
{ c : c ∈ a ∧ c ∈ b } O (a + b) 
NOTE 2 by KL: As noted above, equivalence classes may look like a considerable space saver, but the equivalence classes in a weighted transducer will in general be relatively finegrained, which means that there will in general be no major difference between the number of equivalence classes and symbol handles, i.e. having the same info twice is mostly a waste of space.
NOTE 2.1 by AY: Some space is really lost to symbol handles when there is only one set of equivalence classes (alphabet) in the memory. However, if there are several transducers that do not share exactly the same alphabet, this also pays back because the common symbols (the name strings) in the alphabets are stored only once. In the legacy implementation (SFST::Alphabet), the symbols (the name strings) are stored in every copy of the same alphabet.NOTE 2.2 by AY: When transducers have been constructed using a usergiven alphabets, they often share at leat one of the alphabets. When two transducers then composed with tapes that use the same alphabets (including the keys for the equivalence classes), there is no need to harmonize them or even to read the alphabets.
There is one global SymbolSet, Σ
, i.e. the symbol table, into which symbols can be inserted at given position numbers or at the first free position number. The epsilon symbol is the constant epsilon_symbol
located at position number 0
. Print names for epsilon must be specified by the user. If no print name is specified for a symbol, the position number is used as a print name. An alphabet of symbol pairs is a SymbolPairSet, Π
. Only basic operations are provided on sets of symbol pairs. For more complex manipulations, see Transducer operations.
Objections (1/9/2008 by AY):Secondments (1/9/2008 by AY):
 All FST operations (such as generalized restriction, complementation etc.) can either assume it as an operand or use open alphabets if such are supported.
 When we move beyond character alphabets to parsing and disambiguation with FSTs, we need an alphabet with an unknown symbol.
 The global symbol set is the matter of the application such as a rule compiler.
 There are several alphabets. For example, joiners and other auxiliary symbols are not normally included to the total external alphabet. This makes the definitions of
Σ
andΠ
unclear.
We may, at our own peril, assume that internally any two stored transducers refer to the same symbol table and use the symbol numbers directly. The shortcut using symbol numbers directly is no doubt more speedy than a symbol table lookup if we are running a cascade of transducers. In order to guarantee compatibility between two stored transducers at runtime, we may need to harmonize their symbol tables during compiletime.
Secondment (1/9/2008 by AY):Remark (1/9/2008 by AY):
 This is now supported by the "big picture" through EqClassIds.
 In HFST 3.0, each equivalence class in the symbol table would be represented in the memory through an equivalence to a cached symbol set. This is part of harmonization, but happens inside the symbol table, with no effects to the labels in transducers.
 Further harmonization may take place when an EqClassId combined under an operation that is not closed under the set of equivalence classes of the alphabet.
 OpenFST pipes allow not only an input and output alphabets, but a lot of intermediate alphabets. In Xerox, there are published methods that optimize intermediate alphabets. Thus, an assumption about a global alphabet is quite strong, although it often happens that alphabets are comparable or equal.
 XFST assumes that each network knows an networkspecific subset of all alphabets. However, the input and output alphabets are assumed to be the same.
 We may have networks with incomprable alphabets whose numbers are not in harmony. We need to cope with them while
 being fast when harmonization is not necessary
 being able to cache things when necessary
There is a division of labour (proposed 1/9/2008 by AY):
Some observations with regard to the current approach:
AY: Instead of postulating parallel print names, we would perhaps like to talk about an equivalence class of symbols.
alphabet.h
.
NOTE. Version 2.1
other_symbol
, which represents any symbol not mentioned in the symbol table. The treatment of the other_symbol
(cf. Xerox other symbol) will be further specified.
NOTE. Version 3.0
TWiki Reference