No HFST: Removed discussion on the other symbol

The discussion below was removed from HfstOtherSymbol because they are no more in the focus of our discussion and they could induce (unless updated) some inconsistencies with the latest account. Especially, other and any symbols may have different interpretations in different accounts. You may read the material below if you ensure that you can put the comments to their meant contexts.


Note 6.1. other in transducers and any symbols do have a common origin in regular expressions

We should not draw so clear opposition between identity-other in FSAs/FSTs and identity-any in regular expressions. The fact is that if we use identity-any as an isolated subexpression of regular expression, it corresponds to a one-transition automaton s0: identity-other->fs1. If the containing regular expressions is then evaluated in the bottom-up fashion through closure properties of automata, then the elementary automaton for identity-other and its regexp counterpart identity-any evolve to two different symbols. The regexp symbol is no more present in complex automata, while the set of symbols denoted by idenitity-other evolves into more particular set of symbols as the listed set of symbols in subexpressions becomes larger during the bottom-up evaluation. The same holds for irreflexive-other and irreflexive-any.

Note 6.2. Also any-pair or <.> and its FST-level counterpart have a common origin in regular expressions.

In AFST-20080214, the common origin of <.> in regular expressions and in transducers involves pair harmonization: AFST provides a meta symbol <.> (in addition to . in SFST). This stands for the unknown pair alphabet (the alphavet o SFST-PL programs) in regular expressions, and a symbol in transition labels. Its semantics is illustrated by the examples: <.> & ( a:b | c:d | <d>) = a:b | c:d <d> and a:<.> & <.>:b = a:b. Note that the semantics of <.> was implemented in regular expressions and transducers with no difference in the naming: symbol anysymb was used in all cases. In our current terminology in this article, the symbol any-pair became symbol other-pair when the transducer was constructed. The availability of any-pair i.e. =<.> in AFST allowed for the following kinds of programs:

$X$ = a:b <.> c d:<.> 
ALPHABET = a:b c d:e d:f
$X$:known
The operator :known assigned the contents of the define pair alphabet . for >.>, thus closing the alphabet. The above AFST script resulted to the following transducer whose pair alphabet is closed:
s1:   a:b -> s2
s2:   a:b -> 3,  c -> 3, d:e -> 3, d:f -> 3
s3:   c -> s4
s4:   d:e -> fs5,  d:f -> fs5
fs5.

-- AnssiYliJyra - 26 Sep 2008

NOTE6.2 In this discussion, we have not always differentiated between any symbol i.e. <.> and any pair <.>:<.>. This is a source for confusion. In fact, we should also consider the possibility of any-identity-pair i.e. ? in XFST regular expression.
-- AnssiYliJyra - 27 Sep 2008

NOTE6.3 There are many ways to convert any to other. Some of these ways postpone the expansion and use any or in transducers up to a point when we need to get rid of it (e.g. under relative complementation e.g. <.>:<.> - that results into a list of symbols).

When we are talking about any in regular expressions we do not assume that e.g. OpenFST could process it as a separate class of symbols with a key as usual. Instead, it should be harmonized by the wrapper and replaced with the set of known symbols and symbol other. There are two ways to do this: the any regexp is compiled with an other automaton and then the whole regular expression is compiled into automaton using the closure properties of automata. Alternatively, the alphabet of the regular expression is computed first, then it is propogated down to expression tree leaves where any occurs, and finally any subexpression is compiled with a union of all known symbols and the other symbol. Again, it is hard to see, how the meaning of any would changed. It is only the interpretation of the other symbol that undergoes the change.

There is also a way to see the any symbol as a metasymbol in automata. This approach tries to avoid the expensive route (i): compiling any in regular expressions with an other automaton and harmonizing the other transition in this automaton each time when its matrix automaton gains more characters, or another expensive route (ii): collecting the known symbols and compile any in regular expressions immediately in a way that is in harmony with the whole. The approach takes advantage of the wrapper's ability to open a third route: (i) compiling any in regexps to any in transducers, collecting the known symbol from the automaton and then harmonizing any using the other symbol and a list of known symbols. Accordingly, a part of the regexp compilation is implemented through harmonization that takes place at a point when the semantics of any have to be explicated. We could say, that any could be stay unharmonized under normal regular expression operations (union, concatenation, star), shuffle and composition while the explication and harmonization of this metasymbol becomes necessary under complementation and most rule operation.
--
AnssiYliJyra - 26 Sep 2008

NOTE7. The availability of metasymbols over known pairs and symbols is an important improvement to the size of intermediate results.

The availability of any in intermediate transducers is an important optimization. For example, XFST compiles [a | b | c | d | e | f | g | h] ?:? with a 56 transitions because the representation replaces any with identity-other, irreflexive-other, and the list of known pairs. With the any metasymbol, the result would have needed only 9 transitions.

s0:   a -> s1, b -> s1, c -> s1, d -> s1, e  -> s1, f -> s1.
s1:   ? -> fs2, a -> fs2, b -> fs2, c -> fs2, d -> fs2, e  -> fs2, f -> fs2, <a:b> -> fs2, <a:c> -> fs2, 
  <a:d> -> fs2, <a:e > -> fs2, <a:f> -> fs2, <a:?> -> fs2, <b:a> -> fs2, <b:c> -> fs2, <b:d> -> fs2, 
  <b:e > -> fs2, <b:f> -> fs2, <b:?> -> fs2, <c:a> -> fs2, <c:b> -> fs2, <c:d> -> fs2, <c:e > -> fs2,
  <c:f> -> fs2, <c:?> -> fs2, <d:a> -> fs2, <d:b> -> fs2, <d:c> -> fs2, <d:e > -> fs2, <d:f> -> fs2,
  <d:?> -> fs2, <e :a> -> fs2, <e :b> -> fs2, <e :c> -> fs2, <e :d> -> fs2, <e :f> -> fs2, <e :?> -> fs2, 
  <f:a> -> fs2, <f:b> -> fs2, <f:c> -> fs2, <f:d> -> fs2, <f:e > -> fs2, <f:?> -> fs2, <?:a> -> fs2, 
  <?:b> -> fs2, <?:c> -> fs2, <?:d> -> fs2, <?:e > -> fs2, <?:f> -> fs2, <?:?> -> fs2.
fs0:

My conclusion is that identity-other and irreflexive-other are less problematic with OpenFST etc. because they indicate disjoint sets of symbol pairs. However, as we already have a harmonization wrapper over OpenFST, it would be easy to allow the use of any as long as it does not cause problems with the naive semantics of the openFST.

-- AnssiYliJyra - 26 Sep 2008

Terminological choices
What is the right term choice: expansion vs. harmonization. The choice choild be motivated by prior use.

Expansion or extension:

  • ? adding new symbols to the alphabet
  • ? removing flag-diacritics and SFST's variables
  • replacing a transition labeled with symbol a from state 1 with a transition labeled with an extended pair (1,a) (Roche 1995, Yli-Jyrä 1995)
  • replacing brackets "<" and ">" occurring at a certain depth d in bracketed strings with an extended bracket that indicates the depth (d,<) (Yli-Jyrä 2004, 2005)
  • the above use meaning that some symbols are split i.e. ? can become {a,b,c,?}
  • this term suggests that it is reversible by a compression operation
    • complex symbols can always be mapped to simpler ones by homomorphism
    • in fact also splitted symbols can be replaced with a class
    • in both cases some information can be lost
  • these terms do not convey the purpose of extension

Harmonization:

  • making keys in labels of transducers comparable by rekeying (renumbering) the symbols used by the transducer (SFST does this sometimes when combining transducers)
  • making the symbol sets and their keys used in transducers comparable by splitting the symbol sets if needed and by rekeying the symbol sets
  • in this document meaning methods to make the symbols, symbol pairs, their sets and keys and keys of the sets comparable
    • Kimmo suggested a dichotomy: symbol harmonization and pair harmonization
    • Anssi suggested a taxonomy according to the stage of processing when harmonization is done: on-demand, prepared on-demand, preemptive, implicit. Such a taxonomy is not complete but covers most typical methods to construct transducers.
  • conveys well the purpose: harmony and consistence

The final commitment to the term choices needs to be done.

-- AnssiYliJyra - 26 Sep 2008

The roles of symbol and pair harmonizations

Key harmonization takes always place if the combined keys are not from the same key table. Key harmonization is a low-level harmonization that does not affect the semantics. Equivalence class harmonization is too complex to be discussed in this comment. I assumed above that each key is for one symbol. Pair harmonization is sufficient for most computations. It is also often efficient.

Symbol harmonization is needed when we want to convert between our FST representation and XFST representation. In XFST representation, the pairs of the transducer are not listed. Instead, only the symbols are listed and then the set of pairs consists of (i) pairs in transitions and (ii) identity pairs. Such a pair set conform the condition: if a:b is a known pairs then a:a and b:b are known pairs.

In sum, there is only a trade-off between

  • listing the known pairs (even those may not occur in transitions) and using other-pair (<.> of AFST and (??? check!) =:= of TWOL)
  • listing only the known symbols and useful pairs and using idenitity-other and irreflexive-other (? and ?:? in XFST)

Mathematically XFST-notation is more elegant because the pairset is closed under projection. We do not need to introduce missing identity pairs during projection. However, we could also consider a combination, because the three metacharacters could be interpreted by the preemptive harmonization.

-- AnssiYliJyra - 26 Sep 2008

Symbol Pair Calculus

It appears to me that it is slightly easier to consider first individual pairs and their combinations and then to go into the generalizations that iterate over all pairs in two transducers. I pursue toward such calculus of pais in this section. I call the calculus "symbol pair calculus".

Definition. Let E be the union of alphabets Esym and Emeta. Let Emeta be {other-pairs, idenitity-other-relation, irreflexive-other-relation}. The set of all abstract symbol pairs over E, P(E), is defined to contain the following:

  • not-available:not-available (used when an operation fails, e.g. if we complement <.> we should get nothing, i.e. the empty set)
  • other-pairs:other-pairs (used to refer to the complement of the known set of pairs)
  • identitity-other:identitity-other
  • irreflexive-other:irreflexive-other
  • a:a, a:other-pairs and other-pairs:a, a:irreflexive-other and irreflexive-other:a, where a∈ Esym
  • a:b where a,b∈ Esym.
No other pair a:b where a,b∈ E is an abstract symbol pair.

Definition. Let Pa, Pb, and Pc be sets of listed abstract symbol pairs over E. Let a:a'∈Pa, b:b'∈Pb and c:c'∈Pc. The following operations are called abstract symbol-pair operations. The result of the operations are sometimes set-valued.

  • C(a:a') where a:a'∈Pa (close the alphabet by removing =other=-like symbol)
  • NPb(a:a') where a:a'∈Pa (used to compute complement)
  • KPa(a:a') where a:a'∈Pa (applied when the set of possible pairs that need to be listed is fixed)
  • HPa,Pb(a:a') where a:a'∈Pa (needed in harmonization)
  • a:a' ∩ b:b' (needed in intersection and determinization)
  • a:a' ∪ b:b' (needed when computing equivalent symbol pairs or harmonized alphabets)
  • a:a' - b:b' (needed when computing difference)
  • a:a' xor b:b' (needed when computing exclusive or of languages)
  • a:a'  o  b:b' (needed in composition)
  • a:a'  o  b:b'  o  c:c' (needed in three-way composition that is more efficient than two compositions)
  • I(a:a') and O(a:a') (needed in projection)
  • I-1(a') and O-1(a) (needed in oriented contexts and surface coercion of GTWOL, useful also for computing crossproduct)
  • V(a:a') (needed in inversion)

-- AnssiYliJyra - 26 Sep 2008

-- AnssiYliJyra - 25 Sep 2008

About naming of symbols for two (main) kinds of unknown pairs

Many FST libraries implement symbol pairs as pairs without the identity constraint. Accordingly, the identity constraint is not supported by the libraries. Nevertheless, we can still find the solution through harmonization. The solution would be not to strive at two kinds of ?:? but to two kinds of ?. Instead of talking about how we obtain the symbol pair sets like other .x. other and Identity(other), we could talk about relations what they represent. Then we could denote such relation with a suitable symbol x and extend it into symbol pairs using the convention x=x:x.

Some standard definitions:

  • Function f is an injection_/_one-to-one if it maps each member in its domain to exactly one member in the co-domain. Function f is a surjection_/_onto if its range covers its whole co-domain. Function f is bijective_/_one-to-one and onto if it is both an injection and a surjection.
  • In an identity relation "R", every element of the set “A” is related to itself only. Identity relation is a reflexive, symmetric and transitive relation. It is used when each symbol are equivalent only to itself, meaning that there are no nontrivial blocks of equivalent symbols. The idenitity relation is a bijection.
  • In reflexive relation, "R", every element of the set “A” is related to itself. It is an extension of the idenitity relation.
  • A relation is equivalence relation if it is reflexive, symmetric and transitive at the same time. Equivalence relation is a natural choice when the alphabet consists of blocks of equivalent symblols.
  • Every idenitity relation is an equivalence relation and a bijection. Accordingly, we could call "?" in e.g. XFST as an identity relation, bijective equivalence relation, or the reflexive, symmetric and transitive bijection.

  • A relation over the set "A" is irreflexive if, no element in "A" is related to itself.
  • A relation is nontransitive, if it is neither transitive nor intransitive. If set E is small enough and U(..)=E, then relation over E represented by label ?:? in XFST can be intransitive (a:b, b:a?:?, but a:a not in ?:?). For a slight bigger E, ?:? is non-transitive (e.g. a:b, b:c, a:c?:?). If we assume that E is always big enough, we known that ?:? is a non-transitive relation. In addition, it is symmetric and irreflexive. Under the XFST interpretation, relations represented by ? and ?:? are disjoint but their union contains all pairs. Therefore, we can term ?:? as the complement of ? w.r.t. the set of all pairs of unknown symbols.

known notation interpretation metasymbol symbol pair needed in most implementations
? (XFST) or =:= (TWOL?) the largest identity relation over other symbols identity-other identity-other:identity-other
?:? (XFST) the largest irreflexive relation over other symbols irreflexive-other irreflexive-other:irreflexive-other

-- AnssiYliJyra - 25 Sep 2008

Comparison of different "anys" in different implementations.

"any" in regular expressions:

type of pairs XFST regexps AFST regexps SFST regexps TWOL regexps
non-identity pairs (when the set of pairs is not defined) ?:? <.> N/A  
identity pairs (when the set of pairs is not defined) ? N/A  
pairs of a known symbol and any other symbol (when the set of pairs is not defined) a:? ?:a a:<.> <.>:a N/A  
defined pairs containing a specified symbol (when the set of pairs is defined) a:. .:a a:. .:a  
non-identity pairs (when the set of pairs is defined) ?:? . .  
identity pairs (when the set of pairs is defined) ?  
Please help fill the last column!

Topic revision: r2 - 2008-10-03 - AnssiYliJyra
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback