HFST: Other interpreted as a simple symbol

This page is devoted to one particular and simple interpretation for other (alias ?) and the corresponding implementation and consequences. The interpretation is in short:

  1. In FSAs and in FSTs the ? symbol is like any other symbol in their alphabet.
  2. There is no separate concepts for any and other, there is only the ? whose interpretation is external to the machine and thus not coded in the transition or elsewhere in the machine.
  3. If the alphabets, including the ? are identical in two FSAs, then the normal FSA operations apply (without any special treatment of ?.
  4. Iteration of FSTs are produced normally without special treatment of ?.
  5. If the input alphabets of two FSTs are identical (including the ?) and their output alphabets are identical, then the normal operations of union, concatenation (and intersection if applicable) apply normally, without special consideration of the ?.
  6. If the output alphabet of FST1 and the input alphabet of FST2 agree, then the composition can be computed without special treatment of the ?.
  7. If the alphabets of FSAs or FSTs fail to agree as above, the corresponding alphabets must be harmonized by adding some transitions for symbols from the other automaton as described below.
  8. The interpretation of ? is thus deferred, partly to possible later operations where harmonization is needed, and partly to the execution of a FSA as a recognizer or FST as a transducer.

Benefits

  1. One can use the SFST and OpenFST calculi as they are, no reworking of the internals is necessary.
  2. It will be easier to include further finite-state engines in HFST in future.
  3. Two-level rules (and other relating transducers) will be freed from the requirement of a predefined set of allowed pairs.
  4. The core of the HFST interface will be simpler, more general, and straight-forward.
  5. All stages refer to finite and known characters. No need to operate with infinite or truly unknown symbols (in definitions and operations) until at run-time when it is no more a problem (just match the symbol not in the known alphabet with a ? transition).

Harmonization of FSAs

  • To harmonize FSA1, compute S12 as the difference of the concrete alphabet of FSA2 and FSA1 (including possible =?=s as one symbol).
  • FOR EACH transition t from Qi to Qj in FSA1 with label ?, add a transition from Qi to Qj with each label in S12.
  • Note that if there are no ? transitions in FSA1, nothing will be added.
  • Harmonize FSA2 similarly.

Harmonization of FSTs for concatenation, union and intersection

In replace rules, the obvious default is the identity correspondence, because the rule only says about the center and the relevant context. Anything beyond the center must remain as it was. On the other hand, the identity correspondence is not useful in two-level rules. Once you enter identity correspondences, two rules will be in fatal conflict.

Thus we will introduce two kinds of harmonizations, one called identity harmonization for replace rules, and the other called pair harmonization e.g. for two-level rules. Both harmonize by adding transitions to accompany the original ? transitions, but slightly in a different way.

For identity harmonization, we expand ? transitions roughly as follows:

  • FOR ALL transitions (Qi ?:? -> Qj) IN FSA1, add transitions (Qi ?:? -> Qj) FOR EACH x IN the alphabet of FST2 NOT IN the alphabet of FST1.

For the pair harmonization, the expansion differs from the above so that we add pairs x:y (instead of identity pairs x:x) from the difference of pair alphabets of the two FSTs.

See the end of HfstOtherSymbol for additional details.


other in FSAs

Suppose we treat the other i.e. ? as an ordinary symbol. If two automata A and B have an identical alphabet of known symbols. Suppose also that both A and B use the ? for other symbols which do not have explicit transitions in them, e.g. A which accepts anything which does not start with an a, and B which accepts anything which does not start with b or bb.

A: 1. ? -> 2
   2: a -> 2,  ? -> 2

B: 1: ? -> 2
   2: ? -> 3
   3: b -> 3,  ? -> 3

Suppose we would make an intersection of A and B. The symbol b becomes known in A and the symbol a in B. We need to expand the ? with these:

A: 1: ? -> 2, b -> 2
   2: a -> 2,  ? -> 2, b -> 2

B: 1: ? -> 2, a -> 2
   2: ? -> 3, a -> 3
   3: b -> 3,  ? -> 3, a -> 3

Intersecting these two augmented automata needs no special treatment for ?:

A: 1: ? -> 2, b -> 2
   2: a -> 2,  ? -> 2, b -> 2

B: 1: ? -> 2, a -> 2
   2: ? -> 3, a -> 3
   3: b -> 3,  ? -> 3, a -> 3

A&B: 11: ? -> 22
     22: a -> 23, ? -> 23
     23: a -> 23, b -> 23, ? -> 23

Thus, it looks like the only thing is needed for FSAs with ? is that we harmonize the alphabets of the two FSAs by adding some transitions. All labels of the other automata which are not part of this automaton, need to be added as alternatives for each ? transition. After harmonizing we should be able to concatenate, union and intersect automata with ? symbols with no special handling for the ?.

other with FSTs

It seems likely that the same technique of expanding and harmonizing of FST alphabets prior to normal unions, concatenations and compositions would succeed just like it does with FSAs. The only problem appears to be how to interpret the other symbol (or ?) in FST contexts.

Replace rules need only to characterize the changing part and the context which is explicitly mentioned. By definition, replace rules leave everything else intact. Thus all unmentioned, i.e. unknown symbols are well tolerated and they correspond to the very same symbol (no matter what it was). With replace rules we hardly need a distinction between the input and output alphabets (both are open). We might expect the following concepts for other and sample representations using ? :

  1. ? is X:X where X is any symbol of the (open-ended input) alphabet not mentioned elsewhere in this FST.
  2. ?:a is X:a where X is any symbol of the alphabet not mentioned elsewhere in this FST. Any such symbol is mapped into a.
  3. a:? similarly, but a is mapped to all such symbols. (Less useful but possible use.)
  4. ?:? is less useful or possibly not needed at all. (Comments please!) If the cross-product interpretation of X:Y (where X and Y are any and possibly distinct symbols not mentioned) is not needed, ?:? in transitions could be used to denote ? as X:X (the above case 1).

A test case:

  1. a:? -> 2
  2. ?:b -> 3
  3:
What would this FST do with input "ab" and what with "ac"? Would it accept a relation "a:c c:b" (accept), "a:a b:b" (accept?) and/or "a:b a:b" (accept??)? (Comments please!) Note that a:? can only be in the center of a replace rule, not elsewhere. (Center is the part which is rewritten by the replace rule and the only part which might change.)

other with the identity constraint

Speech bubble If the only way we need to combine replace-rules is by composition, perhaps the interpretion X:X of ?:? is sufficient. Indeed this might render the other interpretation X:Y meaningless. If we want to use conjunction, the interpretation X:Y might become sensible, i.e. allow any replacements, that other rules may make, which don't effect characters mentioned in this rule. In this case composition and conjunction seem equivalent to me, though.

-- MiikkaSilfverberg - 24 Sep 2008

It seems that both other (other with identity) and other:other are both needed at least in replace and markup rules. The first use is obvious, but the second is motivated by the interaction between contexts that contain other and centers that contain other, but on different sides. In addition, both are needed in generalized two-level rules. However, this does not cancell other good ideas on this page.

other in FSTs which relate representations

Two-level rule grammars typically relate two representations. These representations may have different and even disjoint alphabets. It is less clear whether really unknown symbols are relevant. Often the input and output alphabets are fixed and known in advance. There is, still, a certain need for the concept of other symbol. The other symbol is more relative in such a context:

  1. ? and ?:? could likely mean the same thing, i.e. pairs X:Y which belong to the allowed set of pairs of input and output symbols but which have not been mentioned explicitly in this FST. Harmonizing could mean the introduction of transitions with those pairs known in the other FST which are unknown in this FST.
  2. ?:a could mean the pairs X:a in the allowed set of pairs not explicitly mentioned in this FST.
  3. a:? could mean the pairs a:X in the allowed set of pairs not explicitly mentioned in this FST.

-- KimmoKoskenniemi - 11 Sep 2008

I do not understand why two-level grammars should be considered as those used in morphology only. Such fixed notion of two-level grammars may be the common perception, but if we do not change this then no one else will do that. We need to look for algebraic generalizations such as our generalized two-level grammars (GTWOL) and bracketed GTWOL. When they are mentioned there is no more place for such arguments that "replace rules are more flexible" or that "they are superior to two-level rules". The fact is that replace rules cannot emulate GTWOL rules, but all replace rules can be implemented through BGTWOL rules. Thus, which on is more general and which reduces to which.

Said this, I also see the point in what you say: replace rules specify typically one contextual edit in the whole distortion of the input that happens when a string is fed to a cascade of e.g. replace rules implementing rules of generative phonology. Because the rule is responsible for only one alternation only, the unaltered parts are simply copied. The same does not hold for the whole cascade as the cascaded topology is only an alternative for the parallel topology of phonological grammars. As to the use of the other symbol, a single two-level rule assumes less on those portions in the string pair where the rule does not apply.

In particular, a two-level rule allows arbitrary changes in those parts where the rule does not apply, while a replace rule does the contrary.

-- AnssiYliJyra - 25 Sep 2008

Speech bubble If a rule mentions a:b, do all pairs a:X (where X hasn't been mentioned in the rule) belong to ?:? ? Consider the following two-level rule

a:b <= _ c ;
If the pairs a:X don't belong to ?:? then I think you could compile the rule into a transducer
1: ?   -> 1
   a:? -> 2
   ?:b -> 1
   ?:c -> 1
   c:? -> 1
   a:b -> 1
   c   -> 1
2: ?   -> 1
   a:? -> 2
   ?:b -> 1
   b:? -> 1
   c:? -> 1
   ?:c -> 1
   a:b -> 1
Is it possible to compile this rule without referring to each pair a:X individually, unless ?:? excludes a:X (perhaps this is a minor consideration)? Can this effect harmonization? Actually, one can compile the rule simply
1: ?   -> 1
   a:b -> 1
   a:? -> 2
   c   -> 1
2: a:b -> 1
   a:? -> 2
   ?   -> 1
since ?:? will match exactly those pairs, that haven't been referred to in the fst, i.e. all pairs except a:b, c and a:? (among others all b:? and ?:b pairs [ except a:b ] ).

-- MiikkaSilfverberg - 24 Sep 2008

Conclusion

The other symbol perhaps could be treated just like any normal symbol in FST operations, if we adopt an alphabet harmonization as a necessary preliminary step. There are two kinds of harmonization:

  • a symbol harmonization for replace FSTs which expands identity pairs and other pairs of symbols relative to the (open ended) alphabet, and
  • a pair harmonization for two-level rules or other relation, which expands relative to the set of allowed pairs of symbols.

Both types of harmonization split the various kinds ?:?, a:? and ?:a transitions by adding concrete transitions (starting and ending in the same states as these other transitions). Suppose we have FSTs A and B and their sets of label pairs are P(A) and P(B) respectively. Furthermore, let I(A), I(B) and O(A), O(B) stand for the input and output alphabets of A and B, and S(A) and S(B) stand for the common part of the input and output alphabets, i.e. I(A) ∩ O(A) (but usually the input and output alphabets are the same). In these alphabets, we treat ? as an ordinary symbol in the input or output alphabet. If the alphabet does not contain ?, then it is closed, and harmonization will not expand transitions of that FST.

The pairs to be added to A in symbol harmonization are as follows (Please, comment and check!):

  • for each ? transition, add: x:x FOR ALL x IN S(B) - (S(A) ∪ ?)
  • for each a:? transition, add a:x FOR ALL x IN O(B) - (O(A) ∪ ?)
  • for each ?:a transition, add x:a FOR ALL x IN I(B) - (I(A) ∪ ?)

In pair harmonization, we add to A (Please, comment and check!):

  • for each ?:? transition, add x:y FOR ALL x:y IN P(B) - (P(A) ∪ ?:?)
  • for each a:? transition, add a:x FOR ALL a:x IN P(B) - (P(A) ∪ ?:?)
  • for each ?:a transition, add x:a FOR ALL x:a IN P(B) - (P(A) ∪ ?:?)

For the purposes of composition A .o. B, one might need and define some further harmonizations. Harmonization for composition could assert that O(A) and I(B) would be mutually extended according to symbol harmonization in the above manner. -- KimmoKoskenniemi - 11 Sep 2008

I support Kimmos proposal on two harmonizations (symbol and pair harmonization), but in the sense that they are actually restrictions of a more general harmonization that copes with both. Krister and Miikka have also made great work in definitions and commenting!

Kimmo's proposal might lead to inquiring what metasymbols are involved in each compilation formula (for replace, twol rules etc.). -- AnssiYliJyra - 26 Sep 2008


-- KimmoKoskenniemi - 28 Sep 2008
Topic revision: r2 - 2008-10-01 - AnssiYliJyra
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback