HFST: Identity-Other and Nonidentity-Other interpreted as simple symbols

This page is an extension of page HfstOtherAsSimpleSymbol contributed by Kimmo Koskenniemi. This page is being completely rewritten elsewhere, in HfstOtherAsSimpleSymbol.

This page is devoted to a particular and simple interpretation for identity-other (alias = or =:=) and nonidentity-other (alias ? as a member of pairs such as ?:?, ?:a, and a:?) and the corresponding implementation and consequences. The interpretation is in short:

Overall properties

  1. "other" is interpreted against the known or listed pairs rather than known or listed symbols.
  2. The symbols = and ? make the set of listed pairs complete:
    • The symbol = is used for those pairs a:b that satisfy a=b and that have otherwise not been listed.
    • The symbol ? is used for those pairs a:b that satisfy a≠b and that have otherwise not been listed or covered by more restricted means.
    • If the set of listed pairs is closed under projection (a:b→a:a∧b:b), the interpretation of = is similar to transition label ? in XFST transducers.
    • If the set of listed pairs is also closed under Cartesian product of the input and output alphabets, the interpretation of ?:? is similar to transition label ?:? in XFST transducers.
  3. In FSTs, symbols = and ? are like any other symbols in their input and output alphabets. FSAs neither use ? nor pairs a:b that satisfy a≠b.
  4. The interpretation of symbols = and ? are external to the machine and thus not coded in the transition or elsewhere in the machine.
  5. The (pair) alphabet can contain pairs that are not used in transitions (e.g. as the result of the compilation of expression [^a] into FSA).

A Non-Trivial Observation

  • The output projection of relation { ?:? ?:A B:B } is { ? A B, B A B}.
    • In the relation { ?:? ?:A B:B }, symbol pair ?:A corresponds to all pairs a:A where a≠A, and
      symbol pair ?:? corresponds to all those pairs a:b, a≠b, not covered by pairs ?:A and B:B.
    • In particular, pair ?:? covers pairs ?:?, ?:B, and B:?, but neither ?:A (because this was listed) nor A:A (because ?:? covers only non-identity pairs).
    • This causes the production of two strings in the output projection.
  • Two things happen in in the alphabet of FST1 under composition FST1 o FST2.
    1. Because the class denoted by ?:? contains some pairs with known symbols, they need to be "extracted" from ?:? into pairs that contain such symbols in the output tape.
      • For example, if the pair alphabet of FST1 is { {?:?} {?:A} {B:B} }, it becomes extended as { {?:?,?:B} {?:A} {B:B} }.
      • This enforces the following property: The resulting pair alphabet does not contain pair ?:? or it contains a pair a:b only if it also contains pair ?:b.
      • If FSTs are represented in a tabular format where each column corresponds to different pairs used as labels, then = and ?:? could be interpreted as the otherwise cases. By iterating over all the possible pairs a:b and looking for the most specific column for the pair, we find transitions for every a:b. For example, if we have column set {{B:B}, {?:A} {?:?} } (note the order of columns), then C:C and C:B belong to the last column, but C:A belongs to the center column.
    2. The join of output alphabet of FST1 and input alphabet of FST2 is computed. For example, if the pair alphabet of FST1 is { {?:?,?:B} {?:A} {B:B} } and the pair alphabet of FST2 is { {A:A}, {B:B}, {C:C} }, the pair alphabet of FST1 becomes { {?:?, ?:C}, {?:B}, {?:A}, {B:B} }.

Operations of FSAs and FSTs

  1. Iteration, inversion and reversion of FSTs (and FSAs if applicable) are produced without special treatment of symbols = and ?.
  2. If the alphabets of FSAs including (=) or the pair alphabets of FSTs are identical (including = and pairs with ?), then
    1. the FST/FSA operations of union, concatenation, shuffle (and intersection if applicable) apply (without any special treatment of symbols = and ?).
    2. the FSA operations of complementation apply (without any special treatment of symbols =).
  3. (check!) Output (input) projection of FST can be computed if the pair alphabet of FST do not use pairs ?:? or it contains a pair a:b only if it also contains pair ?:b (b:?).
  4. Composition of FST1 and FST2 can be computed without special treatment of symbols = and ? if
    1. the output alphabet of FST1 and the input alphabet of FST2 do not use symbol ? and they otherwise agree, or,
    2. (check!) the pair alphabet of FST1 do not use pairs ?:? or it contains a pair a:b only if it also contains pair ?:b and
      the pair alphabet of FST2 do not use pairs ?:? or it contains a pair a:b only if it also contains pair a:?.
  5. If the alphabets of FSAs or FSTs fail to agree as above, the corresponding alphabets must be harmonized by adding some transitions for symbols from the other automaton as described below. The interpretations of symbols = and ? are thus deferred, partly to possible later operations where harmonization is needed, and partly to the execution of a FSA as a recognizer or FST as a transducer.

Benefits

  1. One can use the SFST and OpenFST calculi as they are, no reworking of the internals is necessary.
  2. It will be easier to include further finite-state engines in HFST in future.
  3. Two-level rules (and other relating transducers) will be freed from the requirement of a predefined set of allowed pairs.
  4. The core of the HFST interface will be simpler, more general, and straight-forward.
  5. All stages refer to finite and known characters. No need to operate with infinite or truly unknown symbols (in definitions and operations) until at run-time when it is no more a problem (just match the symbol not in the known alphabet with a ? transition and a = transition).
  6. XFST behaviour is close enough.
  7. A complete alphabet stays complete even after harmonization.
  8. Only one kind of harmonization. No separation with pair harmonization and symbol harmonization.
  9. The symbol ? has only one interpretation.
  10. More flexibility to combine composition and intersection topologies.
  11. Concrete pairs such as a:? often provide an efficient representation for the combinations of input symbol a with all known output symbols, whereas XFST needs always heavy quadratic listings.
  12. In contrast to XFST, does not assume that the pair alphabets are closed under projection, which gives the freedom to choose different input and output alphabets.
  13. When the known pairs are listed in the pair alphabet, there is no need to use transitions that combine known symbols but behave like = or like a ? -combination. This can reduce the number of transitions significantly in comparison to XFST.
  14. The disjunction of = and ? equals to the meaning of <.> in AFST, which conveys more compatibility.
  15. It would be possible and enjoyable to contruct proofs that show that the sufficient conditions are maintained during harmonization and operations.
  16. It seems that even complementation of two-tape FSAs (special FSTs) preserves the pair alphabet as compact and is thus efficient as required by the TWOL-level rule compilation methods.
  17. When we use two different symbols = and =?, we do not need to consider ? different from ?:? as done in XFST networks.
  18. When we use two different symbols = and =?, we can maintain the identity constraint of = even when used as =:= in labels of OpenFST and SFST transducers.
  19. The "ANY" symbol or "ANY:ANY" pair that are often needed in regular expressions compile from elementary expressions to an FSA with transition on symbol = and an FST with parallel transitions on pairs =:= and ?:?. Other uses of "ANY" are obtained by the correspondence between the operations in expressions and operation on networks.

Harmonization of FSAs

  • To harmonize FSA1, compute S12 as the difference of the concrete alphabet of FSA2 and FSA1 (counting = as a single symbol).
  • FOR EACH transition t from Qi to Qj in FSA1 with label =, add a transition from Qi to Qj with each label in S12.
  • Note that if there are no = transitions in FSA1, nothing will be added.
  • Harmonize FSA2 similarly.

Harmonization of FSTs for intersection, concatenation and union

  • To harmonize FST1, compute P12 as the difference of the concrete pair alphabet of FST2 and FST1 (counting =:&#61=s, =?:?, a:?, ?:a as single pairs).
  • FOR EACH transition t from Qi to Qj in FST1 with label =, add a transition from Qi to Qj with each label a:b∈P12, where a=b and a≠?.
  • FOR EACH transition t from Qi to Qj in FST1 with label a:b containing ?, add a transition from Qi to Qj with each label a:b∈P12, where a≠b or a=? or b=?.
  • Note that if there are no = in transitions of FST1, nothing will be added.
  • Harmonize FST2 similarly.

Harmonization of FSTs for output projection

  • Clarification: output projection removes the input tape and replaces it with a copy of the output tape (the result is an FST), or it simply removes the input tape (the result is an FSA). Input projection is defined in a similar way.
  • To harmonize FST1:
    1. Compute the set S1 of those symbols a that are output symbols of FST1 but whose pair ?:a is not in the pair alphabet of FST1. (For such symbols, the pair ?:? in the pair alphabet covers the combination ?:a.)
    2. FOR EACH transition t from Qi to Qj in FST1 with label ?:?, add a transition from Qi to Qj with each label ?:a, where a∈S1.
      • This is motivated by the fact that ?:? is interpreted externally by the harmonization.
      • Pair ?:? shoud act as ?:a unless ?:a is mentioned in the pair alphabet of FST1.
      • Note that if there are no ?:? in transitions of FST1, nothing will be added.
    3. FOR EACH transition t from Qi to Qj in FST1 with label a:?, add a transition from Qi to Qj with each label a:b where a≠b, a,b≠? and a:b was not in the pair alphabet of FST1 prior to this harmonization.
      • This is motivated by the fact that a:? stands for a:b if a:b is not explicitly listed in the pair alphabet.
      • Note that if there are no a:? in transitions of FST1, nothing will be added.

Harmonization of FSTs for composition

  • To harmonize FST1:
    1. Harmonize FST1 for output projection.
    2. Compute S21 as the difference of the concrete input alphabet of FST2 and output alphabet of FST1 (counting = as a single symbol).
    3. FOR EACH transition t from Qi to Qj in FST1 with label =:=, add a transition from Qi to Qj with each label a:a such that a∈S21.
      • Note that if there are no =:= in transitions of FST1, nothing will be added.
  • To harmonize FST2:
    1. Inverse FST1 and FST2, and exchange the names FST1 and FST2, harmonize FST1, exchange the names of FSTs again, inverse FSTs again.

Harmonization of FSTs for composition (Optimized)

  • To harmonize FST1:
    1. Compute S21 as the difference of the concrete input alphabet of FST2 and output alphabet of FST1 (counting = as a single symbol).
    2. Compute the set S1 of those symbols a that are output symbols of FST1 and input symbols of FST2 but whose pair ?:a is not in the pair alphabet of FST1.
    3. FOR EACH transition t from Qi to Qj in FST1 with label =:=, add a transition from Qi to Qj with each label a:a such that a∈S21.
      • Note that if there are no =:= in transitions of FST1, nothing will be added.
    4. FOR EACH transition t from Qi to Qj in FST1 with label ?:?, add transition from Qi to Qj with each ?:a such that a∈S1.
      • Note that if there are no ?:? in transitions of FST1, nothing will be added.
    5. FOR EACH transition t from Qi to Qj in FST1 with label a:?, where a≠?, add a transition from Qi to Qj with each label a:b where a≠b and a:b was not in the pair alphabet of FST1 prior to this harmonization.
  • To harmonize FST2:
    1. Inverse FST1 and FST2, and exchange the names FST1 and FST2, harmonize FST1, exchange the names of FSTs again, inverse FSTs again.

Debugging this page

This page is not fully proof-read. Please report any error, edit the page or add your comments below:


-- AnssiYliJyra - 28 Sep 2008
Topic revision: r9 - 2008-10-01 - AnssiYliJyra
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback