This page is for discussing proposals for handling special symbols and alphabet-affecting operations in HFST.
I have only few general ideas to record: * The user-facing tools should work on principle of least surprise, e.g. if you ask to remove symbol a totally from an automaton, there should be no way for automaton to match that a anymore * The internal symbols should not be end-user's (as in, command-line user's) concern, Internal characters that work magically need to be supported under the hood and prevented from leaking into automata that are ever visible... with the unfortunate exception of flag diacritics * The flag diacritic logic should still be transparented from the user: for each operation OP(t) working on automaton t, it must be that eliminate-flags(OP(t)) = OP(eliminate-flags(t)), and equally for binary operators etc.. Specifically OP : lookdown needs to be like such.

-- TommiPirinen - 2013-02-08


Earlier, when harmonizing a transducer with another transducer that had special symbols (@_.*_@) , the special symbols were used for expanding eg. @_UNKNOWN_SYMBOL_@ during harmonization. It was proposed that this was incorrect, and that such special symbols should, by default, not be used for expanding. They should, however be inserted into the alphabet. This bug is now fixed.

Freely insert

Scenario: a transducer, A, doesn't have "b" in its alphabet. We call
on it. We then substitute "b" with epsilon, prune "b" from the alphabet and minimize the transducer.

Question: Should this total operation always leave A unchanged or not? Consider especially the case where A recognises one @_UNKNOWN_SYMBOL_@.

  • Proposal 1: A should always be unchanged.
  • Proposal 2: After substituting "b" with epsilon, we should expect to see an epsilon path in the transducer.
  • Proposal 3: "b" should have been removed from A, including the implicit "b" in @_UNKNOWN_SYMBOL_@.

Answer: Proposal 2 is the same that is used in XFST: [ ? / b ] (freely inserting b into [?]) yields the transducer [b* [?|b] b*]. Then substituting b with 0 yields the transducer [0* [?|0] 0*] which is equivalent to the minimal transducer [?|0]. So, because freely inserting always harmonizes the transducer in XFST, it is possible that for a transducer Transducer and symbol symbol:

`[ [ Transducer / symbol ] , symbol , 0 ]
is not equivalent to Transducer.


XFST's substitution is defined as follows

`[ [A] , s , L ]
which denotes the language of relation derived from A by substituting every symbol x in the list L, for every occurrence of the symbol s. For example, `[ [a -> b] , b , x y z ] denotes the same relation as [a -> [x | y | z]]. If the list L is empty, all the strings and pair of strings containing the symbol are eliminated. For example, `[ [a | b:c | c] , b , ] is equivalent to [a | c].

Question: Should substitute always cause the symbol being replaced to be removed from the alphabet?

Answer: Yes, at least in XFST's substitution. Other substitution functions (there are many varieties of them in HFST) are not well defined in terms of XFST syntax.

Question: What should the substitute function do if a non-exist symbol is trying to be replaced?

For example:

 `[ [?:a] , m , k ]
which tries to replace symbol m with symbol k in the transducer [?:a].

Answer: XFST doesn't do anything:

s0: a -> fs1, ?:a -> fs1

Question: How is substitution done in XFST?

Answer: Substituting symbol s with a list of symbols L in transducer A:

  • See if s is in the alphabet. If it is not, return and leave A unchanged.
  • Else, remove s from the alphabet and add all symbols in L to the alphabet.
  • For all transitions, substitute s with each symbol in L. Do not perform any expansion of unknown or identity symbols. (If L is empty, remove all transitions where s is either input or output symbol or both.)

Explanation on how this behaviour is achieved in the HFST regexp parser (`[ A , s, L ]).

removed: (1. To avoid harmonization of transducer A with list L, add all the symbols from the list L to the alphabet of the transducer A.)

2. Substitute transitions s:s with L. This step is done with library function HfstTransducer & substitute (const StringPair &symbol_pair, HfstTransducer &transducer, bool harmonize=true), where symbol_pair is s:s, transducer is [ x | y | z ] if L is x y z, and harmonize is false.

Special case: If L is empty, transducer is an empty transducer. In that case, we substitute each symbol pair in A whose input or output symbol is s (including the case s:s) with the symbol pair s:s before calling HfstTransducer & substitute (const StringPair &symbol_pair, HfstTransducer &transducer, bool harmonize=true). After that, we can return.

If L is not empty, we need to handle transitions x:s and s:x where x is any other symbol than s:

3. [ A .o. s -> L ] - for cases when s is on right side

4. [ A.i .o. s -> L].i - for cases when s is on left side

Discovered bugs

Currently, removing a nonexistent symbol from the alphabet results in an ugly crash. -- SamHardwick - 2013-02-07

Acctually, it used to crash when trying to remove symbol from the alphabet which was used in the transducer's transitions. However, it looks like this bug was fixed at some point. Now, when I try to remove symbol from the alphabet, if it is used in the transducer, it is replaced with epsilon. -- SenkaDrobac - 2013-02-12

Removing a symbol from a transducer where it is used does not crash anything, but trying to use the transducer after that (e.g. in a binary function) causes a segmentation fault. I think this is not a problem because the user is advised to use remove_from_alphabet with care. -- ErikAxelson - 2013-03-05

In XFST syntax, [?:?] is the same as [?:? | ?]. Also, [?:a] is the same as [?:a | a]. Both of these bugs are now fixed in HFST xre parser.

Topic revision: r13 - 2013-03-05 - ErikAxelson
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback