HFST: "Other" symbols in use

Open Input and Output Alphabets in Two-Level Grammars Is a Good Thing

Actually, it might be a good idea to allow two-level rules to have an open input and output alphabet which would let the lexicographer to add entries which contain any Unicode characters. Then the rule component should have an option for identity pairs. Possibly, this could be implemented using a special symbol (maybe even the ?) and to use the pair harmonization while compiling and joining the rules, and then symbol harmonization in order to let the rule component allow for any symbols which might be present in the lexicon. (Please, comment and check!)

-- KimmoKoskenniemi - 11 Sep 2008

other in replace rules

Summary: Replace rules need non-identity pairs when compiled with Generalized Two-Level Grammars or the GR operation.

Methods that have been used to compile replace rules indeed assume that no change takes places in substrings where no replace rule is applied. From this, it follows that in the resulting transducer, transitions labeled with other:other occur only inside the "centre". In addition, context conditions of Karttinen's replace rules (Karttunen 1995) and some other rule formalisms are oriented i.e. given as one-tape languages. This would give an impression that ?:? is a strange special case. In the following three counter arguments are given.

  1. In XFST, replace rules can also map unknown symbols to unknown symbols. For example, replace rules like ? -> ? are possible in XFST:
    xfst[0]: read regex [ a b c  .o. [ ? -> ? ] ];     
    304 bytes. 4 states, 12 arcs, 64 paths.
    xfst[1]: print random-words
    If we look the rule ? -> ? more closely, we observe that it results into transducer:
    fs0:  ? -> fs0, <?:?> -> fs0.
  2. While the contexts are oriented, they can still match with regions where changes take place. This is the cause for the inequivalence between rules that use left-oriented, right-oriented, upward oriented or downward oriented rules: depending on the orientation, the rules may be reading one another's output. Thus, the compiled transducer can contain transitions that come from an oriented context condition and a change between input and output substrings. For example, if the language used in a right context condition on the input tape is specified as a ?* b it will match the set of pair-symbol strings that could be expressed by ?:0* a:? ?:?* b:? ?:?*. This kind of pair-symbol transducers are obtained by the inverse of the projection operator, which has been implemented in AFST but is not known to exist in other programming languages. This operation is employed in by Yli-Jyrä and Koskenniemi (2006, 2007) and Yli-Jyrä (2007a, 2007b).
  3. The method of Yli-Jyrä (2007) compiles replace rules with oriented contexts first into two-level rules. In addition, it is possible that the contexts of replace rules are specified as two-level contexts over a pair-symbol alphabet. Such contexts have applications in replacement systems that never change the length of strings. When two-level contexts are used with these replace rules, the contexts may correspond to paths that have transitions on ?:? as well as with ?.
  4. An interesting difference between Karttunen's (1995) replace rules and the rules of van Noord and Gerdemann (1998) and Yli-Jyrä (2007) is that Karttunen's replace rules cannot express ? inside the centers. This is one reason for the introduction of markup rules with a separate syntax by Karttunen (1995).
  5. Two-level rules and parallel replace rules converge, which means that their uses also converge (at least by reduction to Generalized TWOL grammars):
    • FSTs from simultaneus applications of rewriting rules was studied also by Johnson (1971). Koskenniemi (1983) suggest that Two-level rules would resemble simultaneus application.
    • Kempe and Karttunen (1996), Skut et al (200x), and Yli-Jyrä (2008) have presented compilation methods for a set of parallel replace rules.
    • Kempe and Karttunen (1996) suggest that parallel replace rules resemble two-level rules to some extent.
    • Yli-Jyrä (2008) show explicitly that obligatory and optional parallel replace rules can be compiled using bracketed generalized two-level grammars. This relationship reduces the difference between parallel replacement rules and parallel constraints. In addition, it suggests that the most important difference between unbracketed generalized two-level grammars and parallel replace rules/bracketed generalized two-level rules is that the applications of the latter rules do not overlap.
    • According to Hulden (2008), replace rules can be compiled with a technique that does not assume bracketing at all. The result is preliminary.
  6. Generalized Two-Level Grammars employ the GR operation that employs negation and diamonded double negation.

-- AnssiYliJyra - 25 Sep 2008

Taxonomy of Harmonization Methods

There are three approaches to harmonizations during composition or intersection operation:
  1. On-Demand Harmonization. The algorithm implementing the operation consults the label sets of each operand FST while combining their labels that are not harmonized earlier.
    1. symbol harmonization
    2. pair harmonization
      1. This approach was used in AFST-20080214 when implementing the intersection operation where the operand FSTs contained ? .x. ? (pair of others with no identity constraint). The following example from conjoin_side in operators.an.C is about intersecting symbol pairs l=l1:u1 and k=l2:u2 that are labels of a pair of intersected transitions.
           if ( l == k )                                /* a:b cap a:b = a:b */
           if ( l == anysymb )                          /* ?:? cap a:b = a:b */
           else if ( l1 == anysymb ) {                  /* ?:a cap ... */
              if ( u2 == anysymb )                        /* ?:a cap  b:? = b:a  */
              if ( u1 != u2 ) continue;                   /* ?:a cap   :b = NONE */                                    
                                                          /* ?:a cap  b:a = b:a  */
           else if ( u1 == anysymb ) {                  /* a:? cap ... */
              if ( l2 == anysymb )                        /* a:? cp  ?:b = a:b   */
              if ( l1 != l2 ) continue;                   /* a:? cap  b:  = NONE */                                    
                                                          /* a:? cap  a:b = a:b  */
    3. identity harmonization
  2. Prepared On-Demand Harmonization. Computing the set of symbol pairs that may be used in the result network.
    1. symbol harmonization
      1. In SFST, this means precomputing the symbols and symbol pairs of the composed FSTs. The transition labels of the new FST use the new key values of the symbols.
      2. In SkeemaParser (1998), this meant precomputing of the symbol classes of the intersected FSA. The transition labels of the new FST use the key values of these new symbol classes.
    2. pair harmonization
      1. In AFST-20080214, the network copying during the concatenation operation uses this pair harmonization approach in the following solution where the diff is a previously computed symbol pair set containing those symbols that were not listed in the current component FST but were listed in the other. This is from Transducer::copy_open_nodes in operators.an.C.
        if ( l == Label::anysymb ) { 
           for( LabelSet::const_iterator it=diff.begin(); it!=diff.end(); it++ ) 
              nnode->add_arc( *it, tn, a );
        } else if ( l.lower_char() == Label::anysymb ) {
           for( LabelSet::const_iterator it=diff.begin(); it!=diff.end(); it++ ) 
              if ( it->upper_char() == l.upper_char() ) nnode->add_arc( *it, tn, a );
        } else if ( l.upper_char() == Label::anysymb ) {
           for( LabelSet::const_iterator it=diff.begin(); it!=diff.end(); it++ ) 
              if ( it->lower_char() == l.lower_char() ) nnode->add_arc( *it, tn, a );
    3. identity harmonization
  3. Preemptive Harmonization. In addition to 2, splitting some transitions in the operand networks before the applying the standard algorithm for these networks. This is the approch proposed in this paper and its advantage is that it is independent from the algorithms that combine the FSTs because it harmonizes FSTs separately.
    1. symbol harmonization (see above)
    2. pair harmonization (see above)
    3. identity harmonization
  4. Implicit Harmonization. The network uses abstract operations for intersecting two pairs. By defining the abstract operations on such pairs suitably, we can define the semantics of such pairs.
-- AnssiYliJyra - 25 Sep 2008
Topic revision: r1 - 2008-10-01 - AnssiYliJyra
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback