HFST: Use of other and Similar Metasymbols in Symbol (Pair) Calculus

This page is withdrawn from further processing. I started again with HfstOtherAsSimpleSymbols -- AnssiYliJyra - 29 Sep 2008.

Basic Definitions

To implement a restriction from the set of possible symbols to the sets of symbols used in a language L, we need alphabets.

Definition. An alphabet is a structure A=(DomainOfDefinition,Blocks,ExplicitBlocks,NullableBlocks,DisallowedBlocks) whose members are defined as follows:

  • DomainOfDefinition: the set of possible symbols (e.g. the set of strings),
  • Partition: a partition of DomainOfDefinition (corresponds closely to a key table),
  • ExplicitBlocks: blocks whose content is easily listed (the keys in a key table),
  • NullableBlocks: blocks containing symbols whose free occurrences in strings carry no difference to the meaning of the strings: If L is a language over A and 0∈∪B∈NullableBlocksB, then the following statements are equivalent: xy∈L and x0y∈L.
  • DisallowedBlocks: blocks that cover symbols whose occurrences in strings indicate that the string does not belong to the language L (compare: a key table can contain a key that is not used in transitions).

Why NullableBlocks and DisallowedBlocks are sets of blocks? In the case of one-tape alphabets they are actually empty or singletons. For pair alphabets, it is necessary to define at least two implicit blocks (for identity pairs and for irreflexive pairs). In closed alphabets, these implicit blocks are among the disallowed blocks. Alternatively, it is possible to mention symbols that are disallowed, while the open-class symbols are used in strings. Thus, it is possible that 2 blocks are disallowed. The same holds for the empty string: in addition to the block that constains the usual zero symbol, an open-class of symbols can be treated as invisible symbols.

An alternative to the current structure is one that assumes that open-class symbols share their block with some listed symbols. Such a represention has some advantages and disadvantages.

Definition. A simple alphabet is an alphabet QA1×A2=(DomainOfDefinition,Partition,OtherBlocks,NullableBlocks,DisallowedBlocks) where

  • the cardinality of ExplicitBlocks, NullableBlocks, and DisallowedBlocks is 1.

Definition. A pair alphabet is an alphabet PA1×A2=(DomainOfDefinition,Partition,ExplicitBlocks,NullableBlocks,DisallowedBlocks) where A1 and A2 are alphabets, and DomainOfDefinition&61;DomainOfDefinitionA1×DomainOfDefinitionA2.

Example 1. The language a b* c use the alphabet ({0,a,b,...}, {{0},{a},{b},{c},{d,e,...}}, {{0},{a},{b},{c}}, {{0}}, {{d,e,...}}).

Example 2. The language ? a ?:? use the alphabet ({0,0:a,...,0:z,a:0,a,a:b,...a:z,b:0,b:a,b,b:c,...,z:y,z}, {{0},{a}, a:?∪?:a∪?∪?:? }, {{0},{a}}, {{0}}, {}).

Derivable Properties of Alphabets

In every alphabet A,

  • the KnownAlphabet of A is defined as B∈ExplicitBlocksB
  • the UselessAlphabet of A is defined as B∈DisallowedBlocksB,
  • the NullableAlphabet of A is defined as B∈NullableBlocksB, and
  • the UnknownAlphabet of A is defined as DomainOfDefinition - KnownAlphabet,
  • the UsefulAlphabet of A is defined as DomainOfDefinition - UselessAlphabet,
  • the LetterAlphabet of A is defined as DomainOfDefinition - NullableAlphabet,
  • the Blocks of A is the set of blocks in the Partition,
  • the ImplicitBlocks of A is defined as Blocks - ExplicitBlocks,
  • the AllowedBlocks of A is defined as Blocks - DisallowedBlocks,
  • the LetterBlocks of A is defined as Blocks - NullableBlocks,
  • the mapping DomainOfDefinition→Blocks defined by A is denoted by fA.

In every pair alphabet P, set ImplicitBlocks is covered by the following disjoint sets of blocks:

  • IdentityOtherBlocks = {{a:b| a=b ∧ a:b∈UnknownAlphabetA1 ∧ b∈UnknownAlphabetA2}} - {{}},
  • IrreflexiveOtherBlocks = {{a:b| a≠b ∧ a∈UnknownAlphabetA1 ∧ b∈UnknownAlphabetA2}} - {{}},
  • OtherIndentityPairBlocks = {{a:b| a=b ∧ a:b∈UnknownAlphabet ∧ a:b∉IdentityOtherBlocks }} - {{}},
  • OtherIrreflexivePairBlocks = {{a:b| a≠b ∧ a:b∈UnknownAlphabet ∧ a:b∉IrreflexiveOtherBlocks }} - {{}}.

Why there are two or more other-symbol blocks?

  • two blocks are necessary because we want to refer to unknown identity-pairs (idenitity-other metasymbol and IdentityOtherBlock block) in separation from other unknown pairs with knwon or unknown symbols
  • NOTE! The number of implicit other-blocks may be exactly two like proposed by Kimmo. This make convey some transducers in a bigger form, but we do not know how often this happens.

The Alphabet Lattice

Definition. An alphabet A1 is a refinement of an alphabet A2, i.e. A1≤A2, if the alphabets share DomainOfDefinition and there is a mapping r:PartitionA1→PartitionA2 such that fA1(a) = fA1(b) implies r(fA1(a))=r(fA1(b)) i.e. every block in A1 is contained to a block in A2.

Lemma 1. Given two alphabets A1 and A2 with the same DomainOfDefinition, we can always compute alphabet A1∧A2 such that A1∧A2 ≤ A1 and A1∧A2 ≤ A2.

Lemma 2. Given two alphabets A1 and A2 with shared DomainOfDefinition, we can always compute alphabet A1∨A2 such that A1 ≤ A1∨A2 and A2 ≤ A1∧A2.

Corollary. If we fix the DomainOfDefinition in alphabets, there is a lattice (DomainOfDefinition,≤). Element A1∨A2 is called the meet (aka greatest lower bound or infimum) of A1 and A2, and element A1∧A2 is called the join (aka least upper bound or supremum) of A1 and A2. The lattice ({1,2,3,4},≤) of alphabets can be illustrated by the following Hasse diagram:

Currently, we do not know any algorithms that would use the meet of alphabets. Such could be useful, however, in regular language approximations using covers.

Term Definition. Computing the join of two alphabets is called harmonization of alphabets.

Application of the principle of divide-and-conquer: Mutual harmonization of pair alphabets_ Pa, Pb can be seen as an operation where every member of both alphabets is harmonized. Harmonization of transducers with alphabets Pa, Pb can be seen as an operation where both alphabets are harmonized and then the labels of transitions in both transducers are harmonized

Extending the Definition of Alphabets with Metasymbols

Definition. Let AlphabetsDomainOfDefinition be the set of alphabets definable over some common DomainOfDefinition. A function f: AlphabetsDomainOfDefinition→2DomainOfDefinition is called a metasymbol.

  • Metasymbol f is join-distributive if if it satisfies the condition f(A1∧A2) = f(A1)∩f(A2).
  • Metasymbol f is meet-distributive if if it satisfies the condition f(A1∨A2) = f(A1)∪f(A2).

Definition. Denote the set of metasymbols over the alphabet of DomainOfDefinition by MetaDomainOfDefinition.

Definition. Let M1⊆MetaDomainOfDefinition and M2⊆M_1. An elimination of metasymbols &Delta=M1 - M2 is a function elimΔ,A:2MetaDomainOfDefinition→2MetaDomainOfDefinition that returns for any metasymbol m∈M1 and alphabet A a set of metasymbols M⊆M2. There are metasymbols that generally cannot be eliminated by listing the covered concrete pairs.

The following important eliminations use the following equivalences that assume a fixed simple alphabet A:

meta symbol regexp network the semantics in terms of simpler meta symbols
known-symbol     ExplicitBlocks
unknown-symbol   ? ImplicitBlocks
epsilon-symbol 0 0 NullableBlocks
letter-symbol     LetterBlocks
useless-symbol     DisallowedBlocks
useful-symbol     AllowedBlocks

The following important eliminations use the following equivalences that assume a fixed alphabet P:

support meta symbol regexp network the semantics in terms of simpler meta symbols
blocks known-pair .   ExplicitBlocks
unknown-symbol   ? ImplicitBlocks
identity-pair-of-other-symbols   ? IdentityOtherBlock
irreflexive-pair-of-other-symbols   ?:? IrreflexiveOtherBlock
other-identity-pair   <?> OtherIdentityPairBlock
other-irreflexive-pair   <?:?> OtherIrreflexivePairBlock
epsilon-pair 0:0
letter-pair     LetterBlocks
useless-pair     DisallowedBlock
useful-pair     AllowedBlocks
lists known-identity-pair     { a:b | a=b      ∧ a:b∈KnownAlphabet }
known-irreflexive-pair     { a:b | a≠b ∧ a:b∈KnownAlphabet }
a:known-pair a:.   { a:b | a:b ∈ KnownAlphabet }
known-pair:a .:a   { b:a | b:a ∈ KnownAlphabet }
a:irreflexive-pair-of-known-symbols <?>:a   { a:b | a≠b ∧ b ∈ A2.!KnownAlphabet }
irreflexive-pair-of-known-symbols:a <?>:a   { b:a | a≠b ∧ b ∈ A1.!KnownAlphabet }
a:irreflexive-pair-of-other-symbols   a:? { a:b | a≠b ∧ a:b ∈ A2.!UnknownAlphabet }
irreflexive-pair-of-other-symbols:a   ?:a { b:a | a≠b ∧ b:a ∈ A1.!UnknownAlphabet }
metasyms other-pair     {other-identity-pair,other-irreflexive-pair}
pair-of-other-symbols   ?∪?:? {identity-pair-of-other-symbols,irreflexive-pair-of-other-symbols}
mixed-other-pair   <.> {other-pair,pair-of-other-symbols}
any-identity-pair ?   {known-identity-pair, identity-pair-of-other-symbols, other-identity-pair}
any-irreflexive-pair ?:? - ?   {known-irreflexive-pair, irreflexive-pair-of-other-symbols, other-irreflexive-pair}
any-pair ?:?, <.>   {any-identity-pair,any-irreflexive-pair}
a:any-pair a:?   {a:a, a:irreflexive-pair-of-known-symbols, a:irreflexive-pair-of-other-symbols}
any-pair:a ?:a   {a:a, irreflexive-pair-of-known-symbols:a, irreflexive-pair-of-other-symbols:a}

The metasymbols that do not look like pairs can be converted to pairs using the convention x=x:x.

Harmonization of Alphabets through Elements of Alphabets

It appears to me that it is slightly easier to consider first individual pairs and their combinations and then to go into the generalizations that iterate over all pairs in two transducers. I pursue toward such calculus of pais in this section. I call the calculus "symbol pair calculus".

Definition. Let E be the union of alphabets Esym and Emeta. Let Emeta be {other-pairs, idenitity-other-relation, irreflexive-other-relation}. The set of all abstract symbol pairs over E, P(E), is defined to contain the following:

  • not-available:not-available (used when an operation fails, e.g. if we complement <.> we should get nothing, i.e. the empty set)
  • other-pairs:other-pairs (used to refer to the complement of the known set of pairs)
  • identitity-other:identitity-other
  • irreflexive-other:irreflexive-other
  • a:a, a:other-pairs and other-pairs:a, a:irreflexive-other and irreflexive-other:a, where a∈ Esym
  • a:b where a,b∈ Esym.
No other pair a:b where a,b∈ E is an abstract symbol pair.

Definition. Let Pa, Pb, and Pc be sets of listed abstract symbol pairs over E. Let a:a'∈Pa, b:b'∈Pb and c:c'∈Pc. The following operations are called abstract symbol-pair operations. The result of the operations are sometimes set-valued.

  • C(a:a') where a:a'∈Pa (close the alphabet by removing =other=-like symbol)
  • NPb(a:a') where a:a'∈Pa (used to compute complement)
  • KPa(a:a') where a:a'∈Pa (applied when the set of possible pairs that need to be listed is fixed)
  • HPa,Pb(a:a') where a:a'∈Pa (needed in harmonization)
  • a:a' ∩ b:b' (needed in intersection and determinization)
  • a:a' ∪ b:b' (needed when computing equivalent symbol pairs or harmonized alphabets)
  • a:a' - b:b' (needed when computing difference)
  • a:a' xor b:b' (needed when computing exclusive or of languages)
  • a:a'  o  b:b' (needed in composition)
  • a:a'  o  b:b'  o  c:c' (needed in three-way composition that is more efficient than two compositions)
  • I(a:a') and O(a:a') (needed in projection)
  • I-1(a') and O-1(a) (needed in oriented contexts and surface coercion of GTWOL, useful also for computing crossproduct)
  • V(a:a') (needed in inversion)

-- AnssiYliJyra - 26 Sep 2008


Lemma 1. Given two alphabets A1 and A2 with the same DomainOfDefinition, we can always compute alphabet A1∧A2 such that A1∧A2 ≤ A1 and A1∧A2 ≤ A2.

Proof of Lemma 1. Construct A1∧A2=(DomainOfDefinition, PartitionA1∧A2, ExplicitBlocksA1∧A2, NullableBlocksA1∧A2, DisallowedBlocksA1∧A2) as follows:

  • PartitionA1∧A2= ∪B1∈PartitionA1B2∈PartitionA2 { B1∩B2 } - { {} }
  • ExplicitBlocksA1∧A2= (∪B1∈ExplicitBlocksA1B2∈PartitionA2 { B1∩B2) }) ∪ ∪B1∈PartitionA1B2∈ExplicitBlocksA2 { B1∩B2) } - { {} }
  • NullableBlocksA1∧A2= ∪B1∈NullableBlocksA1B2∈NullableBlocksA2 { B1∩B2 } - { {} }
  • DisallowedBlocksA1∧A2= ∪B1∈DisallowedBlocksA1B2∈DisallowedBlocksA2 { B1∩B2) } - { {} }.

Lemma 2. Given two alphabets A1 and A2 with shared DomainOfDefinition, we can always compute alphabet A1∨A2 such that A1 ≤ A1∨A2 and A2 ≤ A1∧A2.

Proof of Lemma 2. The alphabet of universal language induces the coarsest possible alphabet (DomainOfDefinition,{DomainOfDefinition},{DomainOfDefinition},{},{}}. The existence of this alphabet suffices. Q.E.D.

Topic revision: r7 - 2008-09-29 - AnssiYliJyra
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback