Difference: HfstTwolC (1 vs. 75)

Revision 75 2017-03-08 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 607 to 607
 
CONTAINMENT_ONCE $.
CONTAINMENT $
ANY ?
Changed:
<
<
UNION = =
>
>
UNION |
 
INTERSECTION &
POWER ^
DIFFERENCE -

Revision 74 2016-05-18 - KristerLinden

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 771 to 771
  hfst-twolc is a part of HfstCommandLineTools.
Deleted:
<
<

 
Deleted:
<
<
-- MiikkaSilfverberg - 13 May 2008
 
META TOPICMOVED by="KristerLinden" date="1212070743" from="KitWiki.HFSTTwolC" to="KitWiki.HfstTwolC"
Added:
>
>
META PREFERENCE name="VIEW_TEMPLATE" title="VIEW_TEMPLATE" type="Set" value="FinCLARIN.ViewFinClarinWideEngTemplate"

Revision 73 2016-01-14 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 25 to 25
 
-R, --resolve Attempt to resolve left-arrow conflicts between rules. If omitted, left arrow conflicts aren't resolved.
-D, --dont-resolve-right Don't resolve right arrow conflicts. If omitted, right arrow conflicts are resolved.
-w, --weighted Compile the rules into weighted transducers with zero weights.
Added:
>
>
-f, --format FORMAT Store result in format FORMAT.
 
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.
-u, --usage Display usage.
Added:
>
>
FORMAT may be one of openfst-log, openfst-tropical, foma or sfst. The default format is openfst-tropical.
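As a rough illustration of how the options above combine, the following Python sketch assembles an argument list for hfst-twolc. The helper name build_twolc_command and the file names are hypothetical. Only the option names and the FORMAT values are taken from the parameter list above, and the sketch assumes that --output is followed by a file name.

```python
# Hypothetical helper: builds (but does not run) an hfst-twolc command line.
# The FORMAT values are the ones listed in the parameter description above.
SUPPORTED_FORMATS = ("openfst-log", "openfst-tropical", "foma", "sfst")


def build_twolc_command(grammar_file, output_file, fmt="openfst-tropical",
                        resolve_left=False):
    """Return the argument list for compiling grammar_file into output_file."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported FORMAT: " + fmt)
    command = ["hfst-twolc"]
    if resolve_left:
        # -R, --resolve: attempt to resolve left-arrow conflicts.
        command.append("--resolve")
    # Assumption: --output takes the name of the file to write.
    command += ["--format", fmt, "--output", output_file, grammar_file]
    return command


if __name__ == "__main__":
    # File names are placeholders.
    print(" ".join(build_twolc_command("rules.twolc", "rules.hfst", fmt="foma")))
```

Checking FORMAT before the command is built keeps a misspelled format name from reaching the compiler at all.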
 

Outline

Terms and concepts:

Revision 72 2014-02-10 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Revision 71 2013-12-04 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 13 to 13
 USAGE: hfst-twolc [ OPTIONS ] [ GRAMMARFILE ]
Added:
>
>
Note: currently hfst-twolc is hfst-twolc.bat on Windows.
 

Parameters

Parameter name Meaning

Revision 70 2013-03-21 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 20 to 20
 
-o, --output If omitted, the resulting transducer is written to STDOUT.
-s, --silent Don't print any diagnostics messages.
-q, --quiet Don't print any diagnostics messages.
Changed:
<
<
-R, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
>
>
-R, --resolve Attempt to resolve left-arrow conflicts between rules. If omitted, left arrow conflicts aren't resolved.
-D, --dont-resolve-right Don't resolve right arrow conflicts. If omitted, right arrow conflicts are resolved.
 
-w, --weighted Compile the rules into weighted transducers with zero weights.
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.

Revision 69 2012-11-19 - KimmoKoskenniemi

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Revision 68 2012-10-03 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 101 to 101
 
  • The Kleene-star *. The language R* matches any string which is the concatenation of any number of strings from R. Note that the empty string, which is the concatenation of zero strings, is also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator + resembles *, but only matches strings which are the concatenation of a positive number of strings from R. Consequently R+ matches the empty string iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.
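The difference between the two unary operators can be checked with ordinary Python regular expressions, which treat * and + the same way for this simple case. This is only an analogy: the helper name matches is hypothetical, and Python's re module works on plain characters rather than on twolc symbol pairs.

```python
import re


def matches(expression, symbols):
    """Check a whole-string match; twolc separates symbols with spaces."""
    return re.fullmatch(expression, symbols.replace(" ", "")) is not None


candidates = ("", "a", "a a", "a a a")

# a* accepts the empty string, a+ does not.
star_results = [matches("a*", s) for s in candidates]
plus_results = [matches("a+", s) for s in candidates]

if __name__ == "__main__":
    print(star_results)
    print(plus_results)
```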
Changed:
<
<
In addition to unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.
>
>
In addition to unary operators there are four binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.
  Let R and S be regular expressions. The binary operators are:
  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.
Added:
>
>
  • The ignore-operator /. The language R / S matches any string which can be formed from a string in R by freely inserting strings from S. E.g. a+/b matches aba, aa and ab.
  By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators were to bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.
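The binding order and the ignore operator can be illustrated with a small Python sketch. It is an analogy under simplifying assumptions, not a description of how hfst-twolc is implemented: the language a is modelled as a finite Python set, and the hypothetical helper matches_with_ignored only handles the case where S is a single symbol that does not occur in R.

```python
import re

# Model the language "a" as a set containing the single string "a".
a = {"a"}

# Binding from the left: [ a - a ] - a is the empty language.
left_binding = (a - a) - a
# Binding from the right would give a - [ a - a ], which is equivalent to a.
right_binding = a - (a - a)


def matches_with_ignored(pattern, ignored_symbol, string):
    """Approximate R / S: delete the freely inserted symbol, then match R."""
    stripped = string.replace(ignored_symbol, "")
    return re.fullmatch(pattern, stripped) is not None


if __name__ == "__main__":
    print(left_binding, right_binding)
    for candidate in ("aba", "aa", "ab", "b"):
        print(candidate, matches_with_ignored("a+", "b", candidate))
```

The string b on its own is rejected, because removing the inserted b leaves the empty string, which a+ does not match.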
Line: 596 to 597
 
PLUS +
COMPLEMENT ~
TERM_COMPLEMENT \
Added:
>
>
FREELY_INSERT /
 
CONTAINMENT_ONCE $.
CONTAINMENT $
ANY ?

Revision 67 2012-05-30 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 344 to 344
  In a description of the phonology of the Kyrgyz language, the preceding surface vowel determines the realization of the archiphoneme {A}, but the realization also depends on whether an archiphoneme {U} and an optional morpheme boundary follow. The two level rule which governs the realization of archiphoneme {A}, when there is no following archiphoneme {U}, looks like this
 "Vowel harmony for archiphoneme {A}"
Changed:
<
<
{A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;
>
>
{A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ] \[ :0 | %>: ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;
  [ %{A%}:LastVowel ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;

where LastVowel in ( и ү е э ө я а )

Line: 355 to 355
  Using negative contexts, the rule "Vowel harmony for archiphoneme {A}" can be formulated by referring directly to the prohibited right context [ ( %>: ) %{U%}: ]
"Vowel harmony for archiphoneme {A}"
Changed:
<
<
{A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ ;
>
>
{A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ] \[ :0 | %>: ] _ ;
  [ %{A%}:LastVowel ] _ ;

except

Revision 66 2012-04-12 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 766 to 766
 
-- MiikkaSilfverberg - 13 May 2008

Revision 65 2012-02-07 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 20 to 20
 
-o, --output If omitted, the resulting transducer is written to STDOUT.
-s, --silent Don't print any diagnostics messages.
-q, --quiet Don't print any diagnostics messages.
Changed:
<
<
-r, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
>
>
-R, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
 
-w, --weighted Compile the rules into weighted transducers with zero weights.
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.

Revision 64 2011-10-19 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Deleted:
<
<
Will be updated shortly when hfst3 becomes available.
 

Purpose

Line: 23 to 21
 
-s, --silent Don't print any diagnostics messages.
-q, --quiet Don't print any diagnostics messages.
-r, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
Deleted:
<
<
-N, --names If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstPairTest.
 
-w, --weighted Compile the rules into weighted transducers with zero weights.
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.

Revision 63 2011-09-20 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 373 to 373
  The list of negative contexts should follow the list of positive contexts. The list of negative contexts has to start with the keyword except.

Added:
>
>
"Example of rule syntax. x:y in LEFT _ RIGHT except when the context is L1 _ R1 or L2 _ R2"
x:y <=> LEFT _ RIGHT ;
        
    except
        L1   _ R1    ;
        L2   _ R2    ;
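The effect of an except list can be sketched in Python as a check on the strings to the left and right of the center. This is a simplified, single-level analogy with hypothetical helper names: contexts are ordinary Python regular expressions over surface strings, whereas real twolc contexts are regular expressions over symbol pairs. The contexts z _ and z _ z used below are illustrative values.

```python
import re


def context_matches(context, left, right):
    """A context (LEFT, RIGHT) matches if LEFT ends left and RIGHT begins right."""
    left_re, right_re = context
    return (re.search("(?:" + left_re + ")$", left) is not None
            and re.match("(?:" + right_re + ")", right) is not None)


def rule_applies(left, right, positive, negative):
    """True if some positive context matches and no negative context does."""
    allowed = any(context_matches(c, left, right) for c in positive)
    blocked = any(context_matches(c, left, right) for c in negative)
    return allowed and not blocked


# Illustrative contexts: after z, except between two z symbols.
POSITIVE = [("z", "")]
NEGATIVE = [("z", "z")]

if __name__ == "__main__":
    print(rule_applies("az", "b", POSITIVE, NEGATIVE))
    print(rule_applies("az", "za", POSITIVE, NEGATIVE))
    print(rule_applies("a", "b", POSITIVE, NEGATIVE))
```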
 

Regular expression center rules

Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only concern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the affixes consist of multiple symbols. Such phenomena are more conveniently described using rules with regular expression centers.

Revision 62 2011-08-29 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 636 to 636
  matched; ^ HERE Aborted.
Deleted:
<
<
Warning: Warnings about unequally long value lists aren't working properly.
 

Resolution of Conflicts between the Rules

Revision 61 2011-08-28 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 441 to 441
  Their semantics is the same as for ordinary twolc rules.
Changed:
<
<
Regular expression center rules can be used in implementing the classical example of Arabic morphology. The following grammar can be used to derive two noun and two verb forms from three consonant roots. The grammar implements the example in Table 1. in Attia et al. 2011:
>
>
Regular expression center rules can be used to describe non-concatenative phenomena such as the derivation of words from consonant stems in Arabic by adding vowels between the consonants. The following grammar derives two noun and two verb forms from roots consisting of three consonants. The grammar implements the example in Table 1 in Attia et al. 2011:
 
Changed:
<
<
Root drs
>
>
Root drs
 
Patterns R1aR2aR3a R1aR2R2aR3a R1aaR2iR3 muR1aR2R2iR3
POS V V N N
Stem darasa darrasa daaris mudarris
Changed:
<
<
  'study' 'teach' 'student' 'teacher'
>
>
ENG 'study' 'teach' 'student' 'teacher'
 
Alphabet
Line: 475 to 475
  Rules
Changed:
<
<
!! These rules transform underlying forms
>
>
!! These rules transform the underlying forms
 !!
Changed:
<
<
!! "(m u <MU>:0) C <V> <V> C <C> <V> C <V>"
>
>
!! C1 C2 C3 !! C1 C2 C3 !! C1 C2 C3 !! m u C1 C2 C3
 !!
Changed:
<
<
!! into surface realizations. Here C is a consonant, <C> the consonant
>
>
!! into surface realizations. Here C1, C2 and C3 are consonants, <C> the consonant
 !! doubling morphophoneme and <V> the abstract vowel symbol.

!! For future improvement of hfst-twolc: Need to implement variables in

Line: 521 to 524
 %<C%>:CNS => CNS _ ; where CNS in cns ;
Added:
>
>
The grammar will generate the following forms from their underlying representations
         d <V> <V> r <C> <V> s <V> <NOUN1>  -->       d a a r 0 i s 0 0
m u <MU> d <V> <V> r <C> <V> s <V> <NOUN2>  --> m u 0 d a 0 r r i s 0 0
         d <V> <V> r <C> <V> s <V> <VERB1>  -->       d a 0 r 0 a s a 0
         d <V> <V> r <C> <V> s <V> <VERB2>  -->       d a 0 r r a s a 0
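The root-and-pattern derivation that the grammar encodes can be sketched in plain Python. This is only an illustration of the interdigitation idea using the patterns from the table above. The function name interdigitate is hypothetical, and the sketch produces surface stems directly instead of aligning underlying and surface symbols the way the twolc rules do.

```python
def interdigitate(root, pattern):
    """Replace the slots R1, R2, R3 in pattern with the consonants of root."""
    stem = pattern
    for index, consonant in enumerate(root, start=1):
        stem = stem.replace("R" + str(index), consonant)
    return stem


# Patterns as given in the table (Attia et al. 2011, Table 1).
PATTERNS = ("R1aR2aR3a", "R1aR2R2aR3a", "R1aaR2iR3", "muR1aR2R2iR3")

stems = [interdigitate("drs", pattern) for pattern in PATTERNS]

if __name__ == "__main__":
    print(stems)
```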
Warning: Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.

Warning: If you use regular expressions with stars, plusses or pairs 0:x in the center, the rules become very difficult to understand, so it's probably best to use relatively simple center languages.

Revision 60 2011-08-28 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 441 to 441
  Their semantics is the same as for ordinary twolc rules.
Added:
>
>
Regular expression center rules can be used in implementing the classical example of Arabic morphology. The following grammar can be used to derive two noun and two verb forms from three consonant roots. The grammar implements the example in Table 1. in Attia et al. 2011:

Root drs
Patterns R1aR2aR3a R1aR2R2aR3a R1aaR2iR3 muR1aR2R2iR3
POS V V N N
Stem darasa darrasa daaris mudarris
  'study' 'teach' 'student' 'teacher'

Alphabet

!! Special symbols, which mark different derivations 
!! using root and pattern interdigitation.
!!
!! These forms correspond to the examples in Table 1 in the
!! Attia et al. SFCM 2011 article.
%<VERB1%>:0 %<NOUN1%>:0 %<VERB2%>:0 %<NOUN2%>:0

!! The different realizations for the abstract vowel <V>. It can
!! be realized as any surface vowel or epsilon.
%<V%>:e %<V%>:y %<V%>:u %<V%>:u %<V%>:i %<V%>:o %<V%>:a 
%<V%>:0

q w r t p s d f g h j k l z x c v b n m

a e i o u y 
;

Sets

!! Consonants
cns = q w r t p s d f g h j k l z x c v b n m;

Rules

!! These rules transform underlying forms 
!!
!! "(m u <MU>:0) C <V> <V> C <C> <V> C <V>" 
!!
!! into surface realizations. Here C is a consonant, <C> the consonant 
!! doubling morphophoneme and <V> the abstract vowel symbol.   

!! For future improvement of hfst-twolc: Need to implement variables in 
!! regular expression center rules.

"VERB 1 rule"

!! drs -> darasa

<[ cns %<V%>:a %<V%>:0 cns %<C%>:0 %<V%>:a cns %<V%>:a ]> <==> _ %<VERB1%>:0 ;


"NOUN 1 rule"

!! drs -> daaris

<[ cns %<V%>:a %<V%>:a cns %<C%>:0 %<V%>:i cns %<V%>:0 ]> <==> _ %<NOUN1%>:0 ;


"VERB 2 rule"

!! drs -> darrasa

<[ cns %<V%>:a %<V%>:0 cns %<C%>:cns %<V%>:a cns %<V%>:a ]> <==> _ %<VERB2%>:0 ;


"NOUN 2 rule"

!! mu + drs -> mudarris

<[ cns %<V%>:a %<V%>:0 cns %<C%>:cns %<V%>:i cns %<V%>:0 ]> <==> %<MU%>:0 _ %<NOUN2%>:0 ;


"Consonant doubling"

!! %<C%> either vanishes or is realized as the same symbol as the 
!! preceding consonant.

%<C%>:CNS => CNS _ ; where CNS in cns ;
Warning: Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.

Warning: If you use regular expressions with stars, plusses or pairs 0:x in the center, the rules become very difficult to understand, so it's probably best to use relatively simple center languages.

Line: 663 to 743
 
  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006
Added:
>
>
  • M. Attia, P. Pecina, A. Toral, L. Tounsi and J. Van Genabith, A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer, Proceedings of the Second Workshop on Systems and Frameworks for Computational Morphology, Springer, 2011.
 

Obtaining the program and installing

Revision 59 2011-08-16 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 371 to 371
  Often negative rule contexts and conflict resolution will produce the same result, but conflict resolution requires that the conflicting rules can be formulated in such a way that they form a chain of subcases. If this is not the case, an unresolvable conflict arises and conflict resolution cannot apply. In such cases negative contexts can be used to restrict one or more of the rules so that their context becomes disjoint from the context of the conflicting rule, which means that no rule conflict arises.
Changed:
<
<
The list of negative contexts should follow the list of positive contexts. The list of negative contexts has to start with the keyword except.
>
>
The list of negative contexts should follow the list of positive contexts. The list of negative contexts has to start with the keyword except.
 

Regular expression center rules

Revision 58 2011-08-15 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 350 to 350
 {A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ; [ %{A%}:LastVowel ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;
Changed:
<
<
where LastVowel in ( и ү е э ө я а ё о ы ю у ) Vy in ( е ө е е ө а а о о а а а )
>
>
where LastVowel in ( и ү е э ө я а ) Vy in ( е ө е е ө а а )
  matched ;
The right context [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] states the restriction that the rule applies only when {A} is not followed by (an optional) >: and {U}:. The right context is quite tricky, because we have to take into account the possibility that {A} is followed by two morpheme boundaries (which could happen in some marginal cases) and that {A}: may be the last pair in a word. This context is so tricky that it can easily be formulated incorrectly. In cases where the prohibited right context is more complicated, it will be very difficult to find the correct way to prevent it. Hence HfstTwolC allows for using negative contexts to restrict the application of two-level rules.
Line: 360 to 360
 
 "Vowel harmony for archiphoneme {A}"
{A}:Vy  <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ ;
                                              [ %{A%}:LastVowel ] _  ;
Deleted:
<
<
~~[ _ ( %>: ) %{U%}: ]~~ ;
 
Changed:
<
<
where LastVowel in ( и ү е э ө я а ё о ы ю у ) Vy in ( е ө е е ө а а о о а а а )
>
>
except _ ( %>: ) %{U%}: ;

where LastVowel in ( и ү е э ө я а ) Vy in ( е ө е е ө а а )

  matched ;

Often negative rule contexts and conflict resolution will produce the same result, but conflict resolution requires that the conflicting rules can be formulated in such a way that they form a chain of subcases. If this is not the case, an unresolvable conflict arises and conflict resolution cannot apply. In such cases negative contexts can be used to restrict one or more of the rules so that their context becomes disjoint from the context of the conflicting rule, which means that no rule conflict arises.

Changed:
<
<
Negative contexts should be enclosed in negative context brackets ~~[ ... ]~~ and negative contexts can only occur at the end of the rule context list after all regular rule contexts.
>
>
The list of negative contexts should follow the list of positive contexts. The list of negative contexts has to start with the keyword except.
 

Regular expression center rules

Revision 57 2011-08-13 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 345 to 345
  Sometimes it is easier to formulate rule contexts as the difference of two contexts. In such cases it's possible to use negative contexts.
Changed:
<
<
In a description of the phonology of the Kyrgyz language, the preceding surface vowel determines the realization of the archiphoneme {A}, but the realization also depends on whether an archiphoneme {U} and an optional morpheme boundary follow. The two level rule which governs the realization of archiphoneme {A}, when there is no following archiphoneme {U} looks like
>
>
In a description of the phonology of the Kyrgyz language, the preceding surface vowel determines the realization of the archiphoneme {A}, but the realization also depends on whether an archiphoneme {U} and an optional morpheme boundary follow. The two level rule which governs the realization of archiphoneme {A}, when there is no following archiphoneme {U}, looks like this
 
 "Vowel harmony for archiphoneme {A}"
Changed:
<
<
{A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ ( %>: ) \[ %>: | %{U%}: ] ; [ %{A%}:LastVowel ] _ ( %>: ) \[ %>: | %{U%}: ] ;
>
>
{A}:Vy <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ; [ %{A%}:LastVowel ] _ [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] ;
  where LastVowel in ( и ү е э ө я а ё о ы ю у ) Vy in ( е ө е е ө а а о о а а а ) matched ;
Changed:
<
<
The right context ( %>: ) \[ %>: | %{U%}: ] states the restriction that the rule applies only when {A} is not followed by an optional >: and {U}:. The right context is quite tricky and can easily be formulated incorrectly. In cases where the prohibited right context is more complicated, it will be difficult to find the correct way to prevent it. Hence you can use negative contexts to restrict the application of two-level rules.
>
>
The right context [ ( %>: ) [ \[ %>: | %{U%}: ] | .#. ] | %>: %>: ] states the restriction that the rule applies only when {A} is not followed by (an optional) >: and {U}:. The right context is quite tricky, because we have to take into account the possibility that {A} is followed by two morpheme boundaries (which could happen in some marginal cases) and that {A}: may be the last pair in a word. This context is so tricky that it can easily be formulated incorrectly. In cases where the prohibited right context is more complicated, it will be very difficult to find the correct way to prevent it. Hence HfstTwolC allows for using negative contexts to restrict the application of two-level rules.
  Using negative contexts, the rule "Vowel harmony for archiphoneme {A}" can be formulated by referring directly to the prohibited right context [ ( %>: ) %{U%}: ]
 "Vowel harmony for archiphoneme {A}"
Line: 366 to 366
  Vy in ( е ө е е ө а а о о а а а ) matched ;
Added:
>
>
Often negative rule contexts and conflict resolution will produce the same result, but conflict resolution requires that the conflicting rules can be formulated in such a way that they form a chain of subcases. If this is not the case, an unresolvable conflict arises and conflict resolution cannot apply. In such cases negative contexts can be used to restrict one or more of the rules so that their context becomes disjoint from the context of the conflicting rule, which means that no rule conflict arises.
 Negative contexts should be enclosed in negative context brackets ~~[ ... ]~~ and negative contexts can only occur at the end of the rule context list after all regular rule contexts.

Regular expression center rules

Revision 56 2011-08-12 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 345 to 345
  Sometimes it is easier to formulate rule contexts as the difference of two contexts. In such cases it's possible to use negative contexts.
Changed:
<
<
Say, we want to limit the change x:y to occur after a symbol z, but don't want it to occur between two z symbols. This can be accomplished using a regular rule
x:y <=> z _ \z
>
>
In a description of the phonology of the Kyrgyz language, the preceding surface vowel determines the realization of the archiphoneme {A}, but the realization also depends on whether an archiphoneme {U} and an optional morpheme boundary follow. The two level rule which governs the realization of archiphoneme {A}, when there is no following archiphoneme {U} looks like
 "Vowel harmony for archiphoneme {A}"
{A}:Vy  <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ ( %>: ) \[ %>: | %{U%}: ] ;
                                              [ %{A%}:LastVowel ] _ ( %>: ) \[ %>: | %{U%}: ] ;

                               where LastVowel in (  и  ү  е  э  ө  я  а  ё  о  ы  ю  у  )
                                            Vy in (  е  ө  е  е  ө  а  а  о  о  а  а  а  )
                               matched ;
 
Changed:
<
<
We can also formulate an equivalent rule using negative contexts
x:y <=>     z _       ;
        ~~[ z _ z ]~~ ;
>
>
The right context ( %>: ) \[ %>: | %{U%}: ] states the restriction that the rule applies only when {A} is not followed by an optional >: and {U}:. The right context is quite tricky and can easily be formulated incorrectly. In cases where the prohibited right context is more complicated, it will be difficult to find the correct way to prevent it. Hence you can use negative contexts to restrict the application of two-level rules.

Using negative contexts, the rule "Vowel harmony for archiphoneme {A}" can be formulated by referring directly to the prohibited right context [ ( %>: ) %{U%}: ]

 "Vowel harmony for archiphoneme {A}"
{A}:Vy  <=> [ :LastVowel :Cns* [ :Cns - й: ] ]/[ :0 | %>: ] _ ;
                                              [ %{A%}:LastVowel ] _  ;
                                                              ~~[ _ ( %>: ) %{U%}: ]~~ ;

                               where LastVowel in (  и  ү  е  э  ө  я  а  ё  о  ы  ю  у  )
                                            Vy in (  е  ө  е  е  ө  а  а  о  о  а  а  а  )
                               matched ;
 
Changed:
<
<
The rule limits the change to all contexts which match z _ but do not match z _ z.
>
>
Negative contexts should be enclosed in negative context brackets ~~[ ... ]~~ and negative contexts can only occur at the end of the rule context list after all regular rule contexts.
 

Regular expression center rules

Changed:
<
<
Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only conern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the affixes consist of multiple symbols. Such phenomena are more conveniently described using rules with regular expression centers.
>
>
Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only concern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the affixes consist of multiple symbols. Such phenomena are more conveniently described using rules with regular expression centers.
  The following grammar describes the choice of the prefix signifying 1st person present form of a verb in Ojibwe.
Alphabet

Revision 55 2011-08-12 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 341 to 341
 a:b => b _ a ;
Added:
>
>

Rules with negative contexts

Sometimes it is easier to formulate rule contexts as the difference of two contexts. In such cases it's possible to use negative contexts.

Say, we want to limit the change x:y to occur after a symbol z, but don't want it to occur between two z symbols. This can be accomplished using a regular rule

x:y <=> z _ \z
We can also formulate an equivalent rule using negative contexts
x:y <=>     z _       ;
        ~~[ z _ z ]~~ ;
The rule limits the change to all contexts which match z _ but do not match z _ z.
 

Regular expression center rules

Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only conern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the affixes consist of multiple symbols. Such phenomena are more conveniently described using rules with regular expression centers.

Revision 54 2011-06-13 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 345 to 345
  Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only conern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the affixes consist of multiple symbols. Such phenomena are more conveniently described using rules with regular expression centers.
Changed:
<
<
The following example describes the choice of the prefix signifying 1st person present form of a verb in Ojibwe.
>
>
The following grammar describes the choice of the prefix signifying 1st person present form of a verb in Ojibwe.
 
Alphabet

a b c d e f g h i j k l m
Line: 396 to 396
 "1st person person present prefix before a vowel is ind." <[ IND ]> <==> _ Vowel ;
Added:
>
>
The grammar will generate the following 1st person present forms out of the verb baseforms nibaa (to sleep), anokii (to work), dagoshin (to arrive) and bakade (to be hungry).
<PRES> <PRES> <PRES> n i b a a       --> n i 0 n i b a a
<PRES> <PRES> <PRES> a n o k i i     --> i n d a n o k i i
<PRES> <PRES> <PRES> d a g o s h i n --> i n 0 d a g o s h i n
<PRES> <PRES> <PRES> b a k a d e     --> i m 0 b a k a d e
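The choice of prefix made by the grammar can be paraphrased as a short Python sketch. It is only an illustration of the conditioning described in the rules, not of how hfst-twolc compiles them: the function name first_person_present is hypothetical, and the vowel and consonant lists are taken from the Vowel and AlveolarAndG sets of the grammar.

```python
# Sets as declared in the grammar.
VOWELS = ("a", "e", "i", "o", "u", "y")
ALVEOLAR_AND_G = ("d", "g", "j", "z", "zh")


def first_person_present(base_form):
    """Attach the 1st person present prefix chosen by the following sound."""
    if base_form.startswith(VOWELS):
        prefix = "ind"   # before a vowel
    elif base_form.startswith(ALVEOLAR_AND_G):
        prefix = "in"    # before an alveolar consonant or g
    elif base_form.startswith("b"):
        prefix = "im"    # before b
    else:
        prefix = "ni"    # default variant
    return prefix + base_form


forms = [first_person_present(verb)
         for verb in ("nibaa", "anokii", "dagoshin", "bakade")]

if __name__ == "__main__":
    print(forms)
```

The results correspond to the surface forms listed above once the 0 symbols are removed.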
  The syntax for regular expression center rules is similar to the syntax of ordinary rules, but the center needs to be enclosed in brackets <[ ... ]> and the rule operators look slightly different
==>, <==, <==>, /<==
Line: 404 to 411
  Warning: Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.
Changed:
<
<
Warning: If you use regular expressions with stars or plusses or pairs 0:x, the rules get very difficult to understand, so it's probably best to limit yourself to relatively simple center languages.
>
>
Warning: If you use regular expressions with stars, plusses or pairs 0:x in the center, the rules become very difficult to understand, so it's probably best to use relatively simple center languages.
 

Weighted rules (NOT FULLY IMPLEMENTED YET)

Revision 53 2011-06-13 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 343 to 343
 

Regular expression center rules

Changed:
<
<
Some languages incorporate alternations, which are difficult to describe using rules concerning a single pair. One example is the flipping of adjacent sounds for inflection. E.g. the baseform of a word could be a b c and the inflected form a c b. Such penomena can be modelled using ordinary twolc rules, but they are more conveniently modelled using rules with regular expression centers.
>
>
Some languages incorporate alternations, which are difficult to describe using regular twolc rules, which only conern a single symbol pair. It can e.g. be cumbersome to describe a choice of affix which is conditioned on phonological context, when the affixes consist of multiple symbols. Such phenomena are more conveniently described using rules with regular expression centers.
 
Changed:
<
<
The following example describes the flipping of b and c
"Flip b and c after marker."
<[ b:c c:b ]> <==> <FLIP>:0 _ # ;
This rule performs the transformation
a <FLIP> b c --> a 0 c b

The syntax for regular expression center rules is similar to the syntax of ordinary rules, but the center needs to be enclosed in brackets <[ ... ]> and the rule operators look slightly different

==>, <==, <==>, /<==
Their semantics is the same as for ordinary twolc rules.

In order to not transform b:s to c:s and vice versa in incorrect contexts, we need another rule which restricts the transformation into the context following a flip marker

"Restrict flip."
<[ b:c | c:b ]> ==> <FLIP>:0 ?* _ ?* # ;
>
>
The following example describes the choice of the prefix signifying 1st person present form of a verb in Ojibwe.
Alphabet
 
Changed:
<
<
Warning: Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.
>
>
a b c d e f g h i j k l m n o p q r t u v w x y z zh ;
 
Changed:
<
<
Besides flipping phonemes, regular expression center rules can be used to model circumfixation. The following grammar will match a prefix and suffix of a word. It accepts e.g.
<PRE>:0 <AF>:a <AF>:b <AF>:c <PRE>:0 x <SUF>:0 <AF>:a <AF>:b <AF>:c <SUF>:0 
since the prefix and suffix are the same, namely
<AF>:a <AF>:b <AF>:c
It will not accept
<PRE>:0 <AF>:a <AF>:b <AF>:c <PRE>:0 x <SUF>:0 <AF>:c <AF>:d <AF>:0 <SUF>:0
since the prefix
<AF>:a <AF>:b <AF>:c
doesn't match the suffix
<AF>:c <AF>:d <AF>:0 
>
>
Sets
 
Changed:
<
<
Alphabet
;
>
>
Vowel = a e i o u y ;
AlveolarAndG = d g j z zh ;
  Definitions
Changed:
<
<
! List of possible affixes.
ABC   = <AF>:a <AF>:b <AF>:c;
CD    = <AF>:c <AF>:d <AF>:0;
>
>
! The possible first person present form markers are ind-, in-, im- and ni-.
! The choice of prefix is conditioned on phonological context.

IN = :i :n :0 ;
IM = :i :m :0 ;
IND = :i :n :d ;
NI = :n :i :0 ;

 
Changed:
<
<
Affix = [ ABC | CD ] ;
>
>
! The list of all possible prefixes.
PREFIX = [ IN | IM | IND | NI ] ;
  Rules
Changed:
<
<
"Possible affixes."
<[ Affix ]> <== <PRE>:0 _ <PRE>:0 ;
                <SUF>:0 _ <SUF>:0 ;

! Suffixes and prefixes have to match. E.g. "abcxabc" is possible since the
! prefix and suffix "abc" match. "cdxabc" is not possible since the prefix "cd"
! doesn't match the suffix "abc".

>
>
! We declare that " " can only be realized as
! "i n d", "i n", "i m" or "n i".

"1st person present prefixes."
<[ PREFIX ]> <== _ ;

! The following rules restrict
!
! - ":i :n :0" to contexts where an alveolar consonant or
!   "g" follows,
! - ":i :m :0" to contexts where "b" follows and
! - ":i :n :d" to contexts where a vowel follows.
!
! The default variant of the 1st person present marker is
! ":n :i :0".

"1st person person present prefix before an alveolar consonant or g is in."
<[ IN ]> <==> _ AlveolarAndG ;

"1st person person present prefix before b in im."
<[ IM ]> <==> _ b: ;

 
Changed:
<
<
"Match prefix and suffix abc."
<[ <PRE>:0 ABC <PRE>:0 ]> <==> _ ?* <SUF>:0 ABC <SUF>:0 ;

>
>
"1st person person present prefix before a vowel is ind." <[ IND ]> <==> _ Vowel ;
 
Changed:
<
<
"Match prefix and suffix cd."
<[ <PRE>:0 CD  <PRE>:0 ]> <==> _ ?* <SUF>:0 CD  <SUF>:0 ;

>
>
The syntax for regular expression center rules is similar to the syntax of ordinary rules, but the center needs to be enclosed in brackets <[ ... ]> and the rule operators look slightly different
==>, <==, <==>, /<==
 
Added:
>
>
Their semantics is the same as for ordinary twolc rules.

Warning: Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.

  Warning: If you use regular expressions with stars or plusses or pairs 0:x, the rules get very difficult to understand, so it's probably best to limit yourself to relatively simple center languages.
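
The conditioning expressed by the prefix rules above can be paraphrased procedurally. The following Python sketch only illustrates the intended outcome; it does not show how hfst-twolc compiles or applies the rules. The function name `choose_prefix` and the representation of a stem as a list of symbols are assumptions made for this example, and the stems in the usage note are made-up symbol sequences, not Ojibwe data.

```python
# Sets copied from the Sets section of the grammar above.
VOWEL = {"a", "e", "i", "o", "u", "y"}
ALVEOLAR_AND_G = {"d", "g", "j", "z", "zh"}


def choose_prefix(stem):
    """Return the 1st person present prefix for a stem given as a list of symbols.

    Hypothetical helper: it mirrors the contexts of the rules and nothing more.
    """
    first = stem[0] if stem else None
    if first in ALVEOLAR_AND_G:
        return ["i", "n"]        # IN: before an alveolar consonant or g
    if first == "b":
        return ["i", "m"]        # IM: before b
    if first in VOWEL:
        return ["i", "n", "d"]   # IND: before a vowel
    return ["n", "i"]            # NI: the default variant
```

For instance, `choose_prefix(["zh", "a"])` selects the in- variant and `choose_prefix(["k", "a"])` falls through to the default ni-.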

Revision 522011-06-13 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 341 to 341
 a:b => b _ a ;
Changed:
<
<

Rules with regular expression centers

>
>

Regular expression center rules

 
Added:
>
>
Some languages incorporate alternations that are difficult to describe using rules concerning a single pair. One example is the flipping of adjacent sounds for inflection. E.g. the base form of a word could be a b c and the inflected form a c b. Such phenomena can be modelled using ordinary twolc rules, but they are more conveniently modelled using rules with regular expression centers.

The following example describes the flipping of b and c

"Flip b and c after marker."
<[ b:c c:b ]> <==> <FLIP>:0 _ # ;
This rule performs the transformation
a <FLIP> b c --> a 0 c b

The syntax for regular expression center rules is similar to the syntax of ordinary rules, but the center needs to be enclosed in brackets <[ ... ]> and the rule operators look slightly different

==>, <==, <==>, /<==
Their semantics is the same as for ordinary twolc rules.

To avoid transforming b into c and vice versa in incorrect contexts, we need another rule, which restricts the transformation to the context following a flip marker:

"Restrict flip."
<[ b:c | c:b ]> ==> <FLIP>:0 ?* _ ?* # ;

Warning: Regular expression center rules do not participate in conflict resolution. Rule conflicts resulting from regular expression center rules are not detected by hfst-twolc.

Besides flipping phonemes, regular expression center rules can be used to model circumfixation. The following grammar will match a prefix and suffix of a word. It accepts e.g.

<PRE>:0 <AF>:a <AF>:b <AF>:c <PRE>:0 x <SUF>:0 <AF>:a <AF>:b <AF>:c <SUF>:0 
since the prefix and suffix are the same, namely
<AF>:a <AF>:b <AF>:c
It will not accept
<PRE>:0 <AF>:a <AF>:b <AF>:c <PRE>:0 x <SUF>:0 <AF>:c <AF>:d <AF>:0 <SUF>:0
since the prefix
<AF>:a <AF>:b <AF>:c
doesn't match the suffix
<AF>:c <AF>:d <AF>:0 

Alphabet
;

Definitions

! List of possible affixes.
ABC   = <AF>:a <AF>:b <AF>:c;
CD    = <AF>:c <AF>:d <AF>:0;

Affix = [ ABC | CD ] ;

Rules

"Possible affixes."
<[ Affix ]> <== <PRE>:0 _ <PRE>:0 ;
                <SUF>:0 _ <SUF>:0 ;

! Suffixes and prefixes have to match. E.g. "abcxabc" is possible since the
! prefix and suffix "abc" match. "cdxabc" is not possible since the prefix "cd"
! doesn't match the suffix "abc".

"Match prefix and suffix abc."
<[ <PRE>:0 ABC <PRE>:0 ]> <==> _ ?* <SUF>:0 ABC <SUF>:0 ;

"Match prefix and suffix cd."
<[ <PRE>:0 CD  <PRE>:0 ]> <==> _ ?* <SUF>:0 CD  <SUF>:0 ;

Warning: If you use regular expressions with stars or plusses or pairs 0:x, the rules get very difficult to understand, so it's probably best to limit yourself to relatively simple center languages.

 

Weighted rules (NOT FULLY IMPLEMENTED YET)

Revision 512011-06-13 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 341 to 341
 a:b => b _ a ;
Added:
>
>

Rules with regular expression centers

 

Weighted rules (NOT FULLY IMPLEMENTED YET)

It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation. At this time it is only possible to compile weighted rules with zero weights.

Revision 502011-03-17 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 134 to 134
  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
Deleted:
<
<
All symbols referred to in the rules have to be declared in the alphabet. An error message will be issued for undeclared symbols. The null symbol 0 is not declared separately and the word boundary symbol # is part of the alphabet by default and should not be included in the alphabet declaration (see below for more information on these two symbols). (The word boundary symbol is shown as @#@ when you print rule transducers using e.g. HfstFst2Txt.)
 Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

An example of an alphabet is

Line: 153 to 151
  (HfstTwolC represents the null symbol (epsilon) internally as @0@, e.g. when using HfstFst2Txt, the null symbol will be displayed as @0@ and the digit zero is displayed as a plain 0. Note that the null symbol here is not quite the same epsilon that is used in replace rules in a cascade. In two-level rules, it is a place-holder which is mapped into an epsilon after the virtual intersecting of the rules.)

Changed:
<
<

Implicit word-boundary #

Word boundaries # are understood to occur at the beginning and at the end of the lexical level of each string. Thus, the rules may refer to the beginning and the end of each word by writing #: as the first or the last item in a context. These implicit boundaries are not written in the entries of HfstLexC lexicons. However, you may also use the hash symbol as a normal character in the two-level rules by quoting it with a percent sign, %#. The normal hash character can be used in HfstLexC as well (where it need not be quoted). The hash sign or any other signs can be used e.g. for boundaries within word-forms.
>
>

Implicit word-boundary .#.

Word boundaries .#. are understood to occur at the beginning and at the end of the lexical level of each string. Thus, the rules may refer to the beginning and the end of each word by writing .#. as the first or the last item in a context. These implicit boundaries are not written in the entries of HfstLexC lexicons. Note that .#. doesn't refer to any specific symbol in HfstLexC lexicons. It signifies the absolute beginning and end of a string. You can declare your own word boundaries in HfstLexC and HfstTwolC (e.g. #), but these are just symbols.
  (The implicit word boundary is represented as @#@ in rule transducers, and the quoted hash sign %# is represented as # in rule transducers as displayed by HfstFst2Txt. When HfstComposeIntersect combines a lexicon transducer and a set of two-level rules, it inserts a @#@ at the very beginning and end of the lexicon before doing the combined operation of intersecting and composing. Thus, you do not write it explicitly in your lexicon.)
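
The effect of the implicit boundaries on a single lexical string can be pictured with a small Python sketch. This is only an illustration: HfstComposeIntersect works on transducers, not on Python lists, and the helper name `add_boundaries` is an assumption made for this example.

```python
BOUNDARY = "@#@"  # how the implicit word boundary is displayed in rule transducers


def add_boundaries(symbols):
    """Wrap one lexical string, given as a list of symbols, in implicit boundaries."""
    return [BOUNDARY] + list(symbols) + [BOUNDARY]


# A "#" declared by the user is just an ordinary symbol inside the string.
wrapped = add_boundaries(["k", "a", "#", "t"])
```

The wrapped string starts and ends with `@#@`, while the user-declared `#` in the middle is left untouched.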

Revision 492010-12-21 - TommiPirinen

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Revision 482010-11-29 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Added:
>
>
Will be updated shortly when hfst3 becomes available.
 

Purpose

Revision 472010-03-24 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 205 to 205
  Sets may be used in rules as a short-hand for collections of character-pairs.
Changed:
<
<
Perhaps one might want to write a rule, which states, that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by the rule
>
>
Perhaps one might want to write a rule, which states, that the phoneme t is realised as its voiceless fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by the rule
 
t:ө <= :Vowel _ :Vowel ;
Line: 259 to 259
 
I:j => :V _ :V ;
Changed:
<
<
It states, that the input-character I can be realised as j only in a context, where it is surrounded by output vowels. The rule doesn't constrain the distribution of any other pairs I:X, nor does it constrain the distribution of pairs X:j, where X is something else than I. It simply states, that if the pair I:j occurs, it has to occur between two output vowels.
>
>
It states, that the input-character I can be realised as j only in a context, where it is surrounded by output vowels. The rule doesn't constrain the distribution of any other pairs I:X, nor does it constrain the distribution of pairs X:j, where X is something other than I. It simply states, that if the pair I:j occurs, it has to occur between two output vowels.
  The context :V _ :V in the example is automatically extended to a so called total context, by hfst-twolc. This means that, when the rule is compiled, the context will become ?* :V _ :V ?*. This applies to all kinds of rule-operators.
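
What the rule I:j => :V _ :V demands of a string of pairs can be sketched in Python. This is a simplified illustration under several assumptions: pairs are modelled as (input, output) tuples, the set V is assumed to contain a e i o u, each context side is a single pair, and the names `obeys_right_arrow` and `output_vowel` are invented for this example.

```python
def obeys_right_arrow(pairs, center, left_ok, right_ok):
    """Check a rule `center => left _ right` on a list of (input, output) pairs.

    Every occurrence of `center` needs a left neighbour accepted by `left_ok`
    and a right neighbour accepted by `right_ok`. Anything may occur further
    away, which is what the extension to ?* ... ?* expresses.
    """
    for i, pair in enumerate(pairs):
        if pair != center:
            continue
        has_left = i > 0 and left_ok(pairs[i - 1])
        has_right = i + 1 < len(pairs) and right_ok(pairs[i + 1])
        if not (has_left and has_right):
            return False
    return True


VOWELS = set("aeiou")  # assumed contents of the set V


def output_vowel(pair):
    """True for any pair whose output character is a vowel, i.e. :V."""
    return pair[1] in VOWELS


ok = [("k", "k"), ("a", "a"), ("I", "j"), ("a", "a")]
bad = [("I", "j"), ("a", "a")]
```

Here `obeys_right_arrow(ok, ("I", "j"), output_vowel, output_vowel)` holds, whereas `bad` is rejected because I:j has no output vowel on its left. A string without any I:j pair satisfies the rule trivially.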

Revision 462010-03-20 - KimmoKoskenniemi

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 113 to 113
 

Operator Precedence

Changed:
<
<
The operators in htwolc have different precedence. A rule of thumb for precedence: unary operators have the strongest bind, then concatenation and finally binary operators. The constructions [ ... ] and ( ... ) override other precedences.
>
>
The operators in HfstTwolC have different precedence. A rule of thumb for precedence: unary operators have the strongest bind, then concatenation and finally binary operators. The constructions [ ... ] and ( ... ) override other precedences.
  Operators ordered by precedence from strongest to weakest:
Line: 132 to 132
  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
Changed:
<
<
Every symbol referred to in some of the rules except epsilon (0) and the word-boundary (#), has to be declared in the alphabet. Otherwise an error message will be issued. A word boundary symbol # is inserted into the alphabet by default. This symbol is visible as @#@ when you print rule transducers using e.g. HfstFst2Txt.
>
>
All symbols referred to in the rules have to be declared in the alphabet. An error message will be issued for undeclared symbols. The null symbol 0 is not declared separately and the word boundary symbol # is part of the alphabet by default and should not be included in the alphabet declaration (see below for more information on these two symbols). (The word boundary symbol is shown as @#@ when you print rule transducers using e.g. HfstFst2Txt.)
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Line: 144 to 144
 ! Characters consist of strings of utf-8 characters. No white-space, though! a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;
Changed:
<
<

Epsilon 0

You may use pairs X:0 and 0:X to denote deletions and insertions. Please note that insertion and deletion pairs are just like all other pairs.
>
>

The null symbol (epsilon) 0

Two-level rules express deletions using a null symbol, e.g. by e:0. A zero denotes the null symbol. Epenthesis (or insertion) is denoted, likewise, by e.g. 0:o. In two-level rules such pairs with a null symbol behave much like any other pairs.
 
Changed:
<
<
If your grammar contains the symbol 0, you should refer to it %0.
>
>
Sometimes one needs to refer to the digit zero instead of the null symbol. The digit zero should be quoted, i.e. written as %0 in the alphabet declaration and in rules etc.
 
Changed:
<
<
HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epsilon will be displayed as @0@. ((Note that the epsilon here is actually a symbol for the compiler rather than the same kind of epsilon used in replace rules in a cascade. In two-level rules, it is a place-holder which mapped into an epsilon after the virtual intersecting of the rules.))
>
>
(HfstTwolC represents the null symbol (epsilon) internally as @0@, e.g. when using HfstFst2Txt, the null symbol will be displayed as @0@ and the digit zero is displayed as a plain 0. Note that the null symbol here is not quite the same epsilon that is used in replace rules in a cascade. In two-level rules, it is a place-holder which is mapped into an epsilon after the virtual intersecting of the rules.)
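
The place-holder role of the null symbol can be made concrete with a short Python sketch. It assumes that a correspondence is modelled as a list of (input, output) tuples and that the null symbol is written "0"; the helper names are invented for this example.

```python
NULL = "0"  # the null symbol as written in a grammar


def input_string(pairs):
    """Lexical side of a correspondence, with null place-holders dropped."""
    return [i for i, _ in pairs if i != NULL]


def output_string(pairs):
    """Surface side of a correspondence, with null place-holders dropped."""
    return [o for _, o in pairs if o != NULL]


# k:k, a deletion e:0 and an insertion 0:o
pairs = [("k", "k"), ("e", "0"), ("0", "o")]
```

For these pairs the lexical side reads k e and the surface side reads k o: the deletion e:0 contributes nothing to the surface, and the insertion 0:o contributes nothing to the lexical side.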
 
Changed:
<
<

Word-Boundary #

You may use # to refer to word-boundaries. The word-boundary refers to the absolute beginning and end of a word-form. If you want to use boundaries between the parts of compound-words, you need to choose a different symbol (e.g. +) and see to it that your HfstLexC lexicon inserts the separator symbol between the parts of a compound.
>
>

Implicit word-boundary #

Word boundaries # are understood to occur at the beginning and at the end of the lexical level of each string. Thus, the rules may refer to the beginning and the end of each word by writing #: as the first or the last item in a context. These implicit boundaries are not written in the entries of HfstLexC lexicons. However, you may also use the hash symbol as a normal character in the two-level rules by quoting it with a percent sign, %#. The normal hash character can be used in HfstLexC as well (where it need not be quoted). The hash sign or any other signs can be used e.g. for boundaries within word-forms.
 
Changed:
<
<
If you need the symbol # you should quote it using %, I.e. %#. The word-boundary is compiled into @#@ in rule transducers. This is a way of preventing the symbol # getting mixed up with a word-boundary. The quoted symbol %# is compiled into # in rule transducers.

An absolute word-boundary is by default appended to the beginning and to the end of HfstLexC lexicons by HfstComposeIntersect, so you don't need to declare it in your lexicon.

>
>
(The implicit word boundary is represented as @#@ in rule transducers, and the quoted hash sign %# is represented as # in rule transducers as displayed by HfstFst2Txt. When HfstComposeIntersect combines a lexicon transducer and a set of two-level rules, it inserts a @#@ at the very beginning and end of the lexicon before doing the combined operation of intersecting and composing. Thus, you do not write it explicitly in your lexicon.)
 

Diacritics

Changed:
<
<
The morpho-phonological description of a language may contain symbols, which
>
>
The morphophonological description of a language may contain symbols, which
 
  • act as triggers for certain phonological rules,
  • are irrelevant for all other rules and
Line: 185 to 183
 I:j <=> Vowel _ Vowel ; allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.
Changed:
<
<
Warning: You shouldn't declare flag-diacritics used in HfstLexC lexicons as diacritics. These are handled using a different mechanism and needn't be mentioned in the HfstTwolC grammar.
>
>
Warning: You shouldn't declare flag diacritics used in HfstLexC lexicons as diacritics. These are handled using a different mechanism and needn't be mentioned in the HfstTwolC grammar.
 

Rule-variables

This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables had to be declared, but this isn't mandatory in HfstTwolC 2.0.

Revision 452010-03-15 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 132 to 132
  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
Changed:
<
<
Every symbol referred to in some of the rules except epsilon (0) and the word-boundary (#), has to be declared in the alphabet. Otherwise an error message will be issued. ((Is the boundary symbol inserted automatically at the beginning and at the end of all strings?))
>
>
Every symbol referred to in some of the rules except epsilon (0) and the word-boundary (#), has to be declared in the alphabet. Otherwise an error message will be issued. A word boundary symbol # is inserted into the alphabet by default. This symbol is visible as @#@ when you print rule transducers using e.g. HfstFst2Txt.
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Line: 152 to 152
 HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epsilon will be displayed as @0@. ((Note that the epsilon here is actually a symbol for the compiler rather than the same kind of epsilon used in replace rules in a cascade. In two-level rules, it is a place-holder which mapped into an epsilon after the virtual intersecting of the rules.))

Word-Boundary #

Changed:
<
<
You may use # to refer to word-boundaries. If you need the symbol # you should quote it using %, I.e. %#. ((Comment: Is this # virtually inserted at the beginning and end of all strings so that the rules may refer to the initial and final positions? Can one use the same symbol in the lexicon e.g. between parts of compounds? Should one then refer to such an implicitly added boundary by writing %# in the expression in the rule? Or, does %# refer to a hash sign visible in actual text i.e. not the implicit boundary?))
>
>
You may use # to refer to word-boundaries. The word-boundary refers to the absolute beginning and end of a word-form. If you want to use boundaries between the parts of compound-words, you need to choose a different symbol (e.g. +) and see to it that your HfstLexC lexicon inserts the separator symbol between the parts of a compound.

If you need the symbol # you should quote it using %, I.e. %#. The word-boundary is compiled into @#@ in rule transducers. This is a way of preventing the symbol # getting mixed up with a word-boundary. The quoted symbol %# is compiled into # in rule transducers.

An absolute word-boundary is by default appended to the beginning and to the end of HfstLexC lexicons by HfstComposeIntersect, so you don't need to declare it in your lexicon.

 

Diacritics

Revision 442010-03-15 - KimmoKoskenniemi

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 149 to 149
  If your grammar contains the symbol 0, you should refer to it %0.
Changed:
<
<
HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epsilon will be displayed as @0@. ((Note that the epsilon here is actually a symbol for the compiler rather than the same kind of epsilon used in replace rules in a cascade. I two-level rules, it is a place-holder which mapped into an epsilon after the virtual intersecting of the rules.))
>
>
HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epsilon will be displayed as @0@. ((Note that the epsilon here is actually a symbol for the compiler rather than the same kind of epsilon used in replace rules in a cascade. In two-level rules, it is a place-holder which mapped into an epsilon after the virtual intersecting of the rules.))
 

Word-Boundary #

You may use # to refer to word-boundaries. If you need the symbol # you should quote it using %, I.e. %#. ((Comment: Is this # virtually inserted at the beginning and end of all strings so that the rules may refer to the initial and final positions? Can one use the same symbol in the lexicon e.g. between parts of compounds? Should one then refer to such an implicitly added boundary by writing %# in the expression in the rule? Or, does %# refer to a hash sign visible in actual text i.e. not the implicit boundary?))

Revision 432010-03-14 - KimmoKoskenniemi

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 132 to 132
  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
Changed:
<
<
Every symbol referred to in some of the rules except epsilon (0) and the word-boundary (#), has to be declared in the alphabet. Otherwise an error message will be issued.
>
>
Every symbol referred to in some of the rules except epsilon (0) and the word-boundary (#), has to be declared in the alphabet. Otherwise an error message will be issued. ((Is the boundary symbol inserted automatically at the beginning and at the end of all strings?))
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Line: 149 to 149
  If your grammar contains the symbol 0, you should refer to it %0.
Changed:
<
<
HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epislon will be displayed as @0@.
>
>
HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epsilon will be displayed as @0@. ((Note that the epsilon here is actually a symbol for the compiler rather than the same kind of epsilon used in replace rules in a cascade. I two-level rules, it is a place-holder which mapped into an epsilon after the virtual intersecting of the rules.))
 

Word-Boundary #

Changed:
<
<
You may use # to refer to word-boundaries. If you need the symbol # you should quote it using %, I.e. %#.
>
>
You may use # to refer to word-boundaries. If you need the symbol # you should quote it using %, I.e. %#. ((Comment: Is this # virtually inserted at the beginning and end of all strings so that the rules may refer to the initial and final positions? Can one use the same symbol in the lexicon e.g. between parts of compounds? Should one then refer to such an implicitly added boundary by writing %# in the expression in the rule? Or, does %# refer to a hash sign visible in actual text i.e. not the implicit boundary?))
 

Diacritics

The morpho-phonological description of a language may contain symbols, which

Changed:
<
<
  • act as ques for certain phonological rules to act,
>
>
  • act as triggers for certain phonological rules,
 
  • are irrelevant for all other rules and
Changed:
<
<
  • should not be present in the phonological representation of word-forms.
>
>
  • should not be present in the surface representation of word-forms.
  E.g. markers for syllable-boundaries or stress markers and all kinds of markers appended to word-forms by the lexicon may be such symbols.
Line: 235 to 235
 
  • character-pair (e.g. a:b),
  • a more general pair-construct of a single character (e.g. a: or :a),
  • a set construct like a:S, where S is a symbol set,
Changed:
<
<
  • or a disjunction of such centers (e.g. a:b | b: | c:d | a:S).
>
>
  • or a disjunction of such centers (e.g. a:b | b: | c:d | a:S). (Note that the list may not be enclosed in brackets which were allowed when using the Xerox twolc.)
  A context consists of two regular expressions (Li and Ri) separated by an underscore. Schematically
Line: 301 to 301
 

Rules with variables

Changed:
<
<
As an easy short-hand for defining (a possibly large) set of similar two-level rules, rule-variables have been included to hfst-twolc. Consider the following rule, which is needed for gradation of stops in finnish
>
>
As an easy short-hand for defining (a possibly large) set of similar two-level rules, rule-variables have been included in hfst-twolc. Consider the following rule, which is needed for gradation of stops in Finnish
 
"Gradation of k to '"
K:' <=> Vowel Vx _ Vx ClosedOffset ; where Vx in Vowel ;

Revision 422009-10-30 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Revision 412009-10-15 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 147 to 147
 

Epsilon 0

You may use pairs X:0 and 0:X to denote deletions and insertions. Please note that insertion and deletion pairs are just like all other pairs.

Added:
>
>
If your grammar contains the symbol 0, you should refer to it %0.

HfstTwolC represents epsilon internally as @0@, since it isn't possible to have a symbol 0 in the grammar otherwise. E.g. when using HfstFst2Txt, epislon will be displayed as @0@.

 

Word-Boundary #

You may use # to refer to word-boundaries. If you need the symbol # you should quote it using %, I.e. %#.

Revision 402009-10-14 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 88 to 88
  then the regular expression a N: e will match a N:m e and a N:n e.
Changed:
<
<
Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, always matches the empty string, as well.
>
>
Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, always matches the empty string, as well. Please note that ( ... ) implies optionality. Grouping parentheses ( ... ) in e.g. Perl regular expression syntax correspond to [ ... ] in twolc syntax.
  Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.
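
The same grouping behaviour can be tried out with Python's `re` module, which also gives the star higher precedence than concatenation. This is only an analogy: Python patterns match strings of characters, not strings of pairs, and the non-capturing group `(?:...)` plays the role of twolc's `[ ... ]`. The helper `matches` is defined just for this example.

```python
import re

# a b*  in twolc groups as [ a ] [ b* ]; Python's "ab*" groups the same way.
star_on_b = re.compile(r"ab*")
# [ a b ]*  applies the star to the whole group; (?:ab)* is the Python analogue.
star_on_group = re.compile(r"(?:ab)*")
# ( a b )  in twolc is optional; the Python analogue needs an explicit "?".
optional_group = re.compile(r"(?:ab)?")


def matches(pattern, text):
    """True if the whole text is matched by the compiled pattern."""
    return pattern.fullmatch(text) is not None
```

With these definitions, `star_on_b` accepts "abbb" but not "abab", `star_on_group` accepts "abab" and the empty string but not "abb", and `optional_group` accepts only "ab" and the empty string.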

Revision 392009-10-14 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 132 to 132
  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
Changed:
<
<
Every symbol referred to in some of the rules, has to be declared in the alphabet. Otherwise an error message will be issued.
>
>
Every symbol referred to in some of the rules except epsilon (0) and the word-boundary (#), has to be declared in the alphabet. Otherwise an error message will be issued.
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Line: 144 to 144
 ! Characters consist of strings of utf-8 characters. No white-space, though! a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;
Added:
>
>

Epsilon 0

You may use pairs X:0 and 0:X to denote deletions and insertions. Please note that insertion and deletion pairs are just like all other pairs.

Word-Boundary #

You may use # to refer to word-boundaries. If you need the symbol # you should quote it using %, I.e. %#.
 

Diacritics

The morpho-phonological description of a language may contain symbols, which

Revision 382009-10-13 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 535 to 535
 

Known bugs

Changed:
<
<
  • Warnigns for unequal value lists in variable-rules with keyword matched and mixed aren't working correctly. This may result in the compiler getting stuck in an endless loop.
>
>
  • Warnings for unequal value lists in variable-rules with keyword matched and mixed aren't working correctly. This may result in the compiler getting stuck in an endless loop. This is going to get fixed.
  • The word-boundary symbol # needs to be declared separately in the alphabet (or as a diacritic). This is going to get fixed.
 

References

Revision 372009-10-12 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 307 to 307
  matched; The rule states, that the morpho-phonemes K, P, T vanish, when they serve as the onset of a closed syllable and are preceded by a surface k, p or t respectively. Here the occurrences of the variable Cx are matched with those of Cy. For instance, nothing is said about an input K preceded by an output p. The rule is only concerned with input-level characters K preceded by output-level characters k.
Changed:
<
<
Occurences of variables are matched by default. If you don't want this to happen, you may either use several where parts to govern different variables or replace the keyword matched by freely.
>
>
Occurrences of variables aren't matched by default. If you want them to be matched, you have to use the keyword matched. Other possible keywords are freely and mixed. These are most easily explained using examples.
 
Changed:
<
<

Generalized Context-Restrictions (NOT IMPLEMENTED YET)

>
>
The rule
 a:b => X _ Y; where X in (a b) Y in (a b) freely;

corresponds to the joint effect of the simple two-level rules

a:b => a _ a ;
a:b => a _ b ;
a:b => b _ a ;
a:b => b _ b ;
 
Changed:
<
<
Generalized context-restrictions allow the definition of rules with a more general center-language, than normal two-level rules. They also let the user constrain the application of a particular rule to some contexts.
>
>
The rule
 a:b => X _ Y; where X in (a b) Y in (a b) mixed;

corresponds to the joint effect of the simple two-level rules

a:b => a _ b ;
a:b => b _ a ;
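
The keywords can be read as different ways of pairing the value lists. The following Python sketch only illustrates that reading and is not how hfst-twolc is implemented. The function name expand is hypothetical, and the interpretation of mixed (every combination in which the two variables take values from different list positions) is an assumption inferred from the examples above.

```python
from itertools import product

def expand(values_x, values_y, mode):
    """Return the (X, Y) value combinations a variable rule stands for."""
    if mode == "matched":
        # Values are paired position by position.
        return list(zip(values_x, values_y))
    indexed = product(enumerate(values_x), enumerate(values_y))
    if mode == "freely":
        # Every combination of values is allowed.
        return [(x, y) for (_, x), (_, y) in indexed]
    if mode == "mixed":
        # Assumed reading: only combinations from different list positions.
        return [(x, y) for (i, x), (j, y) in indexed if i != j]
    raise ValueError(f"unknown keyword: {mode}")

# Print the simple rules that the mixed example corresponds to.
for x, y in expand("ab", "ab", "mixed"):
    print(f"a:b => {x} _ {y} ;")
```

Calling expand("ab", "ab", "freely") instead lists all four contexts of the freely example.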
 

Weighted rules (NOT FULLY IMPLEMENTED YET)

Line: 398 to 414
 

Logical Errors

Changed:
<
<
hfst-twolc currently only gives one kind of logical error. Let a grammar contain the following rule
>
>
hfst-twolc currently only gives two kinds of logical error.

Symbols, which are used in rules, but not declared in the alphabet, give a logical error.

Let a grammar contain the following rule

 
"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ; where Cx in ( K P T )
Line: 411 to 431
  matched; ^ HERE Aborted.
Added:
>
>
Warning, important Warnings about unequally long value lists aren't working properly.
 
Deleted:
<
<

Warnings

The following is an example of a warning given by hfst-twolc

WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
  ^ HERE
The program attempts to report the number of the line, which gives the warning and also point to the place, which gives the warning. Note, that the place and line given may not be accurate. When they're not, the problem is often on the previous line.

The number [1] means, that this is a warning of type 1. There are seven types of warnings. These are

[1] A pair X:Y was used in the grammar, but it wasn't declared in the alphabet and neither X, nor Y was the name of a set.

[2] The same set is defined twice, or a set is defined, which has the same name as a symbol in the alphabet.

[3] The same definition is declared twice, or a definition has the same name as a set or a symbol in the alphabet.

[4] The same rule-name is used for two rules.

[5] A construct X: or :X was used, where X is a symbol or a set. The expression didn't match a single pair in the alphabet.

[6] The construction R^i was used, where i was not a positive integer.

[7] Warning for a pair x:y where x is a diacritic and y is non-zero. Diacritics are always realised as zero, so y will be discarded.

 

Resolution of Conflicts between the Rules

Line: 444 to 440
  A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. Normal rule-interaction constrains the surface-realizations of some input-form, but does not lose all of them. In contrast to this, rule-conflicts often filter away some input-forms completely. There are many kinds of conflicts, but for the time-being only right-arrow conflicts and left-arrow-conflicts are automatically resolved by hfst-twolc.
Changed:
<
<
Unless hfst-twolc is run with the commandline-parameter --no-report, it will report all rule-conflicts, it observes and if it is run with the parameter --resolve, it will resolve the conflicts.
>
>
Unless hfst-twolc is run with the commandline-parameter --silent, it will report all rule-conflicts. It always resolves right-arrow conflicts and it resolves left-arrow conflicts if it is run with the parameter --resolve.
  The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen, Koskenniemi and Kaplan 1987.
Line: 535 to 531
  The words and constructs may be used in rules by quoting with %. E.g. %? means question-mark, not any character-pair defined in the alphabet and %Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example %Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.
Changed:
<
<

A Test-Tool for Grammars

Differences from Xerox twolc

This section contains a list of features, which differ between hfst-twolc and Xerox twolc.

Unimplemented Features in hfst-twolc

This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.

Partial implementations in hfst-twolc

Since this is an alpha-version of hfst-twolc, there are many features, that have limited functionality.

The where ... ( matched | freely | mixed ) construction is implemented, but is partial in some respects. You can either write a rule with a variable Vx

"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in Vowel ;
or write
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in ( a e i o u y ä ö ) ;
but you can't embed the Vowel set in the range, i.e. rules like
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in (Vowel) ;
don't work.

There is no support for either the freely or mixed options. E.g.

X:Y => a _ ; where X in (s t) Y in (u v);
means the same as
X:Y => a _ ; where X in (s t) Y in (u v) matched;
i.e. is equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ; 
Though there is no support for freely, the option can easily be simulated by writing the rule
X:Y => a _ ; where X in (s t) and Y in (u v);
This makes the rule equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ;
s:v => a _ ;
t:u => a _ ; 

Rules defined with variables may easily come into conflict with each other. For now this is treated as any other rule-conflict. Consider the rule

x:y => A _ A ; where A in (s t);
The subcases
x:y => s _ A ;
x:y => t _ A ;
are in a right-arrow conflict with each-other. This is easily solved by conflict-resolution. The case of left-arrow rules is less fortunate. They may easily come into unresolvable conflict with each-other, when the center involves variables.

Conflict-resolution may be very slow.

Substitution of values for variables may produce new pairs, which haven't been declared in the alphabet. For now hfst-twolc can only warn about such new pairs occurring on the left side of the rule-operator.

The OpenFst-implementation may be very slow.

Permanent differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

>
>
Warning, important HfstTwolC reserves symbol-names beginning with two underscores for internal use.
 
Added:
>
>

Known bugs

 
Changed:
<
<
  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.
  • All rule-variables have to be declared in the Rule-variables section in the header of the grammar.
>
>
  • Warnigns for unequal value lists in variable-rules with keyword matched and mixed aren't working correctly. This may result in the compiler getting stuck in an endless loop.
 

References

Revision 362009-10-12 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 38 to 38
 

Syntax

Changed:
<
<
A twol-grammar consists of five parts: Alphabet, Diacritics, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.
>
>
A twol-grammar consists of five parts: Alphabet, Diacritics, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line. There is a sixth, optional part, Rule-variables, which declares variables used in the rules.
 
Alphabet
Line: 79 to 79
 
  • a:0 and 0:a correspond to deletion and insertion of a.
  • 0 matches the empty string (this is probably useless...).
Changed:
<
<
Warning, important When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : ) not (:). Omitting white-space will break the scanning of the grammar (this might be fixed in the future).
>
>
Warning, important When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : ) not (:). Omitting white-space might break the scanning of the grammar (this might be fixed in the future).
  By concatenating pairs, one can build longer regular expressions matching strings of pairs. If the alphabet is declared
Line: 126 to 126
 
[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]
Added:
>
>
When in doubt about which operator binds the strongest, use brackets [ ... ].
 

The Alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.

Line: 150 to 152
 
  • are irrelevant for all other rules and
  • should not be present in the phonological representation of word-forms.
Changed:
<
<
E.g. markers for syllable-boundaries and all kinds of markers appended to word-forms by the lexicon may be such symbols.
>
>
E.g. markers for syllable-boundaries or stress markers and all kinds of markers appended to word-forms by the lexicon may be such symbols.
  It's easiest to declare such symbols diacritics in hfst-twolc. This is done by mentioning them in the section Diacritics, which may look like
Line: 169 to 171
 I:j <=> Vowel _ Vowel ; allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.
Added:
>
>
Warning, important You shouldn't declare flag-diacritics used in HfstLexC lexicons as diacritics. These are handled using a different mechanism and needn't be mentioned in the HfstTwolC grammar.
 

Rule-variables

Changed:
<
<
This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables needed to be declared, but this isn't mandatory in HfstTwolC 2.0.
>
>
This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables had to be declared, but this isn't mandatory in HfstTwolC 2.0.
  Rules may contain variables. Any variable used, can be declared in the Rule-variables section.
Line: 193 to 197
 
t:ө <= :Vowel _ :Vowel ;
Changed:
<
<
The construction :Vowel will match any pair, used in some rule, where the output symbol is a vowel.
>
>
The construction :Vowel will match any pair, used in some rule, where the output symbol is a vowel. Please note that pairs which result from replacing variables with their values in rules also add to set constructions. Consider the following vowel-harmony rule regarding the archiphonemes %^A, %^O and %^U
VMP:Vx <=> BackVowel :* _ ; where VMP in (%^A %^O %^U) Vx in (a o u) matched; 
The pairs %^A:a, %^O:o and %^U:u will match :Vowel regardless of whether the pairs have been declared in the alphabet.
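
As a rough illustration of this behaviour, the Python sketch below builds the pairs produced by matched substitution and checks which of them a construction like :Vowel would accept. It is only a sketch: the contents of VOWEL and the helper names substituted_pairs and matches_output_set are hypothetical and not part of hfst-twolc.

```python
# Hypothetical Vowel set; the real one is whatever the grammar's Sets section defines.
VOWEL = {"a", "e", "i", "o", "u"}

def substituted_pairs(inputs, outputs):
    """Pairs produced by replacing matched variables with their values."""
    return [f"{i}:{o}" for i, o in zip(inputs, outputs)]

def matches_output_set(pair, symbol_set):
    """True if the output side of an input:output pair belongs to symbol_set."""
    return pair.rsplit(":", 1)[1] in symbol_set

# Pairs arising from the vowel-harmony rule with VMP and Vx matched.
pairs = substituted_pairs(["%^A", "%^O", "%^U"], ["a", "o", "u"])
print([p for p in pairs if matches_output_set(p, VOWEL)])
```

All three substituted pairs pass the check, although none of them appears in an explicit alphabet declaration in this sketch.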
  It is possible to define a set having the same name as an alphabet character. There is no guarantee what will happen, if this is done.

Revision 352009-10-08 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 42 to 42
 
Alphabet
Changed:
<
<
! The alphabet should contain symbols which are used in the grammar.
! Characters consist of strings of utf-8 characters. No white-space, though!
>
>
! The alphabet should contain all symbols which are used in the grammar.
! Symbols consist of strings of utf-8 characters. Reserved words and white-space
! need to be quoted using %.
 a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:m N:n ;

Sets

Line: 78 to 79
 
  • a:0 and 0:a correspond to deletion and insertion of a.
  • 0 matches the empty string (this is probably useless...).
Changed:
<
<
Warning, important When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : a) not (: a). Omitting white-space will break the scanning of the grammar (this might be fixed in the future).
>
>
Warning, important When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : ) not (:). Omitting white-space will break the scanning of the grammar (this might be fixed in the future).
  By concatenating pairs, one can build longer regular expressions matching strings of pairs. If the alphabet is declared
Line: 93 to 94
  There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be a N:n N:m e and let R denote a regular expression. The unary operators are:
Changed:
<
<
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a (NOT IMPLEMENTED for some reason... coming soon).
>
>
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a.
 
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, using the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the matching strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* using the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
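
The three operators above can be mimicked for the simplest case, where R is a single pair. The Python sketch below is an informal illustration only: it treats a pair-string as a list of pair symbols, uses the alphabet a N:n N:m e from the examples, and the function names are hypothetical rather than part of hfst-twolc.

```python
ALPHABET = ["a", "N:n", "N:m", "e"]  # alphabet used in the examples above

def contains(pair_string, pair):
    """$R for a single-pair R: the pair occurs at least once."""
    return pair_string.count(pair) >= 1

def contains_once(pair_string, pair):
    """$.R for a single-pair R: the pair occurs exactly once."""
    return pair_string.count(pair) == 1

def term_complement(pair):
    r"""\R for a single-pair R: every alphabet pair other than R."""
    return [p for p in ALPHABET if p != pair]

print(contains(["e", "a", "N:m"], "a"))
print(contains_once(["a", "e", "a"], "a"))
print(term_complement("a"))
```

The term-complement returns single pairs only, which is why it differs from the negation of a language.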
Line: 169 to 170
 allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.

Rule-variables

Changed:
<
<
This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables needed to be declared, but this isn't madatory in HfstTwolC 2.0.
>
>
This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables needed to be declared, but this isn't mandatory in HfstTwolC 2.0.
  Rules may contain variables. Any variable used, can be declared in the Rule-variables section.
Line: 304 to 305
  Generalized context-restrictions allow the definition of rules with a more general center-language, than normal two-level rules. They also let the user constrain the application of a particular rule to some contexts.
Changed:
<
<

Weighted rules (NOT IMPLEMENTED YET)

>
>

Weighted rules (NOT FULLY IMPLEMENTED YET)

 
Changed:
<
<
It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation.
>
>
It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation. At this time it is only possible to compile weighted rules with zero weights.
 

Error-Messages and Warnings

Line: 314 to 315
 
  • don't conform to the syntax specified in this manual,
  • are illogical,
Changed:
<
<
  • result rule-transducer, whose intersection might be empty or
>
>
  • result in rule-transducer, whose intersection might be empty or
 
  • over-shadow other statements.

error messages or warnings will be issued. Statements, which make it impossible to complete the compilation of the grammar lead to error-messages and disruption of the compilation-process. Statements, that over-shadow other statements, or may lead to rule-sets whose intersection is empty lead to warning-messages.

Line: 329 to 330
 
ERROR ON LINE 79:
syntax error, unexpected CENTER_MARKER, expecting DIFFERENCE or INTERSECTION or UNION or RIGHT_SQUARE_BRACKET
Deleted:
<
<
Cx:Cy <=> [ h | Liquid | Vowel: _ Vowel: Cons: [ Cons: | #:0 ] ; ^ HERE
 Aborted.

An error-message consists of

Revision 342009-10-07 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 22 to 22
 
-q, --quiet Don't print any diagnostics messages.
-r, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
-N, --names If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstPairTest.
Added:
>
>
-w, --weighted Compile the rules into weighted transducers with zero weights.
 
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.
-u, --usage Display usage.
Line: 37 to 38
 

Syntax

Changed:
<
<
A twol-grammar consists of six parts: Alphabet, Diacritics, Rule-Variables, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.
>
>
A twol-grammar consists of five parts: Alphabet, Diacritics, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.
 
Alphabet
Changed:
<
<
! The alphabet should contain all pairs used in the rules.
>
>
! The alphabet should contain symbols which are used in the grammar.
 ! Characters consist of strings of utf-8 characters. No white-space, though!
Changed:
<
<
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;
>
>
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:m N:n ;
  Sets Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
Line: 55 to 56
  Rules
Deleted:
<
<
! input/output -pairs for testing the rule-set:

input: k a N p a n output: k a m m a n

input: k a N T a n output: k a n n a n

input: k a m p i output: k a m p i

 "N:m before input-character p" ! A common morpho-phonetic phenomenon N:m <=> _ p: ;
Line: 79 to 69
 

Regular Expression Syntax

Changed:
<
<
Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:
>
>
Any pair of symbols defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:
 
Changed:
<
<
  • a:? and a: match any pair in the alphabet having input-character a.
  • ?:a and :a match any pair in the alphabet having output-character a.
>
>
  • a:? and a: match any pair in the grammar having input-character a.
  • ?:a and :a match any pair in the grammar having output-character a.
 
  • ? matches any pair in the alphabet.
  • ?:? same as ?. You may also use : surrounded by white-space.
Changed:
<
<
  • 0 matches the empty string.
>
>
  • a:0 and 0:a correspond to deletion and insertion of a.
  • 0 matches the empty string (this is probably useless...).
 
Changed:
<
<
Warning, important Pair-constructions like [:a may cause some problems. Now [ :a is preferable.
>
>
Warning, important When you use constructions like :a, make sure to surround them with white-space, i.e. use ( :a) not (:a) and ( : a) not (: a). Omitting white-space will break the scanning of the grammar (this might be fixed in the future).
  By concatenating pairs, one can build longer regular expressions matching strings of pairs. If the alphabet is declared
Alphabet
Changed:
<
<
a N:n N:m e
>
>
a e N:m N:n
 
Changed:
<
<
then the regular expression a N: e will match a N:n e and a N:m e.
>
>
then the regular expression a N: e will match a N:m e and a N:n e.
  Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, always matches the empty string, as well.
Line: 102 to 93
  There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be a N:n N:m e and let R denote a regular expression. The unary operators are:
Changed:
<
<
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a.
>
>
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a (NOT IMPLEMENTED for some reason... coming soon).
 
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, using the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the matching strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* using the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
Line: 138 to 129
  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
Changed:
<
<
Every pair of character referred to in some of the rules, has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs, which aren't declared in the alphabet.
>
>
Every symbol referred to in some of the rules, has to be declared in the alphabet. Otherwise an error message will be issued.
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Line: 146 to 137
 
Alphabet
Changed:
<
<
! The alphabet should contain all pairs used in the rules.
>
>
! The alphabet should contain all symbols used in the rules.
 ! Characters consist of strings of utf-8 characters. No white-space, though!
 a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;
Line: 178 to 169
 allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.

Rule-variables

Changed:
<
<
Rules may contain variables. Any variable used, should be declared in the Rule-variables section. If this isn't done, there will be warning-messages, but the grammar will still compile correctly.
>
>
This section exists, so that grammars which compiled under HfstTwolC 1.0 also compile under HfstTwolC 2.0. In HfstTwolC 1.0 rule variables needed to be declared, but this isn't madatory in HfstTwolC 2.0.

Rules may contain variables. Any variable used, can be declared in the Rule-variables section.

  An example
Rule-variables
Deleted:
<
<
! All rule-variables, that are used, should be declared. ! If this isn't done, annoying warning-messages will be ! issued (otherwise the grammar is constructed as it ! should be)
  Cx Cy Cz Vx Vy ;

Sets

Line: 204 to 192
 
t:ө <= :Vowel _ :Vowel ;
Added:
>
>
The construction :Vowel will match any pair, used in some rule, where the output symbol is a vowel.

It is possible to define a set having the same name as an alphabet character. There is no guarantee what will happen, if this is done.

 

Definitions

Line: 214 to 205
  The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. All sets may be used in definitions and all definitions, which have been made before a particular definition, may be used as a part of that definition.
Changed:
<
<
It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.
>
>
It is possible to define a named regular expression having the same name as a set or alphabet character. There is no guarantee what will happen, if this is done.
 

Rules

Two-level rules consist of a center, a rule-operator and contexts.

Changed:
<
<
The center-language (C) is a character-pair (e.g. a:b), a more general pair-construct of a single character (e.g. a: or :a) or set of character-pairs and character-pair-constructs (e.g. [a:b | b: | c:d ]). A context consists of two regular expressions (Li and Ri) separated by an underscore. Schematically
>
>
The center-language (C) is a

  • character-pair (e.g. a:b),
  • a more general pair-construct of a single character (e.g. a: or :a),
  • a set construct like a:S, where S is a symbol set,
  • or a disjunction of such centers (e.g. a:b | b: | c:d | a:S).

A context consists of two regular expressions (Li and Ri) separated by an underscore. Schematically

 
C OP L1 _ R1 ;
     L2 _ R1 ;

Revision 332009-10-07 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 10 to 10
 

Usage

Changed:
<
<
hfst-twolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
>
>
USAGE: hfst-twolc [ OPTIONS ] [ GRAMMARFILE ]
 

Parameters

Parameter name Meaning
Changed:
<
<
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
test Toggle test-mode. If this parameter is present, the rules won't be compiled, but tested instead.
no-report Don't warn about conflicts between rules. If omitted, all rule-conflicts will give a warning.
resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
savenames If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstPairTest.
verbose Display detailed information concerning the compilation process.
>
>
-i, --input the rule file.
-o, --output If omitted, the resulting transducer is written to STDOUT.
-s, --silent Don't print any diagnostics messages.
-q, --quiet Don't print any diagnostics messages.
-r, --resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
-N, --names If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstPairTest.
-v, --verbose Display detailed information concerning the compilation process.
-h, --help Display a help-message.
-u, --usage Display usage.
 

Outline

Revision 322009-10-05 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 33 to 34
 
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b
Deleted:
<
<

Getting the Program

There is now a binary of the program available (for x86-architecture), as well as the source-code, on the Source code page on the hfst webpage (on the webpage of the Department of General Linguistics in Helsinki University).

The program may be downloaded from the CVS-repository on Corpus. The path of the CVS-repository is /c/appl/ling/koskenni/cvsrepo/. Currently hfst-twolc is in the directory htwolc, which contains the files

/c/appl/ling/koskenni/cvsrepo/htwolc:
yhteensä 240
drwxrwxr-x  2 silfverb kikosken  4096 22. syys   12:39 Attic
-r--r--r--  1 silfverb kikosken 11564 17. syys   15:16 commandline.h,v
-r--r--r--  1 silfverb kikosken 63833 17. syys   15:16 htwolc.yy,v
-r--r--r--  1 silfverb kikosken  5430 10. syys   17:42 Makefile,v
-r--r--r--  1 silfverb kikosken  9299 25. heinä  12:41 muutokset,v
-r--r--r--  1 silfverb kikosken 55475 17. syys   15:16 operations.C,v
-r--r--r--  1 silfverb kikosken 31636 10. syys   17:42 operations.h,v
-r--r--r--  1 silfverb kikosken   611 26. elo    10:28 README,v
drwxrwxr-x  3 silfverb kikosken  4096 22. syys   12:47 test
-r--r--r--  1 silfverb kikosken 18502 17. syys   15:16 tokenizer.ll,v
-r--r--r--  1 silfverb kikosken 15997 22. syys   12:39 tutorial-fragment.rst,v

/c/appl/ling/koskenni/cvsrepo/htwolc/test:
yhteensä 8
drwxrwxr-x  2 silfverb kikosken 4096 22. syys   12:47 Attic
-r--r--r--  1 silfverb kikosken 3402 22. syys   12:47 gradation-rules.twol,v

Dependencies

You should have hfst installed.

Installing the Program

If you're working on corpus, make should be sufficient. You may need to modify the variable

HFSTPATH=../hfst/
depending on, where you've got hfst installed.
 

Syntax

A twol-grammar consists of six parts: Alphabet, Diacritics, Rule-Variables, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.

Line: 649 to 612
  http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/twolc-92/twolc92.html

  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006
Added:
>
>

Obtaining the program and installing

hfst-twolc is a part of HfstCommandLineTools.

 

Revision 312009-10-05 - ErikAxelson

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 13 to 13
 hfst-twolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
Added:
>
>

Parameters

 
Parameter name Meaning
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.

Revision 302009-09-29 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Added:
>
>

Purpose

Compile a two-level grammar in Xerox Twolc formalism into a weighted or unweighted HFST transducer.

 

Usage

hfst-twolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]

Revision 29 2009-05-31 - KristerLinden

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 6 to 6
 

Usage

Changed:
<
<
htwolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
>
>
hfst-twolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
 

Parameter name Meaning

Revision 28 2009-03-11 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 16 to 16
 
test Toggle test-mode. If this parameter is present, the rules won't be compiled, but tested instead.
no-report Don't warn about conflicts between rules. If omitted, all rule-conflicts will give a warning.
resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
Changed:
<
<
savenames If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstTest.
>
>
savenames If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstPairTest.
 
verbose Display detailed information concerning the compilation process.

Outline

Revision 27 2008-11-05 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 562 to 562
 

A Test-Tool for Grammars

Added:
>
>

Differences from Xerox twolc

 
Changed:
<
<

Unimplemented Features

>
>
This section contains a list of features, which differ between hfst-twolc and Xerox twolc.

Unimplemented Features in hfst-twolc

  This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.
Changed:
<
<

Partial implementations

>
>

Partial implementations in hfst-twolc

  Since this is an alpha-version of hfst-twolc, there are many features, that have limited functionality.
Line: 624 to 627
  The OpenFst-implementation may be very slow.
Changed:
<
<

Differences from Xerox twolc

>
>

Permanent differences from Xerox twolc

  This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.
Added:
>
>
  • All rule-variables have to be declared in the Rule-variables section in the header of the grammar.

 

References

Revision 26 2008-11-05 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

hfst-twolc − A Two-Level Grammar Compiler

Line: 17 to 17
 
no-report Don't warn about conflicts between rules. If omitted, all rule-conflicts will give a warning.
resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
savenames If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstTest.
Added:
>
>
verbose Display detailed information concerning the compilation process.
 

Outline

Terms and concepts:

Line: 146 to 147
 
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.
Changed:
<
<
By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators would bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.
>
>
By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators were to bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.
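The effect of left-binding can be checked with a small Python sketch. It models each language as a finite set of strings, which is an assumption made purely for illustration; the helper name `left_assoc_difference` is hypothetical and not part of hfst-twolc.

```python
from functools import reduce

def left_assoc_difference(languages):
    """Fold set difference from the left: ((L1 - L2) - L3) ..."""
    return reduce(lambda acc, lang: acc - lang, languages)

a = {"a"}  # the language containing only the string "a"

left = left_assoc_difference([a, a, a])  # [ a - a ] - a
right = a - (a - a)                      # a - [ a - a ]

print(left)   # set()
print(right)  # {'a'}
```

The first result is the empty language, the second is the language a, as described above.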
 

Operator Precedence

Changed:
<
<
The operators in htwolc have different precedence. A rule of thumb for precedence: unary operators bind strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.
>
>
The operators in htwolc have different precedence. A rule of thumb for precedence: unary operators have the strongest bind, then concatenation and finally binary operators. The constructions [ ... ] and ( ... ) override other precedences.
  Operators ordered by precedence from strongest to weakest:
Line: 189 to 190
  E.g. markers for syllable-boundaries and all kinds of markers appended to word-forms by the lexicon may be such symbols.
Changed:
<
<
It's easiest to declare such symbols diacritics in hfst-twolc. This is done by menioning them in the section Diacritics, which may look like
>
>
It's easiest to declare such symbols diacritics in hfst-twolc. This is done by mentioning them in the section Diacritics, which may look like
 
Diacritics

Revision 25 2008-10-10 - KristerLinden

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Changed:
<
<

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

>
>

hfst-twolc − A Two-Level Grammar Compiler

 

Revision 24 2008-09-26 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 16 to 16
 
test Toggle test-mode. If this parameter is present, the rules won't be compiled, but tested instead.
no-report Don't warn about conflicts between rules. If omitted, all rule-conflicts will give a warning.
resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
Changed:
<
<
>
>
savenames If this option is given, the names of the rules in the grammar file are saved in a file .names. This file may be given as a parameter to the utility HfstTest.
 

Outline

Terms and concepts:

Revision 23 2008-09-24 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 34 to 34
 
/c/appl/ling/koskenni/cvsrepo/htwolc:
yhteensä 240
Changed:
<
<
drwxrwxr-x 2 silfverb kikosken 4096 22. syys   12:39 Attic -r--r--r-- 1 silfverb kikosken 11564 17. syys   15:16 commandline.h,v -r--r--r-- 1 silfverb kikosken 63833 17. syys   15:16 htwolc.yy,v -r--r--r-- 1 silfverb kikosken 5430 10. syys   17:42 Makefile,v -r--r--r-- 1 silfverb kikosken 9299 25. heinä  12:41 muutokset,v -r--r--r-- 1 silfverb kikosken 55475 17. syys   15:16 operations.C,v -r--r--r-- 1 silfverb kikosken 31636 10. syys   17:42 operations.h,v -r--r--r-- 1 silfverb kikosken 611 26. elo    10:28 README,v drwxrwxr-x 3 silfverb kikosken 4096 22. syys   12:47 test -r--r--r-- 1 silfverb kikosken 18502 17. syys   15:16 tokenizer.ll,v -r--r--r-- 1 silfverb kikosken 15997 22. syys   12:39 tutorial-fragment.rst,v
>
>
drwxrwxr-x 2 silfverb kikosken 4096 22. syys 12:39 Attic -r--r--r-- 1 silfverb kikosken 11564 17. syys 15:16 commandline.h,v -r--r--r-- 1 silfverb kikosken 63833 17. syys 15:16 htwolc.yy,v -r--r--r-- 1 silfverb kikosken 5430 10. syys 17:42 Makefile,v -r--r--r-- 1 silfverb kikosken 9299 25. heinä 12:41 muutokset,v -r--r--r-- 1 silfverb kikosken 55475 17. syys 15:16 operations.C,v -r--r--r-- 1 silfverb kikosken 31636 10. syys 17:42 operations.h,v -r--r--r-- 1 silfverb kikosken 611 26. elo 10:28 README,v drwxrwxr-x 3 silfverb kikosken 4096 22. syys 12:47 test -r--r--r-- 1 silfverb kikosken 18502 17. syys 15:16 tokenizer.ll,v -r--r--r-- 1 silfverb kikosken 15997 22. syys 12:39 tutorial-fragment.rst,v
  /c/appl/ling/koskenni/cvsrepo/htwolc/test: yhteensä 8
Changed:
<
<
drwxrwxr-x 2 silfverb kikosken 4096 22. syys   12:47 Attic -r--r--r-- 1 silfverb kikosken 3402 22. syys   12:47 gradation-rules.twol,v
>
>
drwxrwxr-x 2 silfverb kikosken 4096 22. syys 12:47 Attic -r--r--r-- 1 silfverb kikosken 3402 22. syys 12:47 gradation-rules.twol,v
 

Dependencies

Line: 185 to 185
 
  • act as cues that trigger certain phonological rules,
  • are irrelevant for all other rules and
Changed:
<
<
  • should not be present in the phonological
>
>
  • should not be present in the phonological representation of word-forms.

E.g. markers for syllable-boundaries and all kinds of markers appended to word-forms by the lexicon may be such symbols.

It's easiest to declare such symbols diacritics in hfst-twolc. This is done by menioning them in the section Diacritics, which may look like

Diacritics

      ! The symbol . marks a syllable-boundary.
      . ;

Diacritics have the following properties

  • They always correspond to 0 on the output-side.
  • All diacritics, that aren't explicitly mentioned in a rule are invisible to that rule.

E.g. given the diacritics-declaration above and the set Vowel given in the next section, the rule

I:j <=> Vowel _ Vowel ;
allows the correspondence v i i k .:0 k o .:0 I:j a despite the intervening pair .:0.
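A rough Python sketch of this behaviour follows. It checks only the right-arrow half of the rule (every I:j must stand between two vowels) and treats Vowel as identity pairs such as a:a; both simplifications, and all function names, are assumptions for illustration rather than a description of how hfst-twolc compiles the rule.

```python
VOWELS = set("aeiouyåäö")   # the Vowel set used in the manual's examples
DIACRITICS = {"."}          # symbols declared in the Diacritics section

def split_pair(pair):
    """'I:j' -> ('I', 'j'); a bare symbol such as 'a' abbreviates 'a:a'."""
    return tuple(pair.split(":")) if ":" in pair else (pair, pair)

def is_vowel(pair):
    inp, out = split_pair(pair)
    return inp == out and inp in VOWELS

def ij_between_vowels(pair_string, hide_diacritics=True):
    """Check that every I:j has a vowel pair on both sides."""
    pairs = pair_string.split()
    if hide_diacritics:
        # Diacritics not mentioned in the rule are invisible to it.
        pairs = [p for p in pairs if split_pair(p)[0] not in DIACRITICS]
    return all(
        0 < i < len(pairs) - 1
        and is_vowel(pairs[i - 1])
        and is_vowel(pairs[i + 1])
        for i, p in enumerate(pairs)
        if p == "I:j"
    )

example = "v i i k .:0 k o .:0 I:j a"
print(ij_between_vowels(example))                         # True
print(ij_between_vowels(example, hide_diacritics=False))  # False
```

With the diacritic pairs hidden, I:j is seen directly after o, so the context is satisfied; if .:0 were an ordinary pair, it would block the match.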
 

Rule-variables

Rules may contain variables. Any variable used should be declared in the Rule-variables section. If this isn't done, there will be warning-messages, but the grammar will still compile correctly.
Line: 386 to 405
 
LEFT_SQUARE_BRACKET [
RIGHT_SQUARE_BRACKET ]
LEFT_BRACKET (
Changed:
<
<
=RIGHT_BRACKET
)
>
>
RIGHT_BRACKET
)
 
LEFT_RESTRICTION_ARROW
/<=
LEFT_ARROW
<=
RIGHT_ARROW
=>

Revision 22 2008-09-24 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 66 to 66
 

Syntax

Changed:
<
<
A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.
>
>
A twol-grammar consists of six parts: Alphabet, Diacritics, Rule-Variables, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.
 
Alphabet
Line: 179 to 179
 ! Characters consist of strings of utf-8 characters. No white-space, though! a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;
Added:
>
>

Diacritics

The morpho-phonological description of a language may contain symbols, which

  • act as cues that trigger certain phonological rules,
  • are irrelevant for all other rules and
  • should not be present in the phonological
 

Rule-variables

Rules may contain variables. Any variable used should be declared in the Rule-variables section. If this isn't done, there will be warning-messages, but the grammar will still compile correctly.

Revision 21 2008-09-22 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 314 to 314
 

Error-Messages and Warnings

Added:
>
>
If the grammar given to hfst-twolc contains statements, which

  • don't conform to the syntax specified in this manual,
  • are illogical,
  • result in rule-transducers whose intersection might be empty or
  • over-shadow other statements.

error messages or warnings will be issued. Statements, which make it impossible to complete the compilation of the grammar lead to error-messages and disruption of the compilation-process. Statements, that over-shadow other statements, or may lead to rule-sets whose intersection is empty lead to warning-messages.

Error-Messages

Errors in hfst-twolc are divided into two categories: syntax errors and logical errors.

Syntax Errors

A syntax-error is given when the input-file violates the syntax-specifications in this manual. When this happens, hfst-twolc gives an error-message and the compilation-process ceases without writing to the output-file. An example of a syntax-related error-message is

ERROR ON LINE 79:
syntax error, unexpected CENTER_MARKER, expecting DIFFERENCE or INTERSECTION or UNION or RIGHT_SQUARE_BRACKET
Cx:Cy <=>  [ h | Liquid | Vowel:   _ Vowel: Cons: [ Cons: | #:0 ] ;
                                    ^ HERE
Aborted.

An error-message consists of

  • the number of the line where the error occurred,
  • a statement of which token caused the compilation to halt and what kind of token was expected,
  • the line which contained the error and
  • a marker which points out the place where the error occurred.

Note that it is not always possible to say exactly where the actual error was. Sometimes even the line on which the error occurs can't be singled out.

The correspondences between tokens and token-names should be pretty clear, but here's a list

Token-name Token
ALPHABET_DECLARATION Alphabet
DIACRITICS_DECLARATION Diacritics
VARIABLE_DECLARATION Rule-Variables
DEFINITION_DECLARATION Definitions
SETS_DECLARATION Sets
RULES_DECLARATION Rules
WHERE where
MATCHED matched
MIXED mixed
IN in
NEWLINE A newline.
RULE_NAME A quoted string of characters (except ").
AND and
STAR *
PLUS +
COMPLEMENT ~
TERM_COMPLEMENT \
CONTAINMENT_ONCE $.
CONTAINMENT $
ANY ?
UNION |
INTERSECTION &
POWER ^
DIFFERENCE -
NUMBER A positive or negative integer.
EPSILON 0
LEFT_SQUARE_BRACKET [
RIGHT_SQUARE_BRACKET ]
LEFT_BRACKET (
RIGHT_BRACKET )
LEFT_RESTRICTION_ARROW /<=
LEFT_ARROW <=
RIGHT_ARROW =>
LEFT_RIGHT_ARROW <=>
PAIR_SEPARATOR_BOTH A : preceded by white-space and followed by something that isn't a SYMBOL.
PAIR_SEPARATOR_RIGHT A : preceded by white-space and followed by a SYMBOL.
PAIR_SEPARATOR_LEFT A : preceded by a SYMBOL and followed by something that isn't a SYMBOL.
PAIR_SEPARATOR A : preceded and followed by a SYMBOL.
EOL ;
EQUALS =
CENTER_MARKER _
SYMBOL A sequence of characters, where every special-character (i.e. one with a special meaning like [, ;, or %) has been quoted. A symbol may not contain newlines!

Logical Errors

hfst-twolc currently only gives one kind of logical error. Let a grammar contain the following rule

"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ; where Cx in ( K P T )
                                  Cy in ( k p )
                            matched;
Here the sets ( K P T ) and ( k p ) are of unequal length, so it is impossible to match the variables Cx and Cy. An error-message is issued:
ERROR ON LINE 87:
Cx and Cy can't be matched since they correspond to lists of un-equal lengths!
                            matched;
                                    ^ HERE
Aborted.
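
The length check behind this error can be sketched in Python. This is only an illustration of the idea, not hfst-twolc's implementation: the function name `instantiate_matched` is hypothetical, and the equal-length list ( k p t ) is an assumed repair of the faulty rule.

```python
def instantiate_matched(variables):
    """Pair up the i-th values of every variable, as 'matched' requires."""
    lengths = {name: len(values) for name, values in variables.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"can't be matched, lists have unequal lengths: {lengths}")
    names = list(variables)
    return [dict(zip(names, combo)) for combo in zip(*variables.values())]

# Equal-length lists: one rule instance per position.
print(instantiate_matched({"Cx": ["K", "P", "T"], "Cy": ["k", "p", "t"]}))
# [{'Cx': 'K', 'Cy': 'k'}, {'Cx': 'P', 'Cy': 'p'}, {'Cx': 'T', 'Cy': 't'}]

# The lists from the rule above differ in length, so matching fails.
try:
    instantiate_matched({"Cx": ["K", "P", "T"], "Cy": ["k", "p"]})
except ValueError as err:
    print("ERROR:", err)
```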
 

Warnings

Changed:
<
<
The following is an example of a wring given by hfst-twolc
>
>
The following is an example of a warning given by hfst-twolc
 
WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
Line: 342 to 437
 

Resolution of Conflicts between the Rules

Changed:
<
<
A pair-string is accepted by a two-level grammar, iff it is accepted by each of the rues in the grammar. Hence there may be strings, that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases, where it has shown to be beneficial for the overall quality of the grammar to make some automatical modifications to the rules. These so called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.
>
>
A pair-string is accepted by a two-level grammar, iff it is accepted by each of the rules in the grammar. Hence there may be strings, that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases, where it has shown to be beneficial for the overall quality of the grammar to make some automatical modifications to the rules. These so called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.
  A situation where one rule accepts a pair-string and another rejects it shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. Normal rule-interaction constrains the surface-realizations of some input-form, but does not lose all of them. In contrast to this, rule-conflicts often filter away some input-forms completely. There are many kinds of conflicts, but for the time being only right-arrow conflicts and left-arrow conflicts are automatically resolved by hfst-twolc.

Revision 20 2008-09-22 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 118 to 118
  Warning, important Pair-constructions like [:a may cause some problems. Now [ :a is preferable.

Changed:
<
<
By concatenating pairs, one can build longer regular expressions matching pairs of strings. If the alphabet is declared
>
>
By concatenating pairs, one can build longer regular expressions matching strings of pairs. If the alphabet is declared
 
Alphabet
a N:n N:m e
Changed:
<
<
then the regular expression a N: e will match the pairs of strings a N:n e and a N:m e.
>
>
then the regular expression a N: e will match a N:n e and a N:m e.
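
How an underspecified pair such as N: is expanded against the declared alphabet can be sketched in Python. The function name `expand_pair` is hypothetical, and the sketch ignores quoted special characters inside symbols.

```python
def expand_pair(pattern, alphabet):
    """Return the declared pairs matched by 'N:', ':n', 'a' or 'N:m'."""
    if ":" not in pattern:
        pattern = f"{pattern}:{pattern}"  # 'a' abbreviates 'a:a'
    inp, out = pattern.split(":")
    matches = []
    for pair in alphabet:
        p_in, p_out = pair.split(":") if ":" in pair else (pair, pair)
        # An empty side of the pattern matches any character on that side.
        if inp in ("", p_in) and out in ("", p_out):
            matches.append(pair)
    return matches

alphabet = ["a", "N:n", "N:m", "e"]
print([expand_pair(p, alphabet) for p in ["a", "N:", "e"]])
# [['a'], ['N:n', 'N:m'], ['e']]
```

Taking one pair from each position gives exactly the two strings a N:n e and a N:m e.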
 
Changed:
<
<
Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, matches the empty string, as well.
>
>
Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, always matches the empty string, as well.
  Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.
Changed:
<
<
There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be [ a N:n N:m o] and let R denote a regular expression. The unary operators are:
>
>
There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be a N:n N:m o and let R denote a regular expression. The unary operators are:
 
Changed:
<
<
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, with the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the mathcing strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* with the Alphabet defined above.
>
>
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, using the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the matching strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* using the Alphabet defined above.
 
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular-expression R contains all strings not matched by R.
Changed:
<
<
  • The Kleene-star *. The language R* matches any string, that is a concatenation of any number of string from R. Note that the empty string, which is the concatenation of zero strings also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator resembles the *, but it only matches strings, which are concatenation of a positive number of strings from R. Consequently R+ matches the empty string, iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.
>
>
  • The Kleene-star *. The language R* matches any string which is the concatenation of any number of strings from R. Note that the empty string, which is the concatenation of zero strings, is also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator + resembles *, but only matches strings which are concatenations of a positive number of strings from R. Consequently R+ matches the empty string iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.
 
Changed:
<
<
In addition to the unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, when using the disjunction-operation |, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.
>
>
In addition to unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.
  Let R and S be regular expressions. The binary operators are:
  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
Line: 150 to 150
 

Operator Precedence

Changed:
<
<
The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.
>
>
The operators in htwolc have different precedence. A rule of thumb for precedence: unary operators bind strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.
  Operators ordered by precedence from strongest to weakest:
Line: 165 to 165
 

The Alphabet

Changed:
<
<
The first part specifies the alphabet of the rules. The alphabet consists of pairs consisting of a input-character and a output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.
>
>
The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a. The alphabet is one statement so it is terminated by a semi-colon.
 
Changed:
<
<
Every pair of character referred to in some of the rules, has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs, that aren't declared in the alphabet.
>
>
Every pair of characters referred to in any of the rules has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs which aren't declared in the alphabet.
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Changed:
<
<

The Sets

>
>
An example of an alphabet is
Alphabet

! The alphabet should contain all pairs used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;

Rule-variables

Rules may contain variables. Any variable used should be declared in the Rule-variables section. If this isn't done, there will be warning-messages, but the grammar will still compile correctly.
 
Changed:
<
<
The second part of the grammar specifies named character-ranges like
>
>
An example
Rule-variables

        ! All rule-variables, that are used, should be declared.
        ! If this isn't done, annoying warning-messages will be
        ! issued (otherwise the grammar is constructed as it
        ! should be)

        Cx Cy Cz Vx Vy ;

Sets

The second part of the grammar specifies named character-sets like

 
Vowel  = a e i o u y å ä ö ;
Changed:
<
<
Sets may be used in rules as a short-hand for collections of character-pairs. Perhaps one might want write a rule, which states, that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by a rule
>
>
Sets may be used in rules as a short-hand for collections of character-pairs.

Perhaps one might want to write a rule, which states, that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes, which are realised as vowels. This could be accomplished by the rule

 
t:ө <= :Vowel _ :Vowel ;
Changed:
<
<

The Definitions

>
>

Definitions

  The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.
ClosedSyllable = Vowel+ [ ~Vowel ]+ ;
Changed:
<
<
The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.
>
>
The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. All sets may be used in definitions and all definitions, which have been made before a particular definition, may be used as a part of that definition.
 
Changed:
<
<

The Rules

>
>
It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.
 
Changed:
<
<

Ordinary Two-Level Rules

>
>

Rules

Two-level rules consist of a center, a rule-operator and contexts.

 
Changed:
<
<
Two-level rules consist of a center, a rule-operator and contexts. The center is a pair of characters, and a context consists of two regular expressions separated by an underscore. Schematically
>
>
The center-language (C) is a character-pair (e.g. a:b), a more general pair-construct of a single character (e.g. a: or :a) or set of character-pairs and character-pair-constructs (e.g. [a:b | b: | c:d ]). A context consists of two regular expressions (Li and Ri) separated by an underscore. Schematically
 
Changed:
<
<
a:b OP L1 _ R1 ;
>
>
C OP L1 _ R1 ;
  L2 _ R2 ; ...
Changed:
<
<
Ln _ Rn ;
>
>
Ln _ Rn ;
 A rule has to have at least one context and it may have as many as are needed.
Changed:
<
<
The rules are constraints, regulating the distribution of the center-pairs according to the rule-operator and contexts given. Four different kinds of rules-operators may be used in hfst-twolc
<=, =>, <=> and \<=
>
>
A rule with variables is a rule where some of the characters in character-pairs are variables, not actual alphabetical characters. A rule with variables has to have an additional so-called where-part, which shows how the variables in the rule should be instantiated.

Ordinary Two-Level Rules

Two-level rules are constraints, regulating the distribution of the pairs in their center-language according to the rule-operator and contexts given. Four different kinds of rules-operators may be used in hfst-twolc

<=, =>, <=> and /<=
 The final context, which is compiled into the transducer representing the two-level rule is the union of the contexts given.

Right-arrow rules constrain the distribution of a symbol-pair by specifying, that it may only occur in a specific context (or some specific contexts). Let the set V be the set of vowels in some language. An example of a right-arrow rule is

I:j => :V _ :V ;
Changed:
<
<
It states, that the input-character I can be realised as j only in a contex, where it is surrounded by output vowels, i.e. that the occurrence of the pair I:j is limited to positions between surface-vowels.
>
>
It states that the input-character I can be realised as j only in a context where it is surrounded by output vowels. The rule doesn't constrain the distribution of any other pairs I:X, nor does it constrain the distribution of pairs X:j, where X is something other than I. It simply states that if the pair I:j occurs, it has to occur between two output vowels.
  The context :V _ :V in the example is automatically extended to a so called total context, by hfst-twolc. This means that, when the rule is compiled, the context will become ?* :V _ :V ?*. This applies to all kinds of rule-operators.
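
The constraint expressed by I:j => :V _ :V can be sketched as a Python check over a pair-string. The vowel set and the function names are assumptions for illustration; hfst-twolc compiles the rule into a transducer instead. Only the immediate neighbours of I:j are inspected and everything further away is unrestricted, which mirrors the total context ?* :V _ :V ?*.

```python
V = set("aeiouyåäö")  # assumed vowel set

def to_pairs(text):
    """'t a I:j a' -> [('t','t'), ('a','a'), ('I','j'), ('a','a')]"""
    return [tuple(p.split(":")) if ":" in p else (p, p) for p in text.split()]

def right_arrow_ok(text, center=("I", "j")):
    """Every occurrence of the center pair must sit between two output vowels."""
    pairs = to_pairs(text)
    for i, pair in enumerate(pairs):
        if pair != center:
            continue  # other pairs, e.g. I:i, are not constrained by the rule
        left_ok = i > 0 and pairs[i - 1][1] in V
        right_ok = i + 1 < len(pairs) and pairs[i + 1][1] in V
        if not (left_ok and right_ok):
            return False
    return True

print(right_arrow_ok("t a I:j a"))  # True
print(right_arrow_ok("t a I:j"))    # False
print(right_arrow_ok("a I:i a"))    # True
```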
Line: 222 to 250
 
N:m <= _ p: ;
Changed:
<
<
It states, that an input-character N has to be realized as the output-character m if it is followed by some pair with input-character p.
>
>
It states that an input-character N has to be realized as the output-character m if it is followed by some pair with input-character p. The rule doesn't constrain the realizations of the input-character N in any context other than the one specified, so it never disallows any occurrences of the pair N:m. It does disallow all other pairs N:X in the context _ p:, though.
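
A Python sketch of the constraint N:m <= _ p: may make the asymmetry concrete. The function names are hypothetical, and the check is a simplification of what the compiled rule-transducer accepts.

```python
def to_pairs(text):
    """'k a N:m p a' -> list of (input, output) tuples; 'a' abbreviates 'a:a'."""
    return [tuple(p.split(":")) if ":" in p else (p, p) for p in text.split()]

def left_arrow_ok(text):
    """An input N directly before a pair with input p must be realized as m."""
    pairs = to_pairs(text)
    for current, following in zip(pairs, pairs[1:]):
        if current[0] == "N" and following[0] == "p" and current[1] != "m":
            return False
    return True

print(left_arrow_ok("k a N:m p a"))  # True
print(left_arrow_ok("k a N:n p a"))  # False
print(left_arrow_ok("k a N:m t a"))  # True
print(left_arrow_ok("k a N:n t a"))  # True
```

The third call shows that N:m itself is never disallowed, and the fourth that other realizations of N remain free outside the context _ p:.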
 
Changed:
<
<
Left-arrow rules differ from right-arrow rules, because they are asymmetric with regard to the input- and output-level of pair-strings. The right-arrow example above, doesn't limit the input-character of a pair preceding p:, it only limits the output-character, if the input-character is N. Such an asymmetry is not present in left-arrow rules, which limit a particular pair into a particular kind of context.
>
>
Left-arrow rules differ from right-arrow rules, because they are asymmetric with regard to the input- and output-level of pair-strings. The left-arrow example above doesn't limit the input-character of a pair preceding p:, it only limits the output-character, if the input-character is N. Such an asymmetry is not present in right-arrow rules, which limit a particular pair to a particular kind of context.
  Left/right-arrow rules give necessary and sufficient conditions for the realization of an input-character as some output-character. An example of a left/right-arrow rule is
Changed:
<
<
k:' <=> :a :a _ :a ;
>
>
K:' <=> :Vowel :a _ :a ClosedOffSet;
 
Changed:
<
<
which states, that k:' is realized as ' exactly in contexts where two output a phones precede and one follows (this describes a convention of Finnish orthography stemming from consonant degradation). Any left/right arrow rule is equivalent to the joined effect of the corresponding left- and right-arrow rules. Hence the example is equivalent to the pair of rules
>
>
which states that the morpho-phoneme K is realized as ' exactly in contexts where a vowel and an output a precede and an output a and a closed syllable-offset follow (this describes a convention of Finnish orthography stemming from consonant gradation). Any left/right-arrow rule is equivalent to the joint effect of the corresponding left- and right-arrow rules. Hence the example is equivalent to the pair of rules
 
Changed:
<
<
k:' <= :a :a _ :a ;
>
>
K:' <= :Vowel :a _ :a ClosedOffSet;
  and
Changed:
<
<
k:' => :a :a _ :a ;
>
>
K:' => :Vowel :a _ :a ClosedOffSet;
 
Added:
>
>
Actually, the alternation K:' isn't constrained to contexts where two a's precede. It happens between any two like vowels. To describe this nicely without using five very similar rules, one needs rule-variables, which will be presented shortly.
  Prohibition rules disallow the realization of an input-character as some output-character in some contexts. Again, let V denote the set of vowels. An example of a prohibition rule is
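The equivalence between a left/right-arrow rule and the corresponding pair of rules can be sketched in Python. The sketch below is illustrative only: it uses the simplified context :a _ :a (an assumption standing in for the fuller context of the rule in the text), and all function names are hypothetical.

```python
# Pair-strings are lists of (input, output) tuples.

def in_context(pairs, i):
    """Simplified context  :a _ :a  (output a on both sides).
    Assumption: stands in for the fuller context used in the text."""
    return (0 < i < len(pairs) - 1
            and pairs[i - 1][1] == "a"
            and pairs[i + 1][1] == "a")

def right_arrow_ok(pairs):
    # K:' => ctx : the pair K:' may occur only in the context.
    return all(in_context(pairs, i)
               for i, p in enumerate(pairs) if p == ("K", "'"))

def left_arrow_ok(pairs):
    # K:' <= ctx : in the context, input K must be realized as '.
    return all(out == "'"
               for i, (inp, out) in enumerate(pairs)
               if inp == "K" and in_context(pairs, i))

def double_arrow_ok(pairs):
    # K:' <=> ctx : both conditions must hold.
    return right_arrow_ok(pairs) and left_arrow_ok(pairs)

print(double_arrow_ok([("a", "a"), ("K", "'"), ("a", "a")]))  # in context, realized as '
print(double_arrow_ok([("a", "a"), ("K", "k"), ("a", "a")]))  # breaks the left-arrow half
print(double_arrow_ok([("i", "i"), ("K", "'"), ("a", "a")]))  # breaks the right-arrow half
```

The second and third calls each fail for a different reason, which is why both halves are needed to state the double-arrow rule.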
Changed:
<
<
I:i \<= :V _ :V ;
>
>
I:i /<= :V _ :V ;
  which states that the input-character I may not be realized as i between output-vowels.

Like right-arrow rules, prohibition rules are symmetric with respect to the input- and output-level of pair-strings. In fact, it is often possible to state a particular constraint both as a prohibition rule concerning some pair and as a left-arrow rule concerning another. If the input-character I may only be realized as i or j, then the rules

Changed:
<
<
I:i \<= :V _ :V ;
>
>
I:i /<= :V _ :V ;
  and
Line: 255 to 284
  state exactly the same constraint. Still, if the number of realizations is greater, it may be much easier to state the constraint using one of the operators than the other.
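The claimed equivalence can be checked by brute force on a toy alphabet. The following Python sketch is illustrative only. The second rule of the pair is not shown in this excerpt, so the sketch assumes it is I:j <= :V _ :V ; the alphabet and the vowel set are likewise assumptions.

```python
from itertools import product

# Toy alphabet (assumption): I surfaces only as i or j.
ALPHABET = [("a", "a"), ("I", "i"), ("I", "j")]
V = {"a", "i"}  # assumed set of output vowels

def between_vowels(s, k):
    """True if position k has an output vowel on both sides."""
    return 0 < k < len(s) - 1 and s[k - 1][1] in V and s[k + 1][1] in V

def prohibition_ok(s):
    # I:i /<= :V _ :V  -- I:i must not occur between output vowels.
    return not any(p == ("I", "i") and between_vowels(s, k)
                   for k, p in enumerate(s))

def left_arrow_ok(s):
    # I:j <= :V _ :V  (assumed rule) -- between output vowels,
    # input I must surface as j.
    return all(out == "j" for k, (inp, out) in enumerate(s)
               if inp == "I" and between_vowels(s, k))

# Compare the two rules on every pair-string of length 1 to 4.
same = all(prohibition_ok(s) == left_arrow_ok(s)
           for n in range(1, 5)
           for s in product(ALPHABET, repeat=n))
print(same)
```

If a third realization such as I:k were added to the alphabet, the two checks would no longer agree, which matches the remark that the choice of operator matters when the number of realizations is greater.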
Changed:
<
<

Generalized Context-Restrictions

>
>

Rules with variables

 
Changed:
<
<

Special Rule-Constructs

Error-Messages and Warnings

Warnings

The following is an example of a wring given by hfst-twolc

>
>
As an easy shorthand for defining a (possibly large) set of similar two-level rules, rule-variables have been included in hfst-twolc. Consider the following rule, which is needed for the gradation of stops in Finnish:
 
Changed:
<
<
WARNING! LINE 7: [1] The pair a:b wasn't declared in the alphabet! a:b <= c _ ; ^ HERE The program attempts to report the number of the line, which gives the warning and also point to the place, which gives the warning. Note, that the place and line given may not be accurate. When they're not, the problem is often on the previous line.
>
>
"Gradation of k to '"
K:' <=> Vowel Vx _ Vx ClosedOffset ;
        where Vx in Vowel ;
It deals with the realization of the morpho-phoneme K when it is the onset of a closed syllable that is preceded by an open syllable with a two-vowel nucleus. The rule states that K is realized as ' (a glottal stop) if the nucleus of the preceding syllable ends with the same vowel that figures as the nucleus of the closed syllable.
 
Changed:
<
<
The number [1] means, that this is a warning of type 1. There are 6 types of warnings. These are
>
>
The rule above couldn't be stated as a single rule without variables, since there are no other mechanisms for specifying dependencies between parts of the contexts of two-level rules. The use of the variable Vx is said to match the occurrences of the set Vowel.
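One way to picture a rule with a variable is as a family of subrules, one per value of the variable. The Python sketch below only produces the text of such subrules by plain string substitution; it is an illustration, not the mechanism used by hfst-twolc, and it says nothing about how the compiler combines the subrules. The helper name expand is hypothetical, and the contents of the Vowel set are taken from a rule quoted elsewhere on this page.

```python
# Vowel set as listed elsewhere on this page: a e i o u y ä ö
VOWEL = ["a", "e", "i", "o", "u", "y", "ä", "ö"]

def expand(template, variable, values):
    """Substitute each value for the variable, giving one subrule per value.
    Crude text replacement; assumes the variable name occurs nowhere else."""
    return [template.replace(variable, v) for v in values]

rules = expand("K:' <=> Vowel Vx _ Vx ClosedOffset ;", "Vx", VOWEL)
for r in rules:
    print(r)
```

Both occurrences of Vx receive the same value in every subrule, which is exactly the dependency between the two sides of the context that cannot be expressed without variables.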
 
Changed:
<
<
[1] E.g. the grammar
>
>
It is possible to match occurrences of variables from different sets as well. Consider the following rule, which also deals with the gradation of stops in Finnish:
 
Changed:
<
<
Alphabet a b c ;
>
>
"Geminate gradation"
Cx:0 <=> :Cy _ ClosedCoda ;
        where Cx in ( K P T )
              Cy in ( k p t )
        matched ;
The rule states that the morpho-phonemes K, P and T vanish when they serve as the onset of a closed syllable and are preceded by a surface k, p or t, respectively. Here the occurrences of the variable Cx are matched with those of Cy. For instance, nothing is said about an input K preceded by an output p. The rule is only concerned with input-level characters K preceded by output-level characters k, and likewise for P with p and T with t.
 
Changed:
<
<
Rules
>
>
Occurrences of variables are matched by default. If you don't want this to happen, you may either use several where parts to govern different variables or replace the keyword matched with freely.
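The difference between matched and freely can be pictured as the difference between pairing values position by position and taking every combination. The Python sketch below is an illustration under that reading; the function names are hypothetical and the generated strings are only the text of the subrules, not compiled rules.

```python
from itertools import product

CX = ["K", "P", "T"]  # values of Cx from the rule above
CY = ["k", "p", "t"]  # values of Cy from the rule above

def matched(xs, ys):
    """Pair the i-th value of one variable with the i-th value of the other."""
    return list(zip(xs, ys))

def freely(xs, ys):
    """Combine every value of one variable with every value of the other."""
    return list(product(xs, ys))

def instantiate(x, y):
    # Text of one subrule of "Geminate gradation".
    return f"{x}:0 <=> :{y} _ ClosedCoda ;"

matched_rules = [instantiate(x, y) for x, y in matched(CX, CY)]
free_rules = [instantiate(x, y) for x, y in freely(CX, CY)]

print(len(matched_rules), len(free_rules))
print(matched_rules[0])
```

With matched there are three subrules (K with k, P with p, T with t); with freely there are nine, including combinations such as K preceded by an output p.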
 
Changed:
<
<
"R1" a:b <= c _ ; gives a type 1 warning
WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
The pair a:b was used, but not declared in the alphabet. Since only pairs, that have been declared in the alphabet match ?, failing to declare a:b would probably prohibit all pair-strings containing a:b, if the grammar has more than one rule.
>
>

Generalized Context-Restrictions (NOT IMPLEMENTED YET)

 
Changed:
<
<
[2] (PARTIALLY UNIMPLEMENTED) E.g. the grammar
Alphabet
        a b c a:b ;
>
>
Generalized context-restrictions allow the definition of rules with a more general center-language than normal two-level rules. They also let the user constrain the application of a particular rule to some contexts.
 
Changed:
<
<
Sets

A = a ; A = b ;

>
>

Weighted rules (NOT IMPLEMENTED YET)

 
Changed:
<
<
Rules
>
>
It may become possible to add weights to rules, which determine the relative importance of a rule in a conflict-situation.
 
Changed:
<
<
"R1" a:b <= c _ ; gives a type 2 warning
WARNING! LINE 9:
[2] You are redefining the set A

^ HERE
Here the set A has been defined twice. This is not fully functional yet! It will warn, if the user attempts to define a set with the same name as an alphabet-symbol. The later definition will be used, when compiling rules.
>
>

Error-Messages and Warnings

 
Changed:
<
<
The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.
>
>

Warnings

 
Changed:
<
<
[3] (PARTIALLY UNIMPLEMENTED) The grammar
>
>
The following is an example of a warning given by hfst-twolc
 
Changed:
<
<
Alphabet a b c a:b ;

Sets

A = a ;

>
>
WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
  ^ HERE
The program attempts to report the number of the line that gives the warning and also to point to the place that gives the warning. Note that the place and line given may not be accurate. When they're not, the problem is often on the previous line.
 
Changed:
<
<
Definitions
>
>
The number [1] means that this is a warning of type 1. There are seven types of warnings. These are
 
Changed:
<
<
A = a* ;
>
>
[1] A pair X:Y was used in the grammar, but it wasn't declared in the alphabet and neither X, nor Y was the name of a set.
 
Changed:
<
<
Rules
>
>
[2] The same set is defined twice, or a set is defined, which has the same name as a symbol in the alphabet.
 
Changed:
<
<
"R1" a:b <= c _ ; gives the type 3 warning
WARNING! LINE 12:
[3] You are redefining the expression or set A
>
>
[3] The same definition is declared twice, or a definition has the same name as a set or a symbol in the alphabet.
 
Changed:
<
<
^ HEREThe symbol A is declared once as a set and another time as a definition (a warning is also issued, if there are two declarations of the same definition). This is not fully functional yet! It will warn, if the user attempts to define a set with the same name as an alphabet-symbol. The later definition will be used, when compiling rules.
>
>
[4] The same rule-name is used for two rules.
 
Changed:
<
<
The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.
>
>
[5] A construct X: or :X was used, where X is a symbol or a set. The expression didn't match a single pair in the alphabet.
 
Changed:
<
<
[4] Warnings of type 4, won't be given for now, since the compilation of rules has changed significantly, and this warning hasn't been reimplemented yet. Type 4 warnings warn about defining the same rule twice.
>
>
[6] The construction R^i was used, where i was not a positive integer.
 
Changed:
<
<
[7] Warning for a pair x:y where x is a diacritic and y is non-zero. Diacritics are always realised as aero, so y will be discarded.
>
>
[7] Warning for a pair x:y where x is a diacritic and y is non-zero. Diacritics are always realised as zero, so y will be discarded.
 

Resolution of Conflicts between the Rules

Line: 431 to 428
 Alphabet Definitions Rules Sets ! ; ? : _ | >     <
Changed:
<
<
<=> \<= [ ]
>
>
<=> /<= [ ]
 ( ) * + $ $. ~ < > - " = 0 ^ #
Added:
>
>
%
 
Changed:
<
<
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.
>
>
The words and constructs may be used in rules by quoting them with %. E.g. %? means a question-mark, not any character-pair defined in the alphabet, and %Sets is an ordinary name Sets, not a declaration that definitions of sets will follow. In the previous example, %Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.
 

A Test-Tool for Grammars

Revision 192008-09-22 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 28 to 28
 

Getting the Program

Changed:
<
<
There is now a binary of the program available, as well as the source-code, on the Source code page on the hfst webpage (on the webpage of the Department of General Linguistics in Helsinki University).
>
>
There is now a binary of the program available (for x86-architecture), as well as the source-code, on the Source code page on the hfst webpage (on the webpage of the Department of General Linguistics in Helsinki University).
  The program may be downloaded from the CVS-repository on Corpus. The path of the CVS-repository is /c/appl/ling/koskenni/cvsrepo/. Currently hfst-twolc is in the directory htwolc, which contains the files
Changed:
<
<
-r--r--r-- 1 silfverb kikosken 11231 26. elo    10:28 commandline.h,v -rw-r--r-- 1 silfverb kikosken 0 10. heinä  18:39 #cvs.wfl.corpus3.csc.fi.32192 -r--r--r-- 1 tpirinen omorf 8444 8. elo    14:21 finnish.twol,v -r--r--r-- 1 silfverb kikosken 48243 26. elo    10:28 htwolc.yy,v -r--r--r-- 1 silfverb kikosken 3998 26. elo    10:28 Makefile,v
>
>
/c/appl/ling/koskenni/cvsrepo/htwolc: yhteensä 240 drwxrwxr-x 2 silfverb kikosken 4096 22. syys   12:39 Attic -r--r--r-- 1 silfverb kikosken 11564 17. syys   15:16 commandline.h,v -r--r--r-- 1 silfverb kikosken 63833 17. syys   15:16 htwolc.yy,v -r--r--r-- 1 silfverb kikosken 5430 10. syys   17:42 Makefile,v
 -r--r--r-- 1 silfverb kikosken 9299 25. heinä  12:41 muutokset,v
Changed:
<
<
-r--r--r-- 1 silfverb kikosken 43178 26. elo    10:28 operations.C,v -r--r--r-- 1 silfverb kikosken 26025 26. elo    10:28 operations.h,v
>
>
-r--r--r-- 1 silfverb kikosken 55475 17. syys   15:16 operations.C,v -r--r--r-- 1 silfverb kikosken 31636 10. syys   17:42 operations.h,v
 -r--r--r-- 1 silfverb kikosken 611 26. elo    10:28 README,v
Changed:
<
<
-r--r--r-- 1 silfverb kikosken 1850 4. heinä  17:34 test_file,v -r--r--r-- 1 silfverb kikosken 15975 26. elo    10:28 tokenizer.ll,v
>
>
drwxrwxr-x 3 silfverb kikosken 4096 22. syys   12:47 test -r--r--r-- 1 silfverb kikosken 18502 17. syys   15:16 tokenizer.ll,v -r--r--r-- 1 silfverb kikosken 15997 22. syys   12:39 tutorial-fragment.rst,v

/c/appl/ling/koskenni/cvsrepo/htwolc/test: yhteensä 8 drwxrwxr-x 2 silfverb kikosken 4096 22. syys   12:47 Attic -r--r--r-- 1 silfverb kikosken 3402 22. syys   12:47 gradation-rules.twol,v

 

Dependencies

Line: 106 to 113
 
  • a:? and a: match any pair in the alphabet having input-character a.
  • ?:a and :a match any pair in the alphabet having output-character a.
  • ? matches any pair in the alphabet.
Changed:
<
<
  • ?:? same as ?.
>
>
  • ?:? same as ?. You may also use : surrounded by white-space.
 
  • 0 matches the empty string.
Added:
>
>
Warning: pair-constructions like [:a may cause some problems. For now, [ :a is preferable.

 By concatenating pairs, one can build longer regular expressions matching pairs of strings. If the alphabet is declared
Alphabet
Line: 116 to 125
  then the regular expression a N: e will match the pairs of strings a N:n e and a N:m e.
Changed:
<
<
Regular expressions can be grouped together using the parenthesis-constructions [ ... ] ans ( ... ). If R is a regular expression, then [ R ] matches exactly the same pairs of string as R does. The construction ( R ), on the other hand, matches the empty string, as well.
>
>
Regular expressions may be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same strings of pairs as R does. The construction ( R ), on the other hand, matches the empty string, as well.
  Grouping becomes important when one uses unary regular expression operators. Unary operators like * have higher precedence than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b, one has to group the expressions a and b together, i.e. [ a b ]*.
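The same precedence convention holds in most regular expression dialects, so it can be tried out with Python's re module. This is only an analogy: Python's regular expressions work on plain strings rather than pair-strings, and they use (?: ... ) for grouping where twolc uses [ ... ].

```python
import re

# a b*  : the star applies to b only, i.e. [ a ] [ b * ]
star_on_b = re.compile(r"ab*")
# [ a b ]*  : the star applies to the whole group
star_on_group = re.compile(r"(?:ab)*")

print(bool(star_on_b.fullmatch("abbb")))      # a followed by several b
print(bool(star_on_b.fullmatch("abab")))      # not matched: star does not cover a
print(bool(star_on_group.fullmatch("abab")))  # matched: ab repeated twice
print(bool(star_on_group.fullmatch("")))      # zero repetitions are allowed
```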

Revision 182008-09-05 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 437 to 437
  This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.
Deleted:
<
<
  • Diacritics.
 

Partial implementations

Since this is an alpha-version of hfst-twolc, there are many features, that have limited functionality.

Changed:
<
<
The where ... ( matched | freely | mixed ) construction is implemented, but is partial in many respects. There is no support for using set-names in the =where=-part of the rule. Hence one can't write a rule
t:d <= Vx _ Vx; where Vx in Vowels;
where Vowels is the set a e i o u y ä ö to say that t becomes d between like vowels. Instead one has to write
>
>
The where ... ( matched | freely | mixed ) construction is implemented, but is partial in some respects. You can either write a rule with a variable Vx
 
Changed:
<
<
t:d <= Vx _ Vx; where Vx in (a e i o u y ä ö);
>
>
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in Vowel ;
or write
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in ( a e i o u y ä ö ) ;
but you can't embed the Vowel set in the range, i.e. rules like
"Gradation of k to '"
%^K:' <=> Vowel Vx _ Vx ClosedOffset ;
             where Vx in (Vowel) ;
don't work.
  There is no support for either the freely or mixed options. E.g.

Revision 172008-09-05 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 264 to 264
  The number [1] means, that this is a warning of type 1. There are 6 types of warnings. These are
Changed:
<
<
  1. E.g. the grammar
>
>
[1] E.g. the grammar
 
Alphabet
        a b c ;
Line: 279 to 279
 [1] The pair a:b wasn't declared in the alphabet! a:b <= c _ ; The pair a:b was used, but not declared in the alphabet. Since only pairs, that have been declared in the alphabet match ?, failing to declare a:b would probably prohibit all pair-strings containing a:b, if the grammar has more than one rule.
Changed:
<
<
  1. (PARTIALLY UNIMPLEMENTED) E.g. the grammar
>
>
[2] (PARTIALLY UNIMPLEMENTED) E.g. the grammar
 
Alphabet
        a b c a:b ;
Line: 303 to 304
  The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.
Changed:
<
<
  1. (PARTIALLY UNIMPLEMENTED) The grammar
>
>
[3] (PARTIALLY UNIMPLEMENTED) The grammar
 
Alphabet
        a b c a:b ;
Line: 329 to 330
  The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.
Changed:
<
<
  1. Warnings of type 4, won't be given for now, since the compilation of rules has changed significantly, and this warning hasn't been reimplemented yet. Type 4 warnings warn about defining the same rule twice.
>
>
[4] Warnings of type 4, won't be given for now, since the compilation of rules has changed significantly, and this warning hasn't been reimplemented yet. Type 4 warnings warn about defining the same rule twice.

[7] Warning for a pair x:y where x is a diacritic and y is non-zero. Diacritics are always realised as aero, so y will be discarded.

 
Deleted:
<
<
  1. The grammar
 

Resolution of Conflicts between the Rules

A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These so-called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.

Revision 162008-08-29 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 469 to 469
 s:v => a _ ; t:u => a _ ;
Changed:
<
<
Rules defined with variables, may easily com into conflict with eachother. For now this is treated as any other rule-conflict. Consider the rule
>
>
Rules defined with variables may easily come into conflict with each other. For now this is treated as any other rule-conflict. Consider the rule
 
x:y => A _ A ; where A in (s t);
The subcases
Line: 480 to 480
  Conflict-resolution may be very slow.
Added:
>
>
Substitution of values for variables may produce new pairs which haven't been declared in the alphabet. For now hfst-twolc can only warn about such new pairs occurring on the left side of the rule-operator.
 The OpenFst-implementation may be very slow.

Differences from Xerox twolc

Revision 152008-08-26 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 32 to 32
  The program may be downloaded from the CVS-repository on Corpus. The path of the CVS-repository is /c/appl/ling/koskenni/cvsrepo/. Currently hfst-twolc is in the directory htwolc, which contains the files
Changed:
<
<
-rw-r--r-- 1 silfverb kikosken 6923 16. heinä 17:40 commandline.h drwxr-xr-x 2 silfverb kikosken 1024 18. heinä 18:31 CVS -rw-r--r-- 1 silfverb kikosken 11239 16. heinä 17:40 htwolc.yy -rw-r--r-- 1 silfverb kikosken 849 13. kesä 12:43 Makefile -rw-r--r-- 1 silfverb kikosken 5888 11. heinä 14:58 muutokset -rw-r--r-- 1 silfverb kikosken 25506 16. heinä 17:40 operations.C -rw-r--r-- 1 silfverb kikosken 13145 16. heinä 17:40 operations.h -rw-r--r-- 1 silfverb kikosken 45 6. touko 23:42 README -rw-r--r-- 1 silfverb kikosken 443 4. heinä 17:34 test_file -rw-r--r-- 1 silfverb kikosken 6955 10. heinä 18:41 tokenizer.ll
>
>
-r--r--r-- 1 silfverb kikosken 11231 26. elo    10:28 commandline.h,v -rw-r--r-- 1 silfverb kikosken 0 10. heinä  18:39 #cvs.wfl.corpus3.csc.fi.32192 -r--r--r-- 1 tpirinen omorf 8444 8. elo    14:21 finnish.twol,v -r--r--r-- 1 silfverb kikosken 48243 26. elo    10:28 htwolc.yy,v -r--r--r-- 1 silfverb kikosken 3998 26. elo    10:28 Makefile,v -r--r--r-- 1 silfverb kikosken 9299 25. heinä  12:41 muutokset,v -r--r--r-- 1 silfverb kikosken 43178 26. elo    10:28 operations.C,v -r--r--r-- 1 silfverb kikosken 26025 26. elo    10:28 operations.h,v -r--r--r-- 1 silfverb kikosken 611 26. elo    10:28 README,v -r--r--r-- 1 silfverb kikosken 1850 4. heinä  17:34 test_file,v -r--r--r-- 1 silfverb kikosken 15975 26. elo    10:28 tokenizer.ll,v
 

Dependencies

Revision 142008-08-07 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 263 to 263
  The number [1] means, that this is a warning of type 1. There are 6 types of warnings. These are
Changed:
<
<
  1. Type 1. E.g. the grammar
>
>
  1. E.g. the grammar
 
Alphabet
        a b c ;
Line: 273 to 272
  "R1" a:b <= c _ ;
Changed:
<
<
Gives the warning
>
>
gives a type 1 warning
 
WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
Added:
>
>
The pair a:b was used, but not declared in the alphabet. Since only pairs, that have been declared in the alphabet match ?, failing to declare a:b would probably prohibit all pair-strings containing a:b, if the grammar has more than one rule.
  1. (PARTIALLY UNIMPLEMENTED) E.g. the grammar
    Alphabet
            a b c a:b ;
    
    Sets
    
          A = a ;
          A = b ;
    
    Rules
    
    "R1"
    a:b <= c _ ;
    gives a type 2 warning
    WARNING! LINE 9:
    [2] You are redefining the set A
    
    ^ HERE
    
    Here the set A has been defined twice. This is not fully functional yet! It will warn, if the user attempts to define a set with the same name as an alphabet-symbol. The later definition will be used, when compiling rules.

The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.

  1. (PARTIALLY UNIMPLEMENTED) The grammar
    Alphabet
            a b c a:b ;
    
    Sets
    
          A = a ;
    
    Definitions
    
          A = a* ;
    
    Rules
    
    "R1"
    a:b <= c _ ;
    gives the type 3 warning
    WARNING! LINE 12:
    [3] You are redefining the expression or set A
    
    ^ HERE
    The symbol A is declared once as a set and another time as a definition (a warning is also issued, if there are two declarations of the same definition). This is not fully functional yet! It will warn, if the user attempts to define a set with the same name as an alphabet-symbol. The later definition will be used, when compiling rules.

The warning-message is given for the line after the actual redefining line. The cause for this is, that the parser of hfst-twolc eats the newlines after the set-declaration before compiling it.

  1. Warnings of type 4, won't be given for now, since the compilation of rules has changed significantly, and this warning hasn't been reimplemented yet. Type 4 warnings warn about defining the same rule twice.
 
Added:
>
>
  1. The grammar
 

Resolution of Conflicts between the Rules

A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These so-called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.

Revision 132008-08-07 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 28 to 28
 

Getting the Program

Added:
>
>
There is now a binary of the program available, as well as the source-code, on the Source code page on the hfst webpage (on the webpage of the Department of General Linguistics in Helsinki University).
 The program may be downloaded from the CVS-repository on Corpus. The path of the CVS-repository is /c/appl/ling/koskenni/cvsrepo/. Currently hfst-twolc is in the directory htwolc, which contains the files
-rw-r--r--  1 silfverb kikosken  6923 16. heinä  17:40 commandline.h
Line: 249 to 251
 

Error-Messages and Warnings

Added:
>
>

Warnings

The following is an example of a warning given by hfst-twolc

WARNING! LINE 7:
[1] The pair a:b wasn't declared in the alphabet!
a:b <= c _ ;
  ^ HERE
The program attempts to report the number of the line, which gives the warning and also point to the place, which gives the warning. Note, that the place and line given may not be accurate. When they're not, the problem is often on the previous line.

The number [1] means, that this is a warning of type 1. There are 6 types of warnings. These are

  1. Type 1. E.g. the grammar
    Alphabet
            a b c ;
    
    Rules
    
    "R1"
    a:b <= c _ ;
    Gives the warning
    WARNING! LINE 7:
    [1] The pair a:b wasn't declared in the alphabet!
    a:b <= c _ ;
 

Resolution of Conflicts between the Rules

A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These so-called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.

Line: 353 to 383
 This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.

  • Diacritics.
Changed:
<
<
  • Rules containing variables. E.g.
>
>

Partial implementations

Since this is an alpha-version of hfst-twolc, there are many features, that have limited functionality.

The where ... ( matched | freely | mixed ) construction is implemented, but is partial in many respects. There is no support for using set-names in the =where=-part of the rule. Hence one can't write a rule

t:d <= Vx _ Vx; where Vx in Vowels;
where Vowels is the set a e i o u y ä ö to say that t becomes d between like vowels. Instead one has to write
t:d <= Vx _ Vx; where Vx in (a e i o u y ä ö);

There is no support for either the freely or mixed options. E.g.

X:Y => a _ ; where X in (s t) Y in (u v);
means the same as
X:Y => a _ ; where X in (s t) Y in (u v) matched;
i.e. is equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ; 
Though there is no support for freely, the option can easily be simulated by writing the rule
 
Changed:
<
<
e:ẽ <= Nx _ Nx ; where Nx in Nasal ; Here Nasal is a set (in the Xerox meaning).
  • where ... ( matched | freely | mixed ) construction.
>
>
X:Y => a _ ; where X in (s t) and Y in (u v); This makes the rule equivalent to the intersection of the rules
s:u => a _ ;
t:v => a _ ;
s:v => a _ ;
t:u => a _ ; 

Rules defined with variables, may easily com into conflict with eachother. For now this is treated as any other rule-conflict. Consider the rule

x:y => A _ A ; where A in (s t);
The subcases
x:y => s _ A ;
x:y => t _ A ;
are in a right-arrow conflict with each other. This is easily solved by conflict-resolution. The case of left-arrow rules is less fortunate. They may easily come into unresolvable conflict with each other when the center involves variables.

Conflict-resolution may be very slow.

The OpenFst-implementation may be very slow.

 

Differences from Xerox twolc

Revision 122008-07-23 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 6 to 6
 

Usage

Changed:
<
<
htwolc [ --lexicon FILE ] --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
>
>
htwolc --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
 

Parameter name Meaning
Deleted:
<
<
lexicon the lexicon file. If omitted, the lexicon is read from STDIN.
 
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
Line: 322 to 321
 

Rules with Different Centers.

Added:
>
>
Consider the rules
a:b => c _ ;
and
a <= c _ ;
These rules together prohibit the occurrence of the pair a:b anywhere, since a has to be realized as a after c, but this is the only position, where a could be realised as b.
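The interaction of the two rules can be confirmed by brute force on a toy alphabet. The Python sketch below is illustrative only: the alphabet (a:a, a:b, c:c) is an assumption, the bare a in the second rule is read as the pair a:a as the text explains, and the checks model only these two rules.

```python
from itertools import product

# Toy alphabet (assumption): a may surface as a or b, c only as c.
ALPHABET = [("a", "a"), ("a", "b"), ("c", "c")]

def after_c(s, k):
    """True if position k is directly preceded by the pair c:c."""
    return k > 0 and s[k - 1] == ("c", "c")

def right_arrow_ok(s):
    # a:b => c _ ;  -- a:b may occur only after c.
    return all(after_c(s, k) for k, p in enumerate(s) if p == ("a", "b"))

def left_arrow_ok(s):
    # a <= c _ ;  -- after c, input a must be realized as a.
    return all(out == "a" for k, (inp, out) in enumerate(s)
               if inp == "a" and after_c(s, k))

# All pair-strings of length 1 to 4 accepted by both rules.
accepted = [s for n in range(1, 5) for s in product(ALPHABET, repeat=n)
            if right_arrow_ok(s) and left_arrow_ok(s)]

# No accepted pair-string contains a:b.
print(any(("a", "b") in s for s in accepted))
```

Each rule on its own still accepts some pair-strings containing a:b or a after c; it is only their intersection that loses the pair a:b entirely.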
 

List of Reserved Words

Revision 112008-07-18 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 27 to 27
 
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b
Added:
>
>

Getting the Program

The program may be downloaded from the CVS-repository on Corpus. The path of the CVS-repository is /c/appl/ling/koskenni/cvsrepo/. Currently hfst-twolc is in the directory htwolc, which contains the files

-rw-r--r--  1 silfverb kikosken  6923 16. heinä  17:40 commandline.h
drwxr-xr-x  2 silfverb kikosken  1024 18. heinä  18:31 CVS
-rw-r--r--  1 silfverb kikosken 11239 16. heinä  17:40 htwolc.yy
-rw-r--r--  1 silfverb kikosken   849 13. kesä   12:43 Makefile
-rw-r--r--  1 silfverb kikosken  5888 11. heinä  14:58 muutokset
-rw-r--r--  1 silfverb kikosken 25506 16. heinä  17:40 operations.C
-rw-r--r--  1 silfverb kikosken 13145 16. heinä  17:40 operations.h
-rw-r--r--  1 silfverb kikosken    45  6. touko  23:42 README
-rw-r--r--  1 silfverb kikosken   443  4. heinä  17:34 test_file
-rw-r--r--  1 silfverb kikosken  6955 10. heinä  18:41 tokenizer.ll

Dependencies

You should have hfst installed.

Installing the Program

If you're working on corpus, make should be sufficient. You may need to modify the variable

HFSTPATH=../hfst/
depending on where you've got hfst installed.
 

Syntax

A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements that end in a ; character and comments that begin with a ! character and span to the end of the line.

Line: 310 to 338
 

A Test-Tool for Grammars

Deleted:
<
<

Getting the Program

Installing the Program

 

Unimplemented Features

Revision 102008-07-17 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 304 to 304
 ( ) * + $ $. ~ < > - "
Changed:
<
<
= 0 ^
>
>
= 0 ^ #
  The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.
Line: 324 to 324
 e:ẽ <= Nx _ Nx ; where Nx in Nasal ; Here Nasal is a set (in the Xerox meaning).
  • where ... ( matched | freely | mixed ) construction.
Deleted:
<
<
  • A default symbol for word-boundaries (not clear, whether this will be implemented or not).
 

Differences from Xerox twolc

Revision 92008-07-17 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

Line: 67 to 67
 
Added:
>
>
The rules in the example grammar are from Karttunen 1992. Many of the examples in this manual are taken either from Karttunen 1992 or Karttunen and Koskenniemi 1987.
 

Regular Expression Syntax

Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

Line: 222 to 224
 

Resolution of Conflicts between the Rules

Changed:
<
<
A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These are the so-called right- and left-arrow conflicts, which are handled in hfst-twolc by the mechanism of conflict-resolution.
>
>
A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These so-called right- and left-arrow conflicts are handled by the mechanism of conflict-resolution in hfst-twolc.
 
Changed:
<
<
A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. There are many kinds of conflicts, but for the time-being only right-arrow conflicts and left-arrow-conflicts are automatically resolved by hfst-twolc.
>
>
A situation where one rule accepts a pair-string and another rejects it shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. Normal rule-interaction constrains the surface-realizations of some input-form, but does not lose all of them. In contrast, rule-conflicts often filter away some input-forms completely. There are many kinds of conflicts, but for the time being only right-arrow conflicts and left-arrow conflicts are automatically resolved by hfst-twolc.
  Unless hfst-twolc is run with the commandline-parameter --no-report, it will report all rule-conflicts it observes, and if it is run with the parameter --resolve, it will resolve the conflicts.
Line: 276 to 278
 

Left/Right -Arrow Conflicts

Added:
>
>
Besides left- and right-arrow conflicts, there are other kinds of unfortunate interactions between rules. Currently hfst-twolc neither reports, nor fixes such interactions, which makes it important for the grammar-writer to be aware of the possibility of them. Left/right -arrow conflicts involve operators of different types and come in two flavors.
 

Rules with Identical Centers

Added:
>
>
Consider the rules
a:b => c _ ;
and
a:b <= d _ ;
The first rule requires that the a:b pair is immediately preceded by the pair c. The second rule requires that a be realised as b whenever it is preceded by d. Together the rules prohibit the occurrence of an input-character a immediately after the pair d.
 

Rules with Different Centers.

List of Reserved Words

Line: 302 to 316
 

Unimplemented Features

Changed:
<
<
This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed.
>
>
This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed. The missing features are gathered from Karttunen and Koskenniemi 1987 and Karttunen 1992.
 
  • Diacritics.
  • Rules containing variables. E.g.
Line: 321 to 336
 
Added:
>
>
 
  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006

Revision 82008-07-16 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Changed:
<
<

OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

>
>

OMorFi: hfst-twolc -- An Open-Source Two-Level Grammar Compiler

 

Usage

Changed:
<
<
htwolc [ --lexicon FILE ] --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ]
>
>
htwolc [ --lexicon FILE ] --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ] [ --no-report ] [ --resolve ]
 
Changed:
<
<
Parameter name function
lexicon the lexicon file. If omitted, the lxicon is read from STDIN.
>
>
Parameter name Meaning
lexicon the lexicon file. If omitted, the lexicon is read from STDIN.
 
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
Changed:
<
<
test Toggle test-mode. If this parameter is present, the rues won't be compiled, but tested instad.
>
>
test Toggle test-mode. If this parameter is present, the rules won't be compiled, but tested instead.
no-report Don't warn about conflicts between rules. If omitted, all rule-conflicts will give a warning.
resolve Attempt to resolve conflicts between rules. If omitted, conflicts aren't resolved.
 

Outline

Line: 27 to 29
 

Syntax

Changed:
<
<
A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of a line.
>
>
A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, that end in a ; character and comments, that begin with a ! character and span to the end of the line.
 
Alphabet
Line: 65 to 67
 
Changed:
<
<

Regular expression syntax

>
>

Regular Expression Syntax

  Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:
Line: 75 to 77
 
  • ?:? same as ?.
  • 0 matches the empty string.
Changed:
<
<
Concatenating pairs, one can build longer regular expressions matcing pairs of strings. Is the alphabet is declared
>
>
By concatenating pairs, one can build longer regular expressions matching pairs of strings. If the alphabet is declared
 
Alphabet
a N:n N:m e
Line: 86 to 88
  Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.
Changed:
<
<
There are seven unary regular-expression operators in htwolc for the time being. Let the Alphabet be [ a N:n N:m e ] and let R denote a regular expression. The unary operators are:
>
>
There are seven unary regular-expression operators in hfst-twolc for the time being. Let the Alphabet be [ a N:n N:m e ] and let R denote a regular expression. The unary operators are:
 
  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, with the alphabet defined above.
Line: 100 to 102
  Let R and S be regular expressions. The binary operators are:
  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
Changed:
<
<
  • The conjunction-operator &. The language R & S mathces any string matched by both R and S and only those.
>
>
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
 
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators would bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.
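
The effect of left-binding can be checked with a small sketch. It is not hfst-twolc code: it assumes, purely for illustration, that a regular expression denotes a finite set of strings, so that the difference operator becomes Python set difference. The names a, left_bound and right_bound are hypothetical.

```python
# Illustration only: model a regular language as a finite set of strings,
# so that the twolc difference operator "-" becomes Python set difference.
a = {"a"}

# Left-binding, as described above: a - a - a is read as [ a - a ] - a.
left_bound = (a - a) - a

# Hypothetical right-binding: a - [ a - a ].
right_bound = a - (a - a)

print(left_bound)   # the empty language
print(right_bound)  # the language containing only the string "a"
```

The first result is the empty set and the second contains only "a", matching the two readings described in the paragraph.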

Changed:
<
<

Precedence

>
>

Operator Precedence

  The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.
Line: 120 to 122
 
[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]
Changed:
<
<

The alphabet

>
>

The Alphabet

  The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character, like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.
Line: 128 to 130
  Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.
Changed:
<
<

The sets

>
>

The Sets

  The second part of the grammar specifies named character-ranges like
Line: 140 to 142
 t:ө <= :Vowel _ :Vowel ;
Changed:
<
<

Definitions

>
>

The Definitions

  The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.
Line: 149 to 151
  The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.
Changed:
<
<

The rules

>
>

The Rules

 
Changed:
<
<

Error-messages and warnings

>
>

Ordinary Two-Level Rules

 
Added:
>
>
Two-level rules consist of a center, a rule-operator and contexts. The center is a pair of characters, and a context consists of two regular expressions separated by an underscore. Schematically
a:b OP L1 _ R1 ;
       L2 _ R2 ;
         ...
       Ln _ Rn ;
 
Added:
>
>
A rule has to have at least one context and it may have as many as are needed.
 
Changed:
<
<

List of reserved words

>
>
The rules are constraints regulating the distribution of the center-pairs according to the rule-operator and contexts given. Four different kinds of rule-operators may be used in hfst-twolc:
<=, =>, <=> and \<=
The final context, which is compiled into the transducer representing the two-level rule, is the union of the contexts given.
 
Added:
>
>
Right-arrow rules constrain the distribution of a symbol-pair by specifying, that it may only occur in a specific context (or some specific contexts). Let the set V be the set of vowels in some language. An example of a right-arrow rule is
 
Changed:
<
<
Alphabet Definitions Rules Sets ! ; ? : _ | >     < <=> \<= [ ] ( ) * + $ $. ~ < > - " = 0 ^
>
>
I:j => :V _ :V ;
It states that the input-character I can be realised as j only in a context where it is surrounded by output vowels, i.e. that the occurrence of the pair I:j is limited to positions between surface-vowels.

The context :V _ :V in the example is automatically extended to a so called total context, by hfst-twolc. This means that, when the rule is compiled, the context will become ?* :V _ :V ?*. This applies to all kinds of rule-operators.
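
A minimal sketch of what the right-arrow example demands of a pair-string. This is not how hfst-twolc compiles rules (they become transducers); it assumes that a pair-string is a list of (input, output) tuples and that V is the hypothetical vowel set VOWELS below. Because the check looks only at the immediate neighbours of the center and ignores the rest of the string, it mirrors the extension of the context to the total context ?* :V _ :V ?*.

```python
# Illustration only: a pair-string is a list of (input, output) tuples.
# VOWELS stands in for the set V of the example; its members are assumed.
VOWELS = {"a", "e", "i", "o", "u"}

def obeys_right_arrow(pair_string):
    """Check the rule  I:j => :V _ :V ;  on one pair-string."""
    for pos, pair in enumerate(pair_string):
        if pair != ("I", "j"):
            continue  # the rule only restricts the center pair I:j
        has_left = pos > 0 and pair_string[pos - 1][1] in VOWELS
        has_right = (pos + 1 < len(pair_string)
                     and pair_string[pos + 1][1] in VOWELS)
        if not (has_left and has_right):
            return False
    return True

between_vowels = [("a", "a"), ("I", "j"), ("a", "a")]
after_consonant = [("k", "k"), ("I", "j"), ("a", "a")]
other_realization = [("k", "k"), ("I", "i"), ("a", "a")]

print(obeys_right_arrow(between_vowels))     # True
print(obeys_right_arrow(after_consonant))    # False
print(obeys_right_arrow(other_realization))  # True
```

The last case passes because a right-arrow rule says nothing about pairs other than its center: I:i is unrestricted.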

Left-arrow rules constrain the set of output-characters corresponding to an input-character in some context. An example of a left-arrow rule is

N:m <= _ p: ;
It states, that an input-character N has to be realized as the output-character m if it is followed by some pair with input-character p.
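
The same kind of sketch for the left-arrow example, again assuming that a pair-string is a list of (input, output) tuples. The function name and the test strings are hypothetical; the pair p:m in the first string is borrowed from the example grammar.

```python
def obeys_left_arrow(pair_string):
    """Check the rule  N:m <= _ p: ;  on one pair-string."""
    for pos, (inp, out) in enumerate(pair_string):
        followed_by_p = (pos + 1 < len(pair_string)
                         and pair_string[pos + 1][0] == "p")
        # An input N before an input p must surface as m.
        if inp == "N" and followed_by_p and out != "m":
            return False
    return True

kamman = [("k", "k"), ("a", "a"), ("N", "m"), ("p", "m"), ("a", "a"), ("n", "n")]
wrong_output = [("k", "k"), ("a", "a"), ("N", "n"), ("p", "p"), ("a", "a"), ("n", "n")]
other_context = [("k", "k"), ("a", "a"), ("N", "m"), ("t", "t"), ("a", "a"), ("n", "n")]

print(obeys_left_arrow(kamman))         # True
print(obeys_left_arrow(wrong_output))   # False
print(obeys_left_arrow(other_context))  # True
```

The third string is accepted even though N:m occurs before t: a left-arrow rule forces the realization inside its context but does not forbid the center pair elsewhere.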

Left-arrow rules differ from right-arrow rules, because they are asymmetric with regard to the input- and output-level of pair-strings. The left-arrow example above doesn't limit the input-character of a pair preceding p:, it only limits the output-character if the input-character is N. Such an asymmetry is not present in right-arrow rules, which limit a particular pair to a particular kind of context.

Left/right -arrow rules give necessary and sufficient conditions for the realization of an input-character as some output-character. An example of a left/right -arrow rule is

k:' <=> :a :a _ :a ;
which states that k is realized as ' exactly in contexts where two output a phones precede and one follows (this describes a convention of Finnish orthography stemming from consonant gradation). Any left/right -arrow rule is equivalent to the joint effect of the corresponding left- and right-arrow rules. Hence the example is equivalent to the pair of rules
k:' <= :a :a _ :a ;
and
k:' => :a :a _ :a ;
 
Deleted:
<
<
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.
 
Changed:
<
<

Types of rules

Ordinary twol-rules

Generalized context-restrictions

>
>
Prohibition rules disallow the realization of an input-character as some output-character in some contexts. Let again V denote the set of vowels. An example of a prohibition rule is
I:i \<= :V _ :V ;
which states, that the input-character I may not be realized as i between output-vowels.

Like right-arrow rules, prohibition rules are symmetric with respect to the input- and output-level of pair-strings. In fact it is often possible to state a particular constraint both as a prohibition rule concerning some pair and a left-arrow rule concerning another. If the input-character I may only be realized as i or j, then the rules

I:i \<= :V _ :V ;
and
I:j <= :V _ :V ;
state exactly the same constraint. Still, if the number of realizations is greater, it may be much easier to state the constraint using one of the operators than the other.
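
The claimed equivalence between a prohibition rule and a left-arrow rule can be checked by brute force. The sketch below compares the prohibition rule on I:i with a left-arrow rule on I:j (written here as I:j <= :V _ :V ;, which is an assumption of this sketch) over all pair-strings up to a small length. The alphabet PAIRS, the vowel set VOWELS and all function names are hypothetical.

```python
from itertools import product

# Assumed toy alphabet in which I is realized only as i or j.
VOWELS = {"a", "i"}
PAIRS = [("a", "a"), ("k", "k"), ("I", "i"), ("I", "j")]

def between_output_vowels(pair_string, pos):
    """True if both neighbours of position pos have a vowel as output."""
    return (0 < pos < len(pair_string) - 1
            and pair_string[pos - 1][1] in VOWELS
            and pair_string[pos + 1][1] in VOWELS)

def obeys_prohibition(pair_string):
    """I:i \\<= :V _ :V ;  (I:i may not occur between output vowels)."""
    return not any(pair == ("I", "i") and between_output_vowels(pair_string, pos)
                   for pos, pair in enumerate(pair_string))

def obeys_left_arrow(pair_string):
    """I:j <= :V _ :V ;  (between output vowels, I must surface as j)."""
    return all(out == "j"
               for pos, (inp, out) in enumerate(pair_string)
               if inp == "I" and between_output_vowels(pair_string, pos))

def rules_agree(max_length=4):
    """Compare both rules on every pair-string of at most max_length pairs."""
    return all(obeys_prohibition(s) == obeys_left_arrow(s)
               for n in range(max_length + 1)
               for s in product(PAIRS, repeat=n))

print(rules_agree())
```

The check only covers short strings over one toy alphabet, so it illustrates the equivalence rather than proving it.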

Generalized Context-Restrictions

Special Rule-Constructs

 
Changed:
<
<

Special rule-constructs

>
>

Error-Messages and Warnings

 
Changed:
<
<

Resolution of conflicts between the rules

>
>

Resolution of Conflicts between the Rules

  A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These are the so-called right- and left-arrow conflicts, which are handled in hfst-twolc by the mechanism of conflict-resolution.
Line: 211 to 256
 

Left-Arrow Conflicts

Changed:
<
<
Left-arrow conflicts occur between right-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules
>
>
Left-arrow conflicts occur between left-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules
 
"Rule 3"
a:b <= c _ ;
Line: 224 to 269
  In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general X _. This might not be the case though. The contexts might be such, that neither is a sub-context of the other. This makes left-arrow-conflicts more complicated than right-arrow-conflicts.
Added:
>
>
The approach taken in hfst-twolc is to warn about all left-arrow conflicts, but only fix those left-arrow conflicts where one of the rules is a special case of the other. The conflict is fixed by modifying the more general rule so that it only applies in contexts where the more specific rule doesn't apply. In the example above, the resolution-process doesn't affect Rule 3, but changes Rule 4 so that it becomes equivalent to the rule
 
a <= d _ ;
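
The resolution step can be sketched as set subtraction on contexts. The sketch is a simplification and not the hfst-twolc implementation: it assumes that a left context is just the set of pairs allowed immediately before the center, and that X stands for the identity pairs c:c and d:d only. All names are hypothetical.

```python
def obeys_left_arrow(pair_string, output, left_contexts):
    """a:output <= X _ ;  for any pair X in left_contexts."""
    for pos, (inp, out) in enumerate(pair_string):
        in_context = pos > 0 and pair_string[pos - 1] in left_contexts
        if inp == "a" and in_context and out != output:
            return False
    return True

C, D = ("c", "c"), ("d", "d")
rule3_context = {C}     # "Rule 3":  a:b <= c _ ;
rule4_context = {C, D}  # "Rule 4":  a <= X _ ;  with X = c d

# Resolution: the general rule keeps only the contexts that the
# more specific rule does not cover, here d _ .
resolved_rule4_context = rule4_context - rule3_context

def grammar_accepts(pair_string, context_for_rule4):
    return (obeys_left_arrow(pair_string, "b", rule3_context)
            and obeys_left_arrow(pair_string, "a", context_for_rule4))

after_c = [[C, ("a", out)] for out in ("a", "b")]

# Before resolution no realization of a survives after c.
print([grammar_accepts(s, rule4_context) for s in after_c])
# After resolution a:b is accepted after c, and a:a after d.
print([grammar_accepts(s, resolved_rule4_context) for s in after_c])
print(grammar_accepts([D, ("a", "a")], resolved_rule4_context))
```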
 
Added:
>
>

Left/Right -Arrow Conflicts

 
Added:
>
>

Rules with Identical Centers

 
Changed:
<
<

Left-Right-Arrow Conflicts

>
>

Rules with Different Centers.

 
Changed:
<
<

Rules with identical centers

>
>

List of Reserved Words

 
Changed:
<
<

Rules with different centers.

>
>
Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       \<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.
 
Changed:
<
<

A test-tool for grammars

>
>

A Test-Tool for Grammars

 
Changed:
<
<

Getting the program

>
>

Getting the Program

 
Changed:
<
<

Installing

>
>

Installing the Program

 
Changed:
<
<

Unimplemented features

>
>

Unimplemented Features

  This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed.

  • Diacritics.
Deleted:
<
<
  • One should be able to freely insert newlines in regular expressions.
  • In rules with multiple contexts, different contexts should be separated by a semicolon ;, not by an obligatory newline.
 
  • Rules containing variables. E.g.
    e:ẽ <= Nx _ Nx ; where Nx in Nasal ;
    Here Nasal is a set (in the Xerox meaning).
  • where ... ( matched | freely | mixed ) construction.
  • A default symbol for word-boundaries (not clear, whether this will be implemented or not).
Changed:
<
<
  • Resolution of conflicts between the rules.
>
>
 

Differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

Line: 262 to 321
 
Changed:
<
<
  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006
>
>
  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006
 

Revision 72008-07-15 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

Line: 183 to 183
  Unless hfst-twolc is run with the commandline-parameter --no-report, it will report all rule-conflicts it observes, and if it is run with the parameter --resolve, it will resolve the conflicts.
Changed:
<
<
The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen and Koskenniemi 1987.
>
>
The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen, Koskenniemi and Kaplan 1987.
 

Right-Arrow Conflicts

Line: 222 to 222
  Rule 3 requires that an input a be realised as b following c. The problem is that Rule 4 requires that it be realised as a following any pair in X:X, among others c. Hence the total effect of the rules is to disallow the occurrence of a pair with input-character a after the pair c.
Changed:
<
<
In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general context X _
>
>
In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general X _. This might not be the case though. The contexts might be such, that neither is a sub-context of the other. This makes left-arrow-conflicts more complicated than right-arrow-conflicts.

 

Left-Right-Arrow Conflicts

Revision 62008-07-15 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

Line: 179 to 179
  A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These are the so-called right- and left-arrow conflicts, which are handled in hfst-twolc by the mechanism of conflict-resolution.
Changed:
<
<
A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in
>
>
A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in Yli-Jyrä and Koskenniemi 2006. There are many kinds of conflicts, but for the time-being only right-arrow conflicts and left-arrow-conflicts are automatically resolved by hfst-twolc.

Unless hfst-twolc is run with the commandline-parameter --no-report, it will report all rule-conflicts it observes, and if it is run with the parameter --resolve, it will resolve the conflicts.

The examples given below of right-arrow and left-arrow conflicts are very similar to those given in Karttunen and Koskenniemi 1987.

 

Right-Arrow Conflicts

Changed:
<
<
Consider the rules
>
>
Right-arrow conflicts occur between right-arrow rules (or left-right-arrow rules) with identical centers. Consider the rules
 
"Rule 1"
a:b => c _ ;
Line: 192 to 196
 a:b => d _ ;
Added:
>
>
Since Rule 1 requires that all pairs a:b be preceded by c, and Rule 2 that they be preceded by d, their intersection disallows all occurrences of a:b. This may be considered to be an accident.

When hfst-twolc encounters rules that are in right-arrow conflict, it reports and resolves the conflict, printing
There is a => conflict between the rules Rule1 and Rule2 with respect to the center a:b. Resolving the conflict by joining contexts.
and collapsing the rules into a single rule
a:b => c _ ; d _ ;
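
Why joining the contexts repairs the grammar can be seen in a small sketch. It is not the hfst-twolc implementation: it assumes that a pair-string is a list of (input, output) tuples and that a left context is the set of pairs allowed immediately before the center a:b. All names are hypothetical.

```python
def obeys_right_arrow(pair_string, left_contexts):
    """a:b => X _ ;  for any pair X in left_contexts."""
    for pos, pair in enumerate(pair_string):
        if pair != ("a", "b"):
            continue
        if pos == 0 or pair_string[pos - 1] not in left_contexts:
            return False
    return True

C, D = ("c", "c"), ("d", "d")
rule1_context = {C}  # "Rule 1":  a:b => c _ ;
rule2_context = {D}  # "Rule 2":  a:b => d _ ;

# Resolution by joining contexts:  a:b => c _ ; d _ ;
joined_context = rule1_context | rule2_context

def both_rules_accept(pair_string):
    return (obeys_right_arrow(pair_string, rule1_context)
            and obeys_right_arrow(pair_string, rule2_context))

after_c = [C, ("a", "b")]
after_d = [D, ("a", "b")]

# Applied separately, the two rules reject every occurrence of a:b.
print(both_rules_accept(after_c), both_rules_accept(after_d))
# The joined rule accepts a:b after c and after d.
print(obeys_right_arrow(after_c, joined_context),
      obeys_right_arrow(after_d, joined_context))
```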

 

Left-Arrow Conflicts

Added:
>
>
Left-arrow conflicts occur between right-arrow rules, that deal with the same center-input-character, but different center-output-characters and non-disjoint contexts. Let X denote the set c d. Consider the rules
"Rule 3"
a:b <= c _ ;

"Rule 4"
a <= X _ ;

Rule 3 requires that an input a be realised as b following c. The problem is that Rule 4 requires that it be realised as a following any pair in X:X, among others c. Hence the total effect of the rules is to disallow the occurrence of a pair with input-character a after the pair c.

In the example, Rule 3 may be regarded as a special case of Rule 4, since the context c _ is a sub-context of the more general context X _

Left-Right-Arrow Conflicts

Rules with identical centers

Rules with different centers.

 

A test-tool for grammars

Getting the program

Revision 52008-07-15 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

Line: 177 to 177
 

Resolution of conflicts between the rules

Added:
>
>
A pair-string is accepted by a two-level grammar iff it is accepted by each of the rules in the grammar. Hence there may be strings that are accepted by some of the rules and rejected by others. While this is often intentional, there are at least two cases where it has proven beneficial for the overall quality of the grammar to make some automatic modifications to the rules. These are the so-called right- and left-arrow conflicts, which are handled in hfst-twolc by the mechanism of conflict-resolution.

A situation, where one rule accepts a pair-string and another rejects it, shouldn't always be regarded as a conflict. In hfst-twolc it is regarded as a conflict, only if both of the rules are actually applied in the sense discussed in

Right-Arrow Conflicts

Consider the rules

"Rule 1"
a:b => c _ ;

"Rule 2"
a:b => d _ ;

Left-Arrow Conflicts

 

A test-tool for grammars

Getting the program

Line: 205 to 222
 

References

Added:
>
>
  • A. Yli-Jyrä, K. Koskenniemi, Compiling Generalized Two-Level Rules and Grammars, Advances in Natural Language Processing, Springer Berlin/Heidelberg, pages 174-185, 2006
 

Revision 42008-07-08 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"

OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

Line: 36 to 36
 a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;

Sets

Changed:
<
<
Consonant = b c d f g h j k l m n p q r s t v w x z N:m N:n ;
>
>
Consonant = b c d f g h j k l m n p q r s t v w x z m n ;
 Vowel = a e i o u y å ä ö ;

Definitions

Changed:
<
<
ClosedSyllable = Vowel+ [ ~Vowel ]+ ;
>
>
ClosedSyllable = :Vowel+ [ ~:Vowel ]+ ;
  Rules
Line: 65 to 65
 
Changed:
<
<

The alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character, like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.

Every pair of characters referred to in the rules has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs that aren't declared in the alphabet.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

The sets

The second part of the grammar specifies named character-ranges like

Vowel  = a e i o u y å ä ö ;

When the symbol Vowel is encountered in a rule, it will be translated to the range

[a | e | i | o | u | y | å | ä | ö]
Hence e.g. Vowel+ in a rule means one or more characters form this set.

The same rules apply for set-names as for alphabet-characters.

It is possible to define a set having the same name as a character in the alphabet. This means that the character will be recognized as itself, when input is read when using the compiled grammar. In the declaration of rules, on the other hand, the name is be considered to denote a set and will be expanded. Confusing.

Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.

The rules

Error-messages and warnings

Regular expression syntax

>
>

Regular expression syntax

  Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:
Line: 143 to 105
  By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a i.e. matches the empty language. If the binary operators would bind from the right, then a - a - a would be equivalent to a - [ a - a ] i.e. equivalent to a.
Changed:
<
<

Precedence

>
>

Precedence

  The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.
Line: 158 to 120
 
[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]
Added:
>
>

The alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs of an input-character and an output-character, like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.

Every pair of characters referred to in the rules has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs that aren't declared in the alphabet.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

The sets

The second part of the grammar specifies named character-ranges like

Vowel  = a e i o u y å ä ö ;

Sets may be used in rules as a short-hand for collections of character-pairs. One might, for instance, want to write a rule which states that the phoneme t is realised as its voiced fricative counter-part ө between two phonemes which are realised as vowels. This could be accomplished by a rule

t:ө <= :Vowel _ :Vowel ;

Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.

The rules

Error-messages and warnings

 

List of reserved words

Line: 191 to 187
  This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed.
Deleted:
<
<
  • Sets should be implemented as collections of single characters. E.g.
    Vowel = a e i o u y å ä ö ;
    is a set of vowel-characters. Using the set Vowel one can define different regular languages like
    • Vowel the language consisting of all pairs in the alphabet, where both the input and output character is in the set Vowel.
    • :Vowel the language consisting of all pairs in the alphabet, where the ouput character is in the set Vowel.
 
  • Diacritics.
  • One should be able to freely insert newlines in regular expressions.
  • In rules with multiple contexts, different contexts should be separated by a semicolon ;, not by an obligatory newline.

Revision 32008-07-02 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="HfstHome"
Changed:
<
<

HFST: Two-Level Rule Compiler

>
>

OMorFi: htwolc -- An Open-source Two-level Grammar Compiler

 
Changed:
<
<
see also:

Warning: Can't find topic KitWiki.OMorFiHtwolc

>
>
 
Changed:
<
<
>
>

Usage

htwolc [ --lexicon FILE ] --input FILE [ --output FILE ] [ --test_file FILE ] [ --test ]

Parameter name function
lexicon the lexicon file. If omitted, the lxicon is read from STDIN.
input the rule file.
output If omitted, the resulting transducer is written to STDOUT.
test_file A file containing test-pairs for the grammar.
test Toggle test-mode. If this parameter is present, the rues won't be compiled, but tested instad.

Outline

Terms and concepts:

input string
the string to be transformed by a FST (in Xerox terminology upper string; in SFST terminology analysis string, sometimes the deep string)
output string
the string into which the FST transforms the input string (in Xerox terminology lower string; in SFST terminology surface string)
set of characters
a set of characters (in SFST terminology range but the word "range" would imply the inclusion of all members between the two extremes)
set of pairs
a subset of feasible character pairs (corresponds to the disjunction of the pairs listed in the definition).
input symbol
a token to be input to a FST; the left-hand side of a pair, i.e. a in a pair a:b

Syntax

A twol-grammar consists of four parts: Alphabet, Sets, Definitions and Rules. Each part contains statements, which end in a ; character, and comments, which begin with a ! character and span to the end of the line.

Alphabet

! The alphabet should contain all pairs used in the rules.
! Characters consist of strings of utf-8 characters. No white-space, though!
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö N:n N:m ;

Sets
Consonant = b c d f g h j k l m n p q r s t v w x z N:m N:n ;
Vowel = a e i o u y å ä ö ;

Definitions

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

Rules

! input/output -pairs for testing the rule-set:

!input:  k a N p a n 
!output: k a m m a n

!input:  k a N T a n 
!output: k a n n a n

!input:  k a m p i 
!output: k a m p i

"N:m before input-character p"
! A common morpho-phonetic phenomenon
N:m <=> _ p: ;

"Degradation of p to m after input-character N"
p:m <=> N: _ ;

The alphabet

The first part specifies the alphabet of the rules. The alphabet consists of pairs, each made up of an input-character and an output-character, like a:a or N:m. If the input-character and output-character are the same, it is customary to denote the pair by the input-symbol, so a:a is usually written a.

Every character pair referred to in the rules has to be declared in the alphabet. Otherwise a warning will be issued. The grammar will still be compiled, but the rules may be compiled erroneously. E.g. the any-character ? denotes any pair declared in the alphabet and only those. Hence ? won't match pairs that aren't declared in the alphabet.

Any non-empty string of non-white-space UTF-8 characters, that isn't a reserved word, is a valid alphabet-character. For now this means, that the characters shouldn't contain newlines, spaces, tabs or carriage-returns and shouldn't be found in the section List of reserved words below.

The sets

The second part of the grammar specifies named character-ranges like

Vowel  = a e i o u y å ä ö ;

When the symbol Vowel is encountered in a rule, it will be translated to the range

[a | e | i | o | u | y | å | ä | ö]
Hence e.g. Vowel+ in a rule means one or more characters from this set.

The same rules apply for set-names as for alphabet-characters.

It is possible to define a set having the same name as a character in the alphabet. This means that the character will be recognized as itself when input is read using the compiled grammar. In the declaration of rules, on the other hand, the name is considered to denote a set and will be expanded. Confusing.

Definitions

The third part of the grammar specifies named regular expressions, which may be used as a part of definitions of rules, e.g.

ClosedSyllable = Vowel+ [ ~Vowel ]+ ;

The regular-expression syntax is the same as the syntax used in the two-level rules of the grammar. It is possible to define a named regular expression having the same name as a set or alphabet character. It will over-shadow the declaration of the set.

The rules

Error-messages and warnings

 
Changed:
<
<
-- KristerLinden - 27 May 2008
>
>

Regular expression syntax

Any character-pair defined in the alphabet is a regular expression e.g. a or a:b. The following special pair-constructs are available:

  • a:? and a: match any pair in the alphabet having input-character a.
  • ?:a and :a match any pair in the alphabet having output-character a.
  • ? matches any pair in the alphabet.
  • ?:? same as ?.
  • 0 matches the empty string.
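The pair-constructs above can be illustrated with a small Python model. This is only a sketch and not how htwolc itself is implemented: the names ALPHABET, parse_pair and match_pairs are hypothetical, pairs are modelled as (input, output) tuples, and the empty-string symbol 0 is left out.

```python
# Hypothetical model: a pair is an (input, output) tuple and the
# alphabet is the list of declared pairs.
ALPHABET = [("a", "a"), ("N", "n"), ("N", "m"), ("e", "e")]


def parse_pair(symbol):
    """Split a pair such as 'N:m' into (input, output); 'a' means a:a."""
    if ":" in symbol:
        inp, out = symbol.split(":", 1)
        return inp, out
    return symbol, symbol


def match_pairs(pattern, alphabet=ALPHABET):
    """Return the declared pairs matched by a single pair-construct."""
    if pattern == "?":
        return list(alphabet)
    inp, out = parse_pair(pattern)
    # An omitted side ('a:' or ':a') behaves like '?' on that side.
    any_in = inp in ("", "?")
    any_out = out in ("", "?")
    return [
        (i, o)
        for (i, o) in alphabet
        if (any_in or i == inp) and (any_out or o == out)
    ]


if __name__ == "__main__":
    for pattern in ["N:", "N:?", ":m", "?", "?:?", "x:y"]:
        print(pattern, "->", match_pairs(pattern))
```

Note that a pair which is not declared in the alphabet, such as x:y in this model, matches nothing.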

By concatenating pairs, one can build longer regular expressions matching pairs of strings. If the alphabet is declared as

Alphabet
a N:n N:m e
then the regular expression a N: e will match the pairs of strings a N:n e and a N:m e.
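The expansion in this example can be reproduced with a short Python sketch. It assumes the alphabet a N:n N:m e from the example; the helper names expand and concatenate are hypothetical, and only plain pairs and the X: construct are handled.

```python
from itertools import product

# Alphabet from the example above, written as pair symbols.
ALPHABET = ["a", "N:n", "N:m", "e"]


def expand(symbol):
    """Declared pairs matched by one symbol; 'X:' matches any pair with input X."""
    if symbol.endswith(":"):
        return [p for p in ALPHABET if p.split(":")[0] == symbol[:-1]]
    return [p for p in ALPHABET if p == symbol]


def concatenate(expression):
    """All pair-strings matched by a space-separated concatenation of symbols."""
    choices = [expand(s) for s in expression.split()]
    return [" ".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for pair_string in concatenate("a N: e"):
        print(pair_string)
```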

Regular expressions can be grouped together using the parenthesis-constructions [ ... ] and ( ... ). If R is a regular expression, then [ R ] matches exactly the same pairs of strings as R does. The construction ( R ), on the other hand, matches the empty string as well.

Grouping becomes important, when one uses unary regular expression operators. Unary operators like * have higher precedence, than concatenation. This means that e.g. a b* is equivalent to [ a ] [ b * ]. If one wants the * operator to apply to the whole expression a b one has to group the expressions a and b together i.e. [ a b ]*.

There are seven unary regular-expression operators in htwolc for the time being. Let the Alphabet be [ a N:n N:m e ] and let R denote a regular expression. The unary operators are:

  • The power-operator ^INTEGER, which is equivalent to concatenation of the argument-expression with itself INTEGER times. E.g. a^3 is equivalent to a a a.
  • The containment-operator $. The regular-expression $R matches any string containing at least one substring matched by R. E.g. $a is equivalent to [ a N:n N:m e ]* a [ a N:n N:m e]*, with the alphabet defined above.
  • The exact containment-operator $. is similar to the containment operator, but the matching strings have to contain exactly one substring matching R. E.g. $.a is equivalent to [ N:n N:m e ]* a [ N:n N:m e]* with the Alphabet defined above.
  • The term-complement-operator \. The term-complement of R is the language \R containing every pair, that is not matched by R. E.g. \a is equivalent to [ N:n N:m e ] with the Alphabet defined above. Note that the term-complement is not the same thing as the negation of a language.
  • The negation-operator ~. The negation of a regular-expression R contains all strings not matched by R.
  • The Kleene-star *. The language R* matches any string that is a concatenation of any number of strings from R. Note that the empty string, which is the concatenation of zero strings, is also matched. E.g. a* matches the empty string, a, a a, a a a and so on.
  • The plus-operator + resembles the *, but it only matches strings which are concatenations of a positive number of strings from R. Consequently R+ matches the empty string iff R matches the empty string. E.g. a+ matches a, a a, a a a and so on.
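The difference between term-complement, negation and the two containment operators can be checked on a toy model. The Python sketch below is an illustration under stated assumptions, not the compiler's algorithm: languages are finite sets of pair-strings (tuples of pair symbols), only strings up to MAX_LEN pairs are enumerated, only non-empty substrings are counted, and all function names are hypothetical.

```python
from itertools import product

ALPHABET = ["a", "N:n", "N:m", "e"]
MAX_LEN = 3  # assumption: only strings of at most this many pairs are enumerated


def universe(max_len=MAX_LEN):
    """All pair-strings over the alphabet with length <= max_len."""
    return {s for n in range(max_len + 1) for s in product(ALPHABET, repeat=n)}


def term_complement(lang):
    """Every single declared pair that is not matched by lang."""
    return {(p,) for p in ALPHABET} - lang


def negation(lang, max_len=MAX_LEN):
    """Every string (up to the bound) that is not matched by lang."""
    return universe(max_len) - lang


def count_occurrences(s, lang):
    """Number of non-empty substrings of s that belong to lang."""
    return sum(
        1
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
        if s[i:j] in lang
    )


def containment(lang, max_len=MAX_LEN):
    """Strings containing at least one substring from lang."""
    return {s for s in universe(max_len) if count_occurrences(s, lang) >= 1}


def containment_once(lang, max_len=MAX_LEN):
    """Strings containing exactly one substring from lang."""
    return {s for s in universe(max_len) if count_occurrences(s, lang) == 1}


if __name__ == "__main__":
    R = {("a",)}
    print(sorted(term_complement(R)))
    print(("a", "a") in containment(R), ("a", "a") in containment_once(R))
```

In this model the empty string belongs to the negation of a but not to its term-complement, which is one way to see that the two operators differ.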

In addition to the unary operators there are three binary operators, which may be used to build regular expressions out of existing ones. Binary operators have the lowest precedence. Hence, when using the disjunction-operation |, e.g. a b* | c d is equivalent to [ a b* ] | [ c d ] and will match anything matched by a b* or by c d. One can group expressions together so a [ b * | c ] d will match a string beginning with a followed by zero or more b symbols or a c and ending with a d.

Let R and S be regular expressions. The binary operators are:

  • The disjunction-operator |. The language R | S matches any string matched by R or S and only those.
  • The conjunction-operator &. The language R & S matches any string matched by both R and S and only those.
  • The difference-operator -. The language R - S matches any string matched by R, but not by S and only those.

By default the binary operations bind from the left. Hence a - a - a is equivalent to [ a - a ] - a, i.e. it denotes the empty language. If the binary operators bound from the right, then a - a - a would be equivalent to a - [ a - a ], i.e. equivalent to a.
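The effect of the binding direction can be seen by modelling languages as Python sets, whose |, & and - operators happen to correspond to disjunction, conjunction and difference and also bind from the left. This is only an analogy; the variable names are illustrative.

```python
# Languages modelled as sets of pair-strings (tuples of pair symbols).
a = {("a",)}

left_assoc = (a - a) - a   # how a - a - a is grouped: [ a - a ] - a
right_assoc = a - (a - a)  # the alternative grouping: a - [ a - a ]

print(left_assoc == set())  # the empty language
print(right_assoc == a)     # the language a itself
```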

Precedence

The operators in htwolc have different precedence. As a rule of thumb unary operators are the strongest, then concatenation and last binary operators. The constructions [ ... ] and ( ... ) override all other precedence rules.

Operators ordered by precedence from strongest to weakest:

  1. Unary operators: ^INTEGER, $, $., \, ~, *, +
  2. Concatenation
  3. Binary operators: |, &, -

E.g. ~a^3 b | c d* is interpreted as

[  [ ~[ a ^ 3]  ] b ] | [ c [ d* ]  ]

List of reserved words

Alphabet  Definitions  Rules  Sets 
!         ;            ?      :        
_         |            =>     <=        
<=>       \<=          [      ]
(         )            *      +
$         $.           ~      <
>         -            "      \
=         0           ^
The words and constructs may be used in rules by quoting with \. E.g. \? means question-mark, not any character-pair defined in the alphabet and \Sets is an ordinary name Sets not a declaration, that definitions of sets will follow. In the previous example \Sets could be used as a character in the alphabet, the name of a regular expression in the definition section of the grammar or the name of a set.

Types of rules

Ordinary twol-rules

Generalized context-restrictions

Special rule-constructs

Resolution of conflicts between the rules

A test-tool for grammars

Getting the program

Installing

Unimplemented features

This list contains features which, for the time being, are lacking from hfst-twolc, but will be added, or have been implemented differently from Xerox twolc, but will be changed.

  • Sets should be implemented as collections of single characters. E.g.
    Vowel = a e i o u y å ä ö ;
    is a set of vowel-characters. Using the set Vowel one can define different regular languages like
    • Vowel the language consisting of all pairs in the alphabet, where both the input and output character is in the set Vowel.
    • :Vowel the language consisting of all pairs in the alphabet, where the output character is in the set Vowel.
  • Diacritics.
  • One should be able to freely insert newlines in regular expressions.
  • In rules with multiple contexts, different contexts should be separated by a semicolon ;, not by an obligatory newline.
  • Rules containing variables. E.g.
    e:ẽ <= Nx _ Nx ; where Nx in Nasal ;
    Here Nasal is a set (in the Xerox meaning).
  • where ... ( matched | freely | mixed ) construction.
  • A default symbol for word-boundaries (not clear, whether this will be implemented or not).
  • Resolution of conflicts between the rules.

Differences from Xerox twolc

This list contains features, which are intended to differ from corresponding features in the Xerox twolc program.

  • All valid character-pairs should be declared in the Alphabet. Other character-pairs may be used in the rules, but this will raise a warning. The construction ? (and corresponding constructions) in regular expressions only matches character-pairs, which have been declared in the Alphabet.

References


<--  
-->
-- MiikkaSilfverberg - 13 May 2008
 
META TOPICMOVED by="KristerLinden" date="1212070743" from="KitWiki.HFSTTwolC" to="KitWiki.HfstTwolC"

Revision 22008-05-29 - KristerLinden

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="HFSTHome"
>
>
META TOPICPARENT name="HfstHome"
 

HFST: Two-Level Rule Compiler

see also:

Line: 8 to 8
 

-- KristerLinden - 27 May 2008

Added:
>
>
META TOPICMOVED by="KristerLinden" date="1212070743" from="KitWiki.HFSTTwolC" to="KitWiki.HfstTwolC"

Revision 12008-05-27 - KristerLinden

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="HFSTHome"

HFST: Two-Level Rule Compiler

see also:

Warning: Can't find topic KitWiki.OMorFiHtwolc

-- KristerLinden - 27 May 2008

 