HFST: Finnish OT Prosody

We exemplify the use of the HFST command line tools with an example taken from Beesley & Karttunen that maps Finnish words into a prosodic representation. The mapping splits the words into syllables, adds primary and secondary stress marks, and organizes the syllables into feet.

For more information on the representation, see the original solution. $FORMAT is the implementation type of the transducer. The solution given on this page can also be executed with a single script.
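
The commands on this page assume that $FORMAT is already set in the shell. A minimal sketch of such a setup follows. The default value openfst-tropical is an assumption made for illustration (sfst and foma are other HFST backends), so substitute the implementation type you actually use.

```shell
# Pick a backend once and reuse it in every command on this page.
# "openfst-tropical" is an assumed default, not a recommendation.
FORMAT="${FORMAT:-openfst-tropical}"
export FORMAT

# Report early if the HFST tools are not on PATH.
if command -v hfst-regexp2fst >/dev/null 2>&1; then
    echo "using HFST backend: $FORMAT"
else
    echo "hfst-regexp2fst not found; install HFST before running the commands" >&2
fi
```

Setting the variable in one place keeps all the intermediate transducer files in the same format.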

You may find it interesting to compare this OT implementation with a non-OT account of the same data. See the Finnish Non-OT Prosody solution.

Data:

echo "{kalastelet} | {kalasteleminen} | {ilmoittautuminen} |
                 {järjestelmättömyydestänsä} | {kalastelemme} |
                 {ilmoittautumisesta} | {järjestelmällisyydelläni} |
                 {järjestelmällistämätöntä} | {voimisteluttelemasta} |
                 {opiskelija} | {opettamassa} | {kalastelet} |
                 {strukturalismi} | {onnittelemanikin} | {mäki} |
                 {perijä} | {repeämä} | {ergonomia} | {puhelimellani} |
                 {matematiikka} | {puhelimistani} | {rakastajattariansa} |
                 {kuningas} | {kainostelijat} | {ravintolat} |
                 {merkonomin}" | hfst-regexp2fst -f $FORMAT > FinnWords

Basic definitions:

echo '[u | y | i]' | hfst-regexp2fst -f $FORMAT > HighV                          # High vowel
echo '[e | o | ö]' | hfst-regexp2fst -f $FORMAT > MidV                          # Mid vowel
echo '[a | ä]' | hfst-regexp2fst -f $FORMAT > LowV                             # Low vowel
echo '[@"HighV" | @"MidV" | @"LowV"]' | hfst-regexp2fst -f $FORMAT > USV      # Unstressed vowel

echo '[b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z]' | hfst-regexp2fst -f $FORMAT > C  # Consonant

echo '[á | é | í | ó | ú | ý | ä´ | ö´]' | hfst-regexp2fst -f $FORMAT > MSV            # Vowel with main stress
echo '[à | è | ì | ò | ù | y` | ä` | ö`]' | hfst-regexp2fst -f $FORMAT > SSV            # Vowel with secondary stress
echo '[@"MSV" | @"SSV"]' | hfst-regexp2fst -f $FORMAT > SV                              # Stressed vowel
echo '[@"USV" | @"SV"] ' | hfst-regexp2fst -f $FORMAT > V                               # Vowel

echo '[@"V" | @"C"]' | hfst-regexp2fst -f $FORMAT > P                                   # Phone
echo '[[\@"P"+] | .#.]' | hfst-regexp2fst -f $FORMAT > B                             # Boundary

echo '.#. | "."' | hfst-regexp2fst -f $FORMAT > E                                 # Edge
echo '[~$"." "." ~$"."]' | hfst-regexp2fst -f $FORMAT > SB                        # At most one syllable boundary

echo '[@"C"* @"V"]' | hfst-regexp2fst -f $FORMAT > Light                                # Light syllable
echo '[@"Light" @"P"+]' | hfst-regexp2fst -f $FORMAT > Heavy                         # Heavy syllable

echo '[@"Heavy" | @"Light"]' | hfst-regexp2fst -f $FORMAT > S                           # Syllable
echo '[@"S" & $@"SV"]' | hfst-regexp2fst -f $FORMAT > SS                                # Stressed syllable
echo '[@"S" & ~$@"SV"]' | hfst-regexp2fst -f $FORMAT > US                               # Unstressed syllable
echo '[@"S" & $@"MSV"] ' | hfst-regexp2fst -f $FORMAT > MSS                             # Syllable with main stress
echo '[@"S" "." @"S"]' | hfst-regexp2fst -f $FORMAT > BF                                # Binary foot
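
The Light and Heavy definitions above can be mimicked on plain strings to see which syllables fall into which class. The following is only a sketch: syllable_weight is a hypothetical helper, and it handles only the unstressed ASCII vowels (no ä, ö, or stressed vowels) so that the bracket expressions behave the same in any locale.

```shell
# syllable_weight SYLLABLE: classify one syllable as light or heavy.
# Light = consonants followed by one vowel; Heavy = Light plus more phones.
syllable_weight() {
    consonants='bcdfghjklmnpqrstvwxz'
    vowels='aeiouy'
    if printf '%s\n' "$1" | grep -Eq "^[$consonants]*[$vowels]\$"; then
        echo light
    elif printf '%s\n' "$1" | grep -Eq "^[$consonants]*[$vowels][$consonants$vowels]+\$"; then
        echo heavy
    else
        echo "not a syllable"
    fi
}

syllable_weight ka     # prints: light
syllable_weight las    # prints: heavy
syllable_weight kai    # prints: heavy
```

A syllable that ends in a consonant or contains a second vowel counts as heavy, which is what the StressToWeight and FootBin constraints later rely on.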

Gen:

# A diphthong is a combination of two unlike vowels that together form
# the nucleus of a syllable. In general, Finnish diphthongs end in a high vowel.
# However, there are three exceptional high-mid diphthongs: ie, uo, and yö
# that historically come from long ee, oo, and öö, respectively.
# All other adjacent vowels must be separated by a syllable boundary.

echo '[ [. .] -> "." || [@"HighV" | @"MidV"] _ @"LowV",
                                           i _ [@"MidV" - e],
                                           u _ [@"MidV" - o],
                                           y _ [@"MidV" - ö] ]' | hfst-regexp2fst -f $FORMAT > MarkNonDiphthongs

# The general syllabification rule has exceptions. In particular, loan
# words such as ate.isti 'atheist' must be partially syllabified in the
# lexicon.

echo '@"C"* @"V"+ @"C"* @-> ... "." || _ @"C" @"V"' | hfst-regexp2fst -f $FORMAT > Syllabify

# Optionally adds primary or secondary stress to the first vowel
# of each syllable.

echo 'a (->) á|à, e (->) é|è, i (->) í|ì, o (->) ó|ò,
              u (->) ú|ù, y (->) ý|y`, ä (->) ä´|ä`, ö (->) ö´|ö`
              || @"E" @"C"* _' | hfst-regexp2fst -f $FORMAT > Stress

# Scan the word, optionally dividing it into any combination of
# unary, binary, and ternary feet. Each foot must contain at least
# one stressed syllable.

echo '[[@"S" ("." @"S" ("." @"S")) & $@"SS"] (->) "(" ... ")" || @"E" _ @"E"]' | hfst-regexp2fst -f $FORMAT > Scan

# In keeping with the idea of "richness of the base", the Gen
# function produces a great number of output candidates for
# even short words. Long words have millions of possible outputs.

echo '[@"MarkNonDiphthongs" .o. @"Syllabify" .o. @"Stress" .o. @"Scan"]' | hfst-regexp2fst -f $FORMAT > Gen

OT constraints:

# We use asterisks to mark constraint violations. Ordinary constraints
# such as Lapse assign single asterisks as the violation marks and the
# candidate with the fewest marks is selected. Gradient constraints
# such as AllFeetFirst mark violations with sequences of asterisks.
# The number increases with distance from the word edge.

# Every instance of * in an output candidate is a violation.

echo '${*}' | hfst-regexp2fst -f $FORMAT > Viol

# We prune candidates with "lenient composition" that eliminates
# candidates that violate the constraint provided that at least
# one output candidate survives.

echo '~@"Viol"' | hfst-regexp2fst -f $FORMAT > Viol0         # No violations
echo '~[@"Viol"^2]' | hfst-regexp2fst -f $FORMAT > Viol1     # At most one violation
echo '~[@"Viol"^3]' | hfst-regexp2fst -f $FORMAT > Viol2     # At most two violations
echo '~[@"Viol"^4]' | hfst-regexp2fst -f $FORMAT > Viol3     # etc.
echo '~[@"Viol"^5]' | hfst-regexp2fst -f $FORMAT > Viol4 
echo '~[@"Viol"^6]' | hfst-regexp2fst -f $FORMAT > Viol5 
echo '~[@"Viol"^7]' | hfst-regexp2fst -f $FORMAT > Viol6 
echo '~[@"Viol"^8]' | hfst-regexp2fst -f $FORMAT > Viol7 
echo '~[@"Viol"^9]' | hfst-regexp2fst -f $FORMAT > Viol8 
echo '~[@"Viol"^10]' | hfst-regexp2fst -f $FORMAT > Viol9 
echo '~[@"Viol"^11]' | hfst-regexp2fst -f $FORMAT > Viol10 
echo '~[@"Viol"^12]' | hfst-regexp2fst -f $FORMAT > Viol11 
echo '~[@"Viol"^13]' | hfst-regexp2fst -f $FORMAT > Viol12 
echo '~[@"Viol"^14]' | hfst-regexp2fst -f $FORMAT > Viol13 
echo '~[@"Viol"^15]' | hfst-regexp2fst -f $FORMAT > Viol14 
echo '~[@"Viol"^16]' | hfst-regexp2fst -f $FORMAT > Viol15 

# This eliminates the violation marks after the candidate set has
# been pruned by a constraint.

echo '{*} -> 0' | hfst-regexp2fst -f $FORMAT > Pardon
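
The pruning described above can be mimicked on plain strings to see what lenient composition does. The following is a sketch under stated assumptions, not how HFST implements the operation: lenient_filter and pardon are hypothetical helper names, and every input line is treated as a candidate for one and the same input word.

```shell
# lenient_filter LIMIT: read candidates on stdin, one per line.
# Keep the lines with at most LIMIT asterisks; if no line qualifies,
# pass every line through unchanged (the "lenient" part).
lenient_filter() {
    awk -v limit="$1" '
        {
            line[NR] = $0
            marks[NR] = gsub(/\*/, "*")   # gsub returns the number of matches
            if (marks[NR] <= limit) survivors++
        }
        END {
            for (i = 1; i <= NR; i++)
                if (survivors == 0 || marks[i] <= limit) print line[i]
        }'
}

# pardon: delete the violation marks, like the Pardon transducer.
pardon() {
    tr -d '*'
}

# Three made-up candidates with 2, 1, and 3 marks.
printf '%s\n' 'a*b*' 'ab*' '*a*b*' |
    lenient_filter 3 | lenient_filter 2 | lenient_filter 1 | lenient_filter 0 |
    pardon
# prints: ab
```

Applying the filter with decreasing limits leaves exactly the candidates with the fewest marks. The last step, with limit 0, would remove every candidate here, so it lets the single remaining one through.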

Constraints:

# In this section we define nine constraints for Finnish prosody,
# listed in the order of their ranking: MainStress, Clash, AlignLeft,
# FootBin, Lapse, NonFinal, StressToWeight, Parse, and AllFeetFirst.
# For the one inviolable constraint, we assign no violation marks.
# Clash, Align-Left and Foot-Bin are always satisfiable in Finnish
# but we assign violation marks so as not to depend on that knowledge.

# Main Stress: The primary stress in Finnish is on the first
#              syllable. This is an inviolable constraint.

echo '[@"B" @"MSS" ~$@"MSS"]' | hfst-regexp2fst -f $FORMAT > MainStress 


# Clash: No stress on adjacent syllables.

echo '@"SS" -> ... {*} || @"SS" @"B" _ ' | hfst-regexp2fst -f $FORMAT > Clash 


# Align-Left: The stressed syllable is initial in the foot.

echo '@"SV" -> ... {*} || .#. ~[?* "(" @"C"*] _ ' | hfst-regexp2fst -f $FORMAT > AlignLeft 


# Foot-Bin: Feet are minimally bimoraic and maximally bisyllabic.

echo '["(" @"Light" ")" | "(" @"S" ["." @"S"]^>1] -> ... {*} ' | hfst-regexp2fst -f $FORMAT > FootBin 


# Lapse: Every unstressed syllable must be adjacent to a stressed
# syllable.

echo '@"US" -> ... {*} || [@"B" @"US" @"B"] _ [@"B" @"US" @"B"]' | hfst-regexp2fst -f $FORMAT > Lapse 


# Non-Final: The final syllable is not stressed.

echo '@"SS" -> ... {*} || _ ~$@"S" .#.' | hfst-regexp2fst -f $FORMAT > NonFinal 


# Stress-To-Weight: Stressed syllables are heavy.

echo '[@"SS" & @"Light"] -> ... {*} || _ ")"| @"E"' | hfst-regexp2fst -f $FORMAT > StressToWeight 


# Parse (License-σ): Syllables are parsed into feet.

echo '@"S" -> ... {*} || @"E" _ @"E"' | hfst-regexp2fst -f $FORMAT > Parse 


# All-Ft-Left: Every foot starts at the beginning of a
#              prosodic word.

echo '[ "(" -> ...   {*} || .#. @"SB" _
                                      .o.
                  "(" -> ... {*}^2 || .#. @"SB"^2 _
                                      .o.
                  "(" -> ... {*}^3 || .#. @"SB"^3 _
                                      .o.
                  "(" -> ... {*}^4 || .#. @"SB"^4 _
                                      .o.
                  "(" -> ... {*}^5 || .#. @"SB"^5 _
                                      .o.
                  "(" -> ... {*}^6 || .#. @"SB"^6 _
                                      .o.
                  "(" -> ... {*}^7 || .#. @"SB"^7 _
                                      .o.
                 "(" -> ... {*}^8 || .#. @"SB"^8 _ ]' | hfst-regexp2fst -f $FORMAT > AllFeetFirst

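AllFeetFirst is a gradient constraint: a foot that is preceded by k syllable boundaries receives k violation marks. The following sketch counts those marks for an already footed candidate written without stress marks. The helper name all_feet_first_marks is hypothetical, and unlike the transducer above, which distinguishes at most eight boundaries, the sketch counts without a cap.

```shell
# all_feet_first_marks: read footed candidates on stdin, one per line,
# and print the total number of AllFeetFirst marks for each.
all_feet_first_marks() {
    awk '{
        total = 0; dots = 0
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == ".") dots++              # one more syllable boundary seen
            else if (c == "(") total += dots  # a foot starts after dots boundaries
        }
        print total
    }'
}

echo '(ka.las).te.(le.mi).nen' | all_feet_first_marks   # prints: 3
echo '(ka.las).(te.le).mi.nen' | all_feet_first_marks   # prints: 2
```

The first foot of a word always scores zero, so the constraint only penalizes non-initial feet, and more heavily the further they are from the left edge.
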
Evaluation:

# Computing the prosody for FinnWords

# Some constraints can always be satisfied; some constraints are
# violated many times. The limits have been chosen to produce
# a unique winner in all the 25 test cases in FinnWords.

echo '[@"FinnWords" .o. @"Gen"
       .o. @"MainStress"
       .o. @"Clash" .O. @"Viol0" .o. @"Pardon"
       .o. @"AlignLeft" .O. @"Viol0"
       .o. @"FootBin" .O. @"Viol0" .o. @"Pardon"
       .o. @"Lapse" .O. @"Viol3" .O. @"Viol2" .O. @"Viol1" .O. @"Viol0" .o. @"Pardon"
       .o. @"NonFinal" .O. @"Viol0" .o. @"Pardon"
       .o. @"StressToWeight" .O. @"Viol3" .O. @"Viol2" .O. @"Viol1" .O. @"Viol0" .o. @"Pardon"
       .o. @"Parse" .O. @"Viol3" .O. @"Viol2" .O. @"Viol1" .O. @"Viol0" .o. @"Pardon"
       .o. @"AllFeetFirst" .O. @"Viol15" .O. @"Viol14" .O. @"Viol13" .O.
           @"Viol12" .O. @"Viol11" .O. @"Viol10" .O. @"Viol9" .O. @"Viol8" .O. @"Viol7" .O.
           @"Viol6"  .O. @"Viol5"  .O. @"Viol4"  .O. @"Viol3" .O. @"Viol2" .O. @"Viol1" .O.
           @"Viol0" .o. @"Pardon"
      ]' | hfst-regexp2fst -f $FORMAT | hfst-project -p output | hfst-fst2strings

This final command produces the following output. The two errors indicate that there is a problem in Kiparsky's analysis.

# (ón.nit).(tè.le).(mà.ni).kin
# (ó.pis).(kè.li).ja
# (ó.pet).ta.(màs.sa)
# (íl.moit).(tàu.tu).mi.(sès.ta)
# (íl.moit).(tàu.tu).(mì.nen)
# (ér.go).(nò.mi).a
# (vói.mis).te.(lùt.te).le.(màs.ta)
# (strúk.tu).ra.(lìs.mi)
# (ré.pe).(ä̀.mä)
# (rá.vin).(tò.lat)
# (rá.kas).ta.(jàt.ta).ri.(àn.sa)
# (pú.he).li.(mìs.ta).ni
# (pú.he).li.(mèl.la).ni
# (pé.ri).jä
# (mä́.ki)
# (mér.ko).(nò.min)
# (má.te).ma.(tìik.ka)
# (kú.nin).gas
# (kái.nos).(tè.li).jat
# (ká.las).te.(lèm.me)
# (ká.las).te.(lè.mi).nen  <==== Error
# (ká.las).(tè.let)
# (jä́r.jes).tel.(mä̀l.li).syy.(dèl.lä).ni  <==== Error
# (jä́r.jes).(tèl.mät).tö.(mỳy.des).(tä̀n.sä)
# (jä́r.jes).(tèl.mäl).(lìs.tä).mä.(tö̀n.tä)


-- ErikAxelson - 2011-09-23