HFST: Demo Outline

The analyzer and generator demos on the internet are intended to demonstrate our ability to morphologically analyze and generate word forms in various languages. For some languages, the demo also includes a component for guessing the paradigms of unknown words.

Ideally, all language-related information is encoded in the transducers OmorXXAnalyser, OmorXXGuesser and OmorXXGenerator (where XX is the language code), which are run with the HFST software. However, the lexicographer may have created language-specific components for some other purpose that can be reused. In that case, the HAnalyse, HGuess and HGenerate methods may be implemented as language-dependent shell scripts that transform the existing output to the required format. Such shell scripts are better named e.g. OmorXXAnalyse, OmorXXGuess and OmorXXGenerate, but they are still expected to take the same input and produce output in the same format as the generic methods.

Below is a specification of the two demos depending on whether the language has a statistical paradigm guessing component.

No Guessing Component

Analysis

Input: kokeilla

Output:

base form   paradigm tags   analysis tags
koe         48-d            nominal, plural, adessive
kokeilla    67              verb, active, a-infinitive, singular, lative
kokeilla    67              verb, passive, indicative, present, 4th person, negative
kokki       5-a             nominal, plural, adessive

Method:

HAnalyse(Word,OmorXXAnalyser)

See: HAnalyse

Generation

Input: kokeilla

Output:

base form   paradigm tags   model forms
kokeilla    67              kokeilla, kokeilen, kokeili, kokeilisi, kokeillee, kokeilkoon, kokeillut, kokeiltiin

Method:

Output = []
for Analysis in HAnalyse(Word,OmorXXAnalyser):
    if Analysis.baseform == Word.baseform and [Analysis.baseform, Analysis.paradigm] not in Output:
        Output = Output + [[Analysis.baseform, Analysis.paradigm]]
        HGenerate(Analysis,OmorXXGenerator)

NOTE. A word form may have several analyses with an identical base form and paradigm combination; only one line of model forms is output for each such combination.

See: HGenerate
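
The deduplication in the method above can be sketched in runnable Python. This is only an illustration, not the HFST implementation: the `Analysis` record, the `unique_paradigms` helper and the hard-coded list of analyses are hypothetical stand-ins for the output of HAnalyse, and the call to HGenerate is left out.

```python
from collections import namedtuple

# Hypothetical record for one analysis; field names follow the pseudocode.
Analysis = namedtuple("Analysis", ["baseform", "paradigm", "tags"])

def unique_paradigms(word, analyses):
    """Return one [baseform, paradigm] pair per distinct combination,
    keeping only analyses whose base form equals the input word."""
    output = []
    for analysis in analyses:
        pair = [analysis.baseform, analysis.paradigm]
        if analysis.baseform == word and pair not in output:
            output.append(pair)
    return output

# Analyses copied from the "kokeilla" example (stand-in for HAnalyse output).
analyses = [
    Analysis("koe", "48-d", "nominal, plural, adessive"),
    Analysis("kokeilla", "67", "verb, active, a-infinitive, singular, lative"),
    Analysis("kokeilla", "67",
             "verb, passive, indicative, present, 4th person, negative"),
    Analysis("kokki", "5-a", "nominal, plural, adessive"),
]

print(unique_paradigms("kokeilla", analyses))  # [['kokeilla', '67']]
```

The two verb analyses share the base form and paradigm, so they collapse into a single pair, which is the pair that would be passed on to generation.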

With Guessing Component

Analysis

Input: xkokeilla

Output:

base form   paradigm tags   analysis tags
xkoe        48-d            nominal, plural, adessive
xkokeilla   67              verb, active, a-infinitive, singular, lative
xkokeilla   67              verb, passive, indicative, present, 4th person, negative
xkokki      5-a             nominal, plural, adessive
...  

Method:

Analyses = HAnalyse(Word,OmorXXAnalyser)
if not Analyses:
    Analyses = HGuess(Word,OmorXXGuesser)

See: HGuess, HAnalyse
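
The fallback from analysis to guessing can be sketched as follows. This is a minimal sketch under stated assumptions: `analyse_or_guess`, `toy_analyse` and `toy_guess` are hypothetical names, and the toy functions only imitate HAnalyse and HGuess with made-up data instead of running any transducer.

```python
def analyse_or_guess(word, analyse, guess):
    """Return the analyser's results, or the guesser's if there are none."""
    analyses = analyse(word)
    if not analyses:
        analyses = guess(word)
    return analyses

# Hypothetical stand-in for HAnalyse bound to an analyser transducer.
def toy_analyse(word):
    known = {"kokeilla": [("kokeilla", "67")]}
    return known.get(word, [])

# Hypothetical stand-in for HGuess; the guess is made up for illustration.
def toy_guess(word):
    return [(word, "67")]

print(analyse_or_guess("kokeilla", toy_analyse, toy_guess))
print(analyse_or_guess("xkokeilla", toy_analyse, toy_guess))
```

Passing the analyser and guesser in as functions keeps the fallback logic independent of how a particular language implements them, whether as transducers or as wrapper scripts.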

Generation

Input: xkokeilla

Output:

base form   paradigm tags   model forms
xkokeilla   67              xkokeilla, xkokeilen, xkokeili, xkokeilisi, xkokeillee, xkokeilkoon, xkokeillut, xkokeiltiin

Method:

Output = []
Analyses = HAnalyse(Word,OmorXXAnalyser)
if not Analyses:
    Analyses = HGuess(Word,OmorXXGuesser)
for Analysis in Analyses:
    if Analysis.baseform == Word.baseform and [Analysis.baseform, Analysis.paradigm] not in Output:
        Output = Output + [[Analysis.baseform, Analysis.paradigm]]
        HGenerate(Analysis,OmorXXGenerator)

NOTE. A word form may have several analyses with an identical base form and paradigm combination; only one line of model forms is output for each such combination.

See: HAnalyse, HGuess, HGenerate

On Equivalent Base Forms

The demo prints only one Output line for each unique base form and paradigm combination, but it is up to the lexicographer to decide which base forms are equivalent.

A lexicon may give several analyses for a word form, e.g. "talonpojan":

talon<wb>poika<noun><10><d><sg><gen>
talonpoika<noun><10><d><sg><gen>
talo<sg><gen>poika<noun><10><d><sg><gen>
talonpoika<noun><10><d><sg><acc>

If the lexicographer wishes these to be considered equivalent, the lexicon output should be unified in HAnalyse (or in that case preferably OmorXXAnalyse), outputting e.g.

<base>talonpoika</base> <par><10><d></par> <anl><noun><sg><gen></anl>
<base>talonpoika</base> <par><10><d></par> <anl><noun><sg><acc></anl>

The lines will be interpreted by HDemo as:

Analysis.baseform='talonpoika'  Analysis.paradigm='<10><d>'  Analysis.tags='<noun><sg><gen>'
Analysis.baseform='talonpoika'  Analysis.paradigm='<10><d>'  Analysis.tags='<noun><sg><acc>'

It is possible to indicate word and morpheme boundaries in HAnalyse with a '|' without affecting the interpretation of the analysis:

<base>talon|poika</base> <par><10><d></par> <anl><noun><sg><gen></anl>

This line will also be interpreted by HDemo as:

Analysis.baseform='talonpoika'  Analysis.paradigm='<10><d>'  Analysis.tags='<noun><sg><gen>'
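
The interpretation of these lines can be sketched as a small parser. This is an assumption about how such lines could be read, not the actual HDemo code: `Analysis`, `LINE_RE` and `parse_line` are hypothetical names introduced for illustration.

```python
import re
from collections import namedtuple

# Hypothetical record mirroring the three fields shown above.
Analysis = namedtuple("Analysis", ["baseform", "paradigm", "tags"])

# One unified line: <base>...</base> <par>...</par> <anl>...</anl>
LINE_RE = re.compile(
    r"<base>(?P<base>.*?)</base>\s*"
    r"<par>(?P<par>.*?)</par>\s*"
    r"<anl>(?P<anl>.*?)</anl>"
)

def parse_line(line):
    """Parse one unified analysis line; return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # Boundary markers '|' do not affect the interpreted base form.
    baseform = match.group("base").replace("|", "")
    return Analysis(baseform, match.group("par"), match.group("anl"))

line = "<base>talon|poika</base> <par><10><d></par> <anl><noun><sg><gen></anl>"
print(parse_line(line))
# Analysis(baseform='talonpoika', paradigm='<10><d>', tags='<noun><sg><gen>')
```

Because the '|' markers are removed before comparison, "talon|poika" and "talonpoika" yield the same base form and therefore count as the same base form and paradigm combination.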

Web Demo API Interface

The following code is used for calling the web interface:

#! /usr/bin/env python
# -*- coding: utf8 -*-

from omorcgidemo import *

def main():
    initialize_variables(
        analyser         = './OMorFiAnalyser',
        generator        = './OMorFiGenerator',
        guesser          = './OMorFiGuesser',
        language         = 'Finnish',
        lang_code        = 'fi',
        title            = 'Omorfi - Demo of Finnish Morphology',
        lexicon_source   = 'http://kaino.kotus.fi/sanat/nykysuomi/',
        lexicon_name     = 'Nykysuomen sanalista',
        script           = 'omorfi-cgi-demo.py')
    interact()

if __name__ == '__main__':
    main()


-- KristerLinden - 22 Apr 2008

Topic revision: r3 - 2008-05-29 - KristerLinden
 