Local experiences and advice on SFST

For more information on SFST, see the official pages by its author, Helmut Schmid, at http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html. For other developments related to SFST, see HFST at http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/ and the OMor pages at http://www.ling.helsinki.fi/kieliteknologia/tutkimus/omor.

Here you will find some additional advice, observations and experiences concerning SFST, particularly its use on the CSC servers.

Using SFST on the CSC Corpus machine

Making SFST and its documentation accessible

A copy of the official release of SFST (the programs fst-compiler, etc.) is in the directory /l/contrib/appl/ling/koskenni/bin. If you are using the Bash shell, you might add the following line to your ~/.bashrc file (which is usually sourced automatically from the ~/.bash_profile initialization file; that file might consist of the single line if [ -f ~/.bashrc ]; then source $HOME/.bashrc; fi):

   export PATH=/l/contrib/appl/ling/koskenni/bin:$PATH
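If ~/.bashrc is sourced more than once, the plain export above adds the directory to PATH again each time. The following is a minimal sketch of one way to avoid that and to check that the programs are then found. The helper name prepend_path is hypothetical and not part of SFST or Bash.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;      # otherwise put it first
    esac
}

prepend_path /l/contrib/appl/ling/koskenni/bin
export PATH

# Report whether the shell can now locate fst-compiler.
if command -v fst-compiler >/dev/null 2>&1; then
    echo "fst-compiler found at: $(command -v fst-compiler)"
else
    echo "fst-compiler not found on PATH" >&2
fi
```

The check only tells you whether a program with that name is reachable; it does not verify which SFST version it belongs to.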

The source code and documentation are in the directory /l/contrib/appl/ling/koskenni/sfst/SFST-1.2/SFST for local viewing.

The SFST manual pages can be made accessible to the man program by adding the following line to your ~/.bashrc file:

  export MANPATH=/l/contrib/appl/ling/koskenni/man:$MANPATH

Correspondences and differences between SFST and XFST formalisms

The fst-compiler corresponds to XFST, which is documented at http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/fst-97/xfst97.html. SFST has no direct counterparts to LEXC or TWOLC (see, however, the HFST effort mentioned above). SFST has a regular expression formalism of its own, which differs from the Xerox formalism in several respects:

  • The fst-compiler has no stack mechanism like the XFST. (The stack is not a necessary device at all.)
  • The alphabet must be explicitly declared in the fst-compiler whereas the XFST does it automatically by accumulating the alphabet while compiling the regular expressions.

Other points to be noticed:

  • fst-compiler handles neither eight-bit characters (such as those of Latin-1) nor UTF-8 encoded Unicode characters. If you need characters beyond 7-bit ASCII, you might be better off using fst-compiler-utf8 with UTF-8 encoded characters. (Is there a way to use Latin-1 directly?)
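
To decide which of the two compilers a given source file needs, it can help to check whether the file contains any bytes outside 7-bit ASCII. The sketch below does this with standard tools only. The function names count_non_ascii and suggest_compiler are hypothetical, and the script assumes that any non-ASCII content is already UTF-8 encoded; it cannot tell Latin-1 from UTF-8.

```shell
# Count the bytes in a file whose value lies outside 7-bit ASCII (128-255).
count_non_ascii() {
    # Delete every byte in the range 0-127 and count what is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Print the name of the compiler that suits the file's contents.
suggest_compiler() {
    if [ "$(count_non_ascii "$1")" -eq 0 ]; then
        echo fst-compiler
    else
        echo fst-compiler-utf8   # assumes the file is UTF-8 encoded
    fi
}
```

For example, suggest_compiler mygrammar.fst (a hypothetical file name) prints the program name to use. A Latin-1 file could presumably be converted first with iconv -f ISO-8859-1 -t UTF-8.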

| feature | SFST | XFST |
| term used for the input string of transducers (where the first character 'a' of a pair 'a:b' belongs) | deep or analysis string/form (which is thought to be below the surface) | upper string (the underlying morphophonemic form is supposed to be higher up) |
| term used for the output string of transducers (where the second character 'b' of a pair 'a:b' belongs) | surface string/form (which is thought to be above the deep form) | lower string (the linguistic surface form is supposed to be at the bottom) |
| term used when a transducer T relates a string A to string B | B is mapped to A (SFST has the orientation of analysis of surface strings into their lexical representations) | T maps A to B (for linguists, the direction is always from the morphophonological form towards the surface form) |

-- KimmoKoskenniemi - 09 Apr 2008

Topic revision: r3 - 2008-04-10 - KimmoKoskenniemi