OMorFi: hfst-string2fst

Usage

hfst-string2fst [ -help ] [ -weight=INTEGER ] -alphabet=FILE_NAME -input="..." -out=FILE_NAME

Purpose

Compile a string of symbols into a transducer.

Parameters

Name                 Meaning
-help                Display a help message and exit.
-weight=INTEGER      Add the weight INTEGER to the last transition of the resulting transducer.
-alphabet=FILE_NAME  Read the alphabet of the resulting transducer from the file FILE_NAME.
-input="..."         The input-string.
-out=FILE_NAME       Write the resulting transducer to the file FILE_NAME.

The weight-parameter has no effect if the program is built using the library hsfst (see Installing the program, below).

The alphabet-file

The alphabet-file consists of two blocks of lines. The first block contains lines consisting of one symbol and the second block contains lines consisting of two symbols separated by a :.

A valid symbol is:

  • A utf-8 character, except the characters :, <, >, \ and any white-space-character.
  • A multi-character-symbol consisting of utf-8 characters (except newlines) enclosed in angle-brackets <...>. The characters < and > have to be escaped by a \ inside multi-character-symbols. Hence e.g. < \<-- > is a valid symbol, but < <-- > isn't.

The special symbols 0 and <> signify the empty character. If you'd like to use 0 as an ordinary symbol, enclose it in angle brackets: <0>.

The : and any white-space-character except the newline can be used as part of a symbol by enclosing it in angle brackets. E.g. < > and < a symbol> are valid symbols.

Any character following a \ inside angle brackets is considered to have its literal meaning. If you'd like to use the \ as a part of a symbol, you can escape it: \\. The escape-character can only occur inside a multi-character-symbol, so the way to declare the symbol \ in the alphabet is <\\>.

Empty lines in the alphabet-file are discarded.

RULE OF THUMB: Any special character, used as a literal, has to be inside angle brackets. Any character inside angle brackets may be escaped by a \. No character outside angle brackets can be escaped.
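The symbol rules above can be sketched in code. The following is a minimal illustration, not the reader used by hfst-string2fst: the function name read_symbol and the choice to represent the empty character as an empty string are assumptions made for this example.

```python
EPSILON = ""            # assumption: the empty character is represented as ""
SPECIAL = set(":<>\\")  # characters that need angle brackets to be used literally


def read_symbol(text, pos=0):
    """Read one symbol starting at text[pos]; return (symbol_name, next_position)."""
    ch = text[pos]
    if ch == "<":
        name = []
        i = pos + 1
        while i < len(text):
            c = text[i]
            if c == "\\":
                if i + 1 >= len(text):
                    raise ValueError("nothing follows the escape character")
                name.append(text[i + 1])  # the escaped character is taken literally
                i += 2
            elif c == ">":
                return "".join(name), i + 1
            elif c == "<" or c == "\n":
                raise ValueError(f"unescaped {c!r} inside a multi-character symbol")
            else:
                name.append(c)
                i += 1
        raise ValueError("missing closing '>'")
    if ch in SPECIAL or ch.isspace():
        raise ValueError(f"{ch!r} must be enclosed in angle brackets")
    if ch == "0":
        return EPSILON, pos + 1  # a bare 0 signifies the empty character
    return ch, pos + 1


print(read_symbol("<ArchN>:m"))       # ('ArchN', 7)
print(read_symbol(r"<\<ArchN\>>"))    # ('<ArchN>', 11)
```

The angle brackets never become part of the symbol name unless they are escaped, which is why <0> yields the ordinary symbol 0 while a bare 0 yields the empty character.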

The following would make a valid alphabet-file:

k
m
n
p

<ArchN>
k:k
n:n
p:p
:
<ArchN>:m

The symbols in the alphabet are read in order, so in the example above the symbol k is coded as the first character of the alphabet, m as the second, and so on. The symbols in the above alphabet are k, m, n, p and ArchN. There are no other symbols in the alphabet.

The pairs in the alphabet above are k:k, n:n, p:p, : and ArchN:m. There are no other pairs in the alphabet. E.g. m:m is not a pair in the alphabet, since it hasn't been declared.
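The two-block layout of the alphabet-file can be illustrated with a short sketch. It is a simplification under stated assumptions: the helper names split_top_level_colon and read_alphabet are hypothetical, symbols are kept in their file notation (angle brackets and escapes are not resolved), and the sample lines contain only the unambiguous entries of the example above.

```python
def split_top_level_colon(line):
    """Return (left, right) if the line has a ':' outside angle brackets, else None."""
    inside = False
    i = 0
    while i < len(line):
        c = line[i]
        if inside and c == "\\":
            i += 2                # skip the escape and the escaped character
            continue
        if c == "<":
            inside = True
        elif c == ">":
            inside = False
        elif c == ":" and not inside:
            return line[:i], line[i + 1:]
        i += 1
    return None


def read_alphabet(lines):
    """Split alphabet-file lines into a symbol block and a pair block."""
    symbols, pairs = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue              # empty lines are discarded
        parts = split_top_level_colon(line)
        if parts is None:
            if pairs:
                raise ValueError("symbol line found after the pair block started")
            symbols.append(line)
        else:
            pairs.append(parts)
    return symbols, pairs


sample = ["k", "m", "n", "p", "", "<ArchN>", "k:k", "n:n", "p:p", "<ArchN>:m"]
symbols, pairs = read_alphabet(sample)
print(symbols)
print(("m", "m") in pairs)   # False: m:m has not been declared
```

Keeping the declaration order of the symbols matters, because the order determines how each symbol is coded.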

Note that the angle brackets serve only to set a multi-character-symbol apart from its surroundings (and to declare symbol-names that include special characters). E.g. the symbol declared as <ArchN> in the above alphabet corresponds to the symbol ArchN in the alphabet of the resulting transducer. If you'd like the alphabet of the transducer to include the symbol <ArchN>, you have to declare the symbol <\<ArchN\>> in the alphabet-file.

The input-string

The input consists of symbols (defined above) and pairs of two symbols separated by a :. White-space that is not enclosed in angle brackets is discarded. A lone symbol (one not followed by a :) is regarded as a pair whose input- and output-characters are identical.

Any character-pair used in the input for the program should be declared in the alphabet. If this isn't the case, a warning will be issued, but the transducer is still created. Note that this may cause problems when you're trying to use the transducer!
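How an input-string breaks down into pairs, and how undeclared pairs could be detected, can be sketched as follows. This is an illustration only, not the program's parser: the regular expression, the function names and the DECLARED set (the pairs k:k, n:n, p:p and ArchN:m from the example alphabet) are assumptions of this example.

```python
import re

# One token: a <...> symbol (with \ escapes), or any single character that is
# not white-space, an angle bracket or a backslash. A bare ":" is also a token.
TOKEN = re.compile(r"\s*(<(?:\\.|[^<>\\\n])*>|[^\s<>\\])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot read a symbol at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def decode(token):
    """Turn file notation into a symbol name ('' stands for the empty character)."""
    if token.startswith("<"):
        return re.sub(r"\\(.)", r"\1", token[1:-1])
    return "" if token == "0" else token


def parse_pairs(text):
    tokens = tokenize(text)
    pairs, i = [], 0
    while i < len(tokens):
        if tokens[i] == ":":
            raise ValueError("a pair must start with a symbol")
        upper = decode(tokens[i])
        if i + 1 < len(tokens) and tokens[i + 1] == ":":
            if i + 2 >= len(tokens) or tokens[i + 2] == ":":
                raise ValueError("a ':' must be followed by a symbol")
            pairs.append((upper, decode(tokens[i + 2])))
            i += 3
        else:
            pairs.append((upper, upper))  # a lone symbol is an identity pair
            i += 1
    return pairs


def undeclared(pairs, declared):
    return [pair for pair in pairs if pair not in declared]


DECLARED = {("k", "k"), ("n", "n"), ("p", "p"), ("ArchN", "m")}
print(parse_pairs("k<ArchN>:mpn"))
print(undeclared(parse_pairs("km"), DECLARED))   # [('m', 'm')]
```

The second call reports m:m, because a lone m is read as the pair m:m and that pair has not been declared in the example alphabet.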

One valid input-string for the alphabet given above is:

k<ArchN>:mpn

Output

An OpenFst or SFST transducer will be stored in the output-file given to the program by the parameter -out. Which kind of transducer is created depends on the library that is used to build the program (see Installing the program, below).

If an OpenFst transducer is created, a file called symbol_table will also be written. The file contains the alphabet of the transducer. If your working-directory already contains a file called symbol_table, the file will be overwritten! This means that the behaviour of OpenFst transducers in the same directory will change, unless they have the same alphabet as the new transducer you've created.
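One way to guard against losing an existing symbol_table is to copy it aside before running the program. The sketch below is only a suggestion: the function name backup_symbol_table and the .bak suffix are assumptions of this example, and the program itself offers no such option as far as this page describes.

```python
import shutil
from pathlib import Path


def backup_symbol_table(directory=".", suffix=".bak"):
    """Copy directory/symbol_table to symbol_table<suffix>; return the copy's path.

    Returns None when there is no symbol_table to protect.
    """
    table = Path(directory) / "symbol_table"
    if not table.is_file():
        return None
    backup = table.with_name(table.name + suffix)
    shutil.copy2(table, backup)   # keeps the contents and the timestamps
    return backup


saved = backup_symbol_table()
if saved is not None:
    print(f"existing symbol_table copied to {saved}")
```

Running the transducers that depend on different alphabets in separate directories avoids the problem altogether.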

Example

Let sigma be a file containing the alphabet discussed above. Now the command-line

hfst-string2fst -weight=1 -alphabet=sigma -input="k<ArchN>:mpn" -out=transducer
will create a transducer with the transition-network
0  1  k     k
1  2       
2  3  ArchN m
3  4  p     p
4  5       
5  6  n     n  1
6
Here every row corresponds to a transition. There are five columns:

  • The first column contains the state before reading a pair.
  • The second column contains the state after reading the pair.
  • The third column contains the input-character.
  • The fourth column contains the output-character.
  • The fifth column contains the weight of the transition.

Rows containing a single number signify that the state corresponding to that number is a final state. If a line contains two numbers, the first is the number of a final state and the second is the final weight of that state.

E.g. the sixth row in the network above codes a transition from state 5 to state 6 with the pair n:n and weight 1.
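The row format can be made concrete with a small sketch that writes a chain of pairs in this notation and reads it back. The function names chain_rows and parse_rows are hypothetical, the pair list is a shortened sample rather than the output of the program, and storing weights as floating-point numbers is an assumption of this example.

```python
def chain_rows(pairs, weight=None):
    """Lay out pairs as a linear chain: state i --pair--> state i + 1."""
    rows = []
    last = len(pairs) - 1
    for state, (upper, lower) in enumerate(pairs):
        row = [str(state), str(state + 1), upper, lower]
        if weight is not None and state == last:
            row.append(str(weight))    # the weight goes on the last transition
        rows.append(" ".join(row))
    rows.append(str(len(pairs)))       # the last state is the final state
    return rows


def parse_rows(rows):
    """Return (transitions, finals) read from rows in the five-column notation."""
    transitions, finals = [], {}
    for row in rows:
        fields = row.split()
        if not fields:
            continue
        if len(fields) in (4, 5):
            source, target, upper, lower = fields[:4]
            weight = float(fields[4]) if len(fields) == 5 else None
            transitions.append((int(source), int(target), upper, lower, weight))
        elif len(fields) in (1, 2):
            finals[int(fields[0])] = float(fields[1]) if len(fields) == 2 else None
        else:
            raise ValueError(f"unexpected row: {row!r}")
    return transitions, finals


rows = chain_rows([("k", "k"), ("ArchN", "m"), ("n", "n")], weight=1)
print("\n".join(rows))
print(parse_rows(rows))
```

Note that this reader relies on white-space splitting, so it cannot represent a transition whose symbols are printed as blanks.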

The resulting transducer is stored in the file transducer. If the file exists already, it will be overwritten.

Getting the program

The program is in the cvs-repository on corpus in the directory hfst-tools.

Installing the program

The program is distributed with a Makefile. Basically you just need to run make, but you might have to make small adjustments to the Makefile. It's best to have the latest version of HFST installed.

You need to edit the line:

HFSTPATH=../hfst
to correspond to the path where you've installed HFST. If you'd like to build using the library hofst, you don't need to change anything else. If you'd like to build using the library hsfst, uncomment the lines:
#LIBS=-static -L$(HFSTPATH) -lhsfst
and
#INCLUDES=-I$(HFSTPATH) -I$(HFSTPATH)/sfst
and comment out the lines:
LIBS=-static -L$(HFSTPATH) -lhofst -lpthread -lm -ldl
and
INCLUDES=-I$(HFSTPATH)
-- MiikkaSilfverberg - 14 May 2008
Topic revision: r4 - 2008-05-20 - MiikkaSilfverberg
 