Difference: OMorFiHfstString2fst (2 vs. 3)

Revision 3 - 2008-05-19 - MiikkaSilfverberg

Line: 1 to 1
 
META TOPICPARENT name="OMorFiHome"

OMorFi: hfst-string2fst

Added:
>
>
 
Changed:
<
<

Purpose

Compile a string symbols separated by white-space into a transducer.

Usage

>
>

Usage

 
Changed:
<
<
hfst-string2fst [ -weight=INTEGER ] -input="..." -out=FILE_NAME
>
>
hfst-string2fst [ -help ] [ -weight=INTEGER ] -alphabet=FILE_NAME -input="..." -out=FILE_NAME
 
Changed:
<
<
E.g.
hfst-string2fst -weight=1 -input="k a ~N:m p:m a n" -out=transducer
is a typical way of using the transducer. It will produce a file transducer, which contains an SFST or Open-Fst style transducer having the state-transition diagram:
>
>

Purpose

Compile a string of symbols into a transducer.

Parameters

Name                 Intention
-help                Display a help-message and exit.
-weight=INTEGER      The weight INTEGER is added to the last transition of the resulting transducer.
-alphabet=FILE_NAME  The file FILE_NAME contains the alphabet of the resulting transducer.
-input="..."         The input-string.
-out=FILE_NAME       FILE_NAME is the name of the file to which the resulting transducer is written.

The weight-parameter has no effect if the program is built using the library hsfst (see Installing the program, below).

The alphabet-file

The alphabet-file consists of two blocks of lines. The first block contains lines consisting of one symbol and the second block contains lines consisting of two symbols separated by a :.

A valid symbol is:

  • A utf-8 character, except the characters :, <, > and any white-space-character.
  • A multi-character-symbol consisting of utf-8 characters (except newlines) enclosed in angle-brackets <...>. The characters < and > have to be escaped by a \ inside multi-character-symbols. Hence e.g. < \<-- > is a valid symbol, but < <-- > isn't.

The special symbols 0 and <> signify the empty character. If you'd like to use 0 in another meaning, enclose it in angle brackets: <0>.

The : and any white-space-character except the newline can be used as part of a symbol by enclosing it in angle-brackets. E.g. < > and < a symbol> are valid symbols.

Any symbol following a \ inside angle brackets is considered to have its literal meaning. If you'd like to use the \ as a part of a symbol you can escape it: \\.

Empty lines in the alphabet-file are discarded.

RULE OF THUMB: Any special character, used as a literal, has to be inside angle brackets. Any character inside angle brackets may be escaped by a \. No character outside angle brackets can be escaped.
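
The symbol rules above can be sketched in Python. This is only an illustration of the rules as written on this page, not the code used by hfst-string2fst; the function name read_symbol and the use of an empty string for the empty character are assumptions made for the example.

```python
EPSILON = ""  # assumption: the empty character is represented as an empty string


def read_symbol(text, pos=0):
    """Read one symbol starting at text[pos]; return (symbol_name, next_position)."""
    ch = text[pos]
    if ch != "<":
        # Outside angle brackets nothing can be escaped, and the special
        # characters :, > and white-space are not valid symbols.
        if ch in ":>" or ch.isspace():
            raise ValueError(f"special character {ch!r} must be inside angle brackets")
        # A bare 0 denotes the empty character.
        return (EPSILON if ch == "0" else ch), pos + 1

    name = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A character following a backslash has its literal meaning.
            if i + 1 >= len(text) or text[i + 1] == "\n":
                raise ValueError("dangling backslash inside angle brackets")
            name.append(text[i + 1])
            i += 2
        elif ch == ">":
            # The brackets only delimit the symbol; they are not part of its name.
            return "".join(name), i + 1
        elif ch == "<" or ch == "\n":
            raise ValueError(f"unescaped {ch!r} inside angle brackets")
        else:
            name.append(ch)
            i += 1
    raise ValueError("missing closing angle bracket")


if __name__ == "__main__":
    print(read_symbol("k"))               # ('k', 1)
    print(read_symbol("<ArchN>:m"))       # ('ArchN', 7)
    print(read_symbol(r"<\<ArchN\>>"))    # ('<ArchN>', 11)
    print(read_symbol("<0>"))             # ('0', 3)
```

Note that <> and a bare 0 both yield the empty character in this sketch, while <0> yields the literal symbol 0.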

The following would make a valid alphabet-file:

 
Changed:
<
<
0 k 1
1 a 2
2 ~N:m 3
3 p:m 4
4 a 5
5 n 6
final 6
>
>
k
m
n
p
<ArchN>
k:k
n:n
p:p
:
<ArchN>:m
 
Changed:
<
<
If the program has been compiled using the HFST-library hofst, an additional file symbol_table will be written and the final state of the transducer will have weight 1.
>
>
The symbols in the alphabet are read in order, so in the example above the symbol k is coded as the first character of the alphabet, m as the second, and so on. The symbols in the above alphabet are k, m, n, p and ArchN. There are no other symbols in the alphabet.
 
Changed:
<
<
If the library hsfst is used to compile the program, no weight will be added to the transducer (since SFST doesn't support weights). No file symbol_table will be written either.
>
>
The pairs in the alphabet above are k:k, n:n, p:p, : and ArchN:m. There are no other pairs in the alphabet. E.g. m:m is not a pair in the alphabet, since it hasn't been declared.
 
Changed:
<
<

Notes

>
>
Note that the angle brackets serve only to distinguish a multi-character-symbol from its surroundings (and to declare symbol-names that include special characters). E.g. the symbol declared <ArchN> in the above alphabet corresponds to the symbol ArchN in the alphabet of the resulting transducer. If you'd like the alphabet of the transducer to include the symbol <ArchN>, you have to declare the symbol <\<ArchN\>> in the alphabet-file.
 
Changed:
<
<
The input has to be a string of ASCII or UTF-8 characters. Any sequence of characters not containing spaces, tabs, newlines or colons is considered to make a symbol in the alphabet of the transducer, which is being constructed (e.g. p). A colon separates the upper and lower characters of a pair.
>
>

The input-string

 
Changed:
<
<
A string of characters not including spaces, newlines, tabs or colons separated from its context by spaces, newlines or tabs (e.g. ~N) denotes a pair with equal deep-character and surface-character (here ~N:~N).
>
>
The input consists of symbols (defined above) and pairs of two symbols separated by a :. White-space not enclosed in angle brackets is discarded. A lonely symbol (one not followed by a :) is regarded as a pair whose input- and output-characters are identical.
 
Changed:
<
<
The input-string should be delimited by " characters, but shouldn't contain such (not even escaped ones). So the input-strings "a " e " b" and "a \" e \" b" are both illegal, but "a e b" is fine.
>
>
Every character-pair used in the input for the program should be declared in the alphabet. If a pair isn't declared, a warning is issued, but the transducer is still created. Note that this may cause problems when you try to use the transducer!
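
As a rough illustration of how an input-string breaks down into pairs, and of the check against the declared pairs, here is a small Python sketch. It is not the program's own code: the names parse_input and undeclared_pairs are invented for the example, and it assumes that white-space around a : is discarded like any other white-space outside angle brackets.

```python
import re

# One symbol: either <...> with backslash escapes, or a single plain character.
SYMBOL = re.compile(r"<((?:\\.|[^\\<>\n])*)>|([^\s:<>])")


def _symbol(text, pos):
    """Read one symbol at text[pos]; return (symbol_name, next_position)."""
    m = SYMBOL.match(text, pos)
    if m is None:
        raise ValueError(f"no valid symbol at position {pos}")
    if m.group(2) is not None:
        # A bare 0 denotes the empty character (represented here as "").
        return ("" if m.group(2) == "0" else m.group(2)), m.end()
    # Drop the brackets and resolve backslash escapes.
    return re.sub(r"\\(.)", r"\1", m.group(1)), m.end()


def _skip_space(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_input(text):
    """Split an input-string into a list of (input, output) symbol pairs."""
    pairs = []
    pos = _skip_space(text, 0)
    while pos < len(text):
        upper, pos = _symbol(text, pos)
        pos = _skip_space(text, pos)
        if pos < len(text) and text[pos] == ":":
            pos = _skip_space(text, pos + 1)
            lower, pos = _symbol(text, pos)
            pos = _skip_space(text, pos)
        else:
            lower = upper  # a lonely symbol is an identity pair
        pairs.append((upper, lower))
    return pairs


def undeclared_pairs(pairs, declared):
    """Return the pairs that would trigger a warning."""
    return [pair for pair in pairs if pair not in declared]


if __name__ == "__main__":
    # Some of the pairs declared in the example alphabet on this page.
    declared = {("k", "k"), ("n", "n"), ("p", "p"), ("ArchN", "m")}
    pairs = parse_input("k<ArchN>:mpn")
    print(pairs)
    print(undeclared_pairs(pairs, declared))
```

With these declared pairs, the input "m" would be reported as undeclared, since m:m isn't a pair in the alphabet.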
 
Changed:
<
<

Getting the program

>
>
One valid input-string for the alphabet given above is:
k<ArchN>:mpn

Output

An OpenFst or SFST transducer will be stored in the output-file given to the program by the parameter -out. Which kind of transducer is created depends on the library that is used to build the program (see Installing the program, below).

If an OpenFst transducer is created, a file called symbol_table will also be written. The file contains the alphabet of the transducer. If your working-directory already contains a file called symbol_table, the file will be over-written! This means that the behaviour of OpenFst transducers in the same directory will change unless they've got the same alphabet as the new transducer you've created.
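
One way to guard against losing an existing symbol_table is to copy it aside before running the program. The Python sketch below is only a suggestion; the helper name backup_symbol_table and the .bak suffix are made up for this example and are not part of hfst-string2fst.

```python
import shutil
from pathlib import Path


def backup_symbol_table(workdir=".", suffix=".bak"):
    """Copy an existing symbol_table aside; return the backup path, or None."""
    table = Path(workdir) / "symbol_table"
    if not table.is_file():
        return None  # nothing to protect
    backup = table.with_name(table.name + suffix)
    shutil.copy2(table, backup)  # the original file is left in place
    return backup


if __name__ == "__main__":
    saved = backup_symbol_table()
    if saved is not None:
        print(f"existing symbol_table copied to {saved}")
```

Running the program in a separate directory for each alphabet avoids the problem altogether.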

 
Changed:
<
<
The program is in the CVS-repository on corpus in the directory
>
>

Example

Let sigma be a file containing the alphabet discussed above. Now the command-line

hfst-string2fst -weight=1 -alphabet=sigma -input="k<ArchN>:mpn" -out=transducer
will create a transducer with the transition-network
 
Changed:
<
<
/c/appl/ling/koskenni/cvsrepo/hfst-tools/
>
>
0 1 k k
1 2
2 3 ArchN m
3 4 p p
4 5
5 6 n n 1
6
 
Added:
>
>
Here every row corresponds to a transition. There are five columns:

  • The first column contains the state before reading a pair.
  • The second column contains the state after reading the pair.
  • The third column contains the input-character.
  • The fourth column contains the output-character.
  • The fifth column contains the weight of the transition.
 
Changed:
<
<

Building the program

>
>
Rows containing a single number signify that the state corresponding to that number is a final state. If a line contains two numbers, the first is the number of a final state and the second is the final weight of that state.
 
Changed:
<
<
The program is distributed with a Makefile, which you might have to change a bit.
>
>
E.g. the fifth row in the network above codes a transition from the fifth state to the sixth with the pair n:n and weight 1.
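
The row format described above can be reproduced with a few lines of Python. This is a sketch for a hypothetical three-pair input, assuming (as the -weight parameter is described) that the weight appears only on the last transition; the function name transition_rows is invented for the example.

```python
def transition_rows(pairs, weight=None):
    """Render a linear transducer as rows: state, next state, input, output[, weight]."""
    rows = []
    last = len(pairs) - 1
    for state, (upper, lower) in enumerate(pairs):
        row = [str(state), str(state + 1), upper, lower]
        if weight is not None and state == last:
            row.append(str(weight))  # weight only on the last transition
        rows.append(" ".join(row))
    # A row with a single number marks the final state.
    rows.append(str(len(pairs)))
    return rows


if __name__ == "__main__":
    # Hypothetical input with the pairs k:k, ArchN:m and p:p.
    for row in transition_rows([("k", "k"), ("ArchN", "m"), ("p", "p")], weight=1):
        print(row)
```

For the three pairs in the example this gives the rows 0 1 k k, 1 2 ArchN m and 2 3 p p 1, followed by the final state 3.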
 
Changed:
<
<
The line
>
>
The resulting transducer is stored in the file transducer. If the file exists already, it will be over-written.

Getting the program

The program is in the cvs-repository on corpus in the directory hfst-tools.

Installing the program

The program is distributed with a Makefile. Basically you just need to run make, but you might have to make small adjustments to the file.

You need to edit the line:

 
HFSTPATH=../hfst
Changed:
<
<
should be changed depending on, where you've got HFST installed.

If you want to build using the library hsfst, instead of the library hofst you should comment the line

>
>
to correspond to the path where you've installed HFST. If you'd like to build using the library hofst you don't need to change anything else. If you'd like to build using the library hsfst, uncomment the lines:
 
Changed:
<
<
INCLUDES=-I$(HFSTPATH)
>
>
#LIBS=-static -L$(HFSTPATH) -lhsfst
 
Changed:
<
<
uncomment the line
>
>
and
 
Changed:
<
<
#INCLUDES=-I$(HFSTPATH)/ -I$(SFST_INCLUDE_PATH)/
>
>
#INCLUDES=-I$(HFSTPATH) -I$(HFSTPATH)/sfst
 
Changed:
<
<
comment the line
>
>
and comment the lines:
 
Changed:
<
<
LIBS=-static -L$(HFSTPATH) -l$(OPEN_FST_LIB) -lpthread -lm -ldl
>
>
LIBS=-static -L$(HFSTPATH) -lhofst -lpthread -lm -ldl
 
Changed:
<
<
and uncomment the line
>
>
and
 
Changed:
<
<
#LIBS=-static -L$(HFSTPATH) -l$(SFST_LIB)
>
>
INCLUDES=-I$(HFSTPATH)
 
Deleted:
<
<

 
 