HFST: String and text format issues

The epsilon symbol

| functionality | default symbol | tools | note |
| print strings | the empty string | fst2strings, lookup |  |
| print or read AT&T format | @0@ | fst2txt, txt2fst |  |
| convert strings into transducers | @0@ | strings2fst | Must be included in the tokenizer if spaces are not used. |
| Xerox strings | 0 | hfst-lexc | Escaped 0 is %0 |
| Xerox regular expressions | 0 | hfst-lexc, hfst-twolc, hfst-xfst, hfst-regexp2fst | Escaped 0 is %0 |

The default epsilon symbol can be changed with the option -e.
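
As a rough illustration of the conventions listed above, the following Python sketch prints a symbol sequence the way the string-printing tools are described: epsilon becomes the empty string. The function names are hypothetical and not part of any HFST API. The `epsilon` parameter only loosely mirrors the idea behind the -e option.

```python
# Hypothetical helpers for illustration only; not part of HFST.
EPSILON = "@0@"  # epsilon as written in the AT&T text format


def path_to_string(symbols, epsilon=EPSILON):
    """Concatenate symbols, printing epsilon as the empty string."""
    return "".join("" if s == epsilon else s for s in symbols)


def xerox_escape_zero(symbol):
    """In Xerox notation a bare 0 denotes epsilon, so a literal zero is written %0."""
    return "%0" if symbol == "0" else symbol


print(path_to_string(["c", "a", "t", "@0@", "s"]))  # cats
print(path_to_string(["c", "a", "t", "EPS"], epsilon="EPS"))  # cat
```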

Other special symbols

| symbol | escaped form | tools | note |
| space (U+0020, SP) | @_SPACE_@ | fst2strings, lookup, strings2fst, fst2txt, txt2fst |  |
| colon (U+003A) | @_COLON_@ | fst2strings, lookup, strings2fst |  |
| horizontal tab (U+0009, HT) | @_TAB_@ | fst2strings, lookup, strings2fst, fst2txt, txt2fst |  |

The table below shows how these special symbols are handled in the tools.

| tool | how to handle the symbols |
| fst2strings | Replace strings when printing. |
| lookup | Replace strings when printing. Include in the tokenizer if spaces are not used. Replace after tokenization. |
| strings2fst | Include in the tokenizer if spaces are not used. Replace after tokenization. |
| fst2txt | Implemented in the HFST library. |
| txt2fst | Implemented in the HFST library. |
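
The "include in the tokenizer, then replace after tokenization" handling can be sketched in Python as follows. This is a minimal sketch, not the HFST implementation. The helper names are hypothetical, and the direction of replacement (real character to escaped form when printing, escaped form to real character when reading) is an assumption based on the tables above.

```python
# Hypothetical helpers for illustration only; not part of HFST.
ESCAPES = {" ": "@_SPACE_@", ":": "@_COLON_@", "\t": "@_TAB_@"}
UNESCAPES = {escaped: char for char, escaped in ESCAPES.items()}

# Multi-character symbols the tokenizer must recognise as single tokens
# when the input is not space-separated (the epsilon @0@ is one of them).
TOKENIZER_SYMBOLS = list(UNESCAPES) + ["@0@"]


def escape_symbol(symbol):
    """Replace a special character with its escaped form when printing."""
    return ESCAPES.get(symbol, symbol)


def unescape_symbol(symbol):
    """Replace an escaped form with the real character after tokenization."""
    return UNESCAPES.get(symbol, symbol)


def tokenize(text):
    """Split text into symbols, keeping known multi-character symbols whole."""
    tokens = []
    i = 0
    while i < len(text):
        for special in TOKENIZER_SYMBOLS:
            if text.startswith(special, i):
                tokens.append(special)
                i += len(special)
                break
        else:
            # No special symbol starts here: take a single character.
            tokens.append(text[i])
            i += 1
    return tokens


tokens = tokenize("a@_SPACE_@b")
print(tokens)                                # ['a', '@_SPACE_@', 'b']
print([unescape_symbol(t) for t in tokens])  # ['a', ' ', 'b']
```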

String formats

The options in the table below apply to the tools hfst-fst2strings and hfst-strings2fst.

| format | option | example |
| string pair | default | cat:dog |
| string pair with spaces | -S | c a t : d o g |
| pair-string | -P | c:da:ot:g |
| pair-string with spaces | -P -S | c:d a:o t:g |
| Xerox print-pairs | -X print-pairs | = |
| Xerox spaced text |  | c a t (first line), d o g (second line) |
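
To make the difference between the first four formats concrete, here is a small Python sketch that renders the same symbol pairs in each of them. The function names are hypothetical, and the sketch assumes that the upper and lower sides have the same number of symbols.

```python
# Hypothetical helpers for illustration only; not part of HFST.
def string_pair(upper, lower, spaces=False):
    """Render as 'upper:lower' (default) or with spaces between symbols (-S)."""
    if spaces:
        return " ".join(upper) + " : " + " ".join(lower)
    return "".join(upper) + ":" + "".join(lower)


def pair_string(upper, lower, spaces=False):
    """Render as a sequence of symbol pairs (-P), optionally space-separated (-P -S)."""
    if len(upper) != len(lower):
        raise ValueError("this sketch assumes equally long symbol sequences")
    pairs = [u + ":" + l for u, l in zip(upper, lower)]
    return (" " if spaces else "").join(pairs)


upper, lower = list("cat"), list("dog")
print(string_pair(upper, lower))               # cat:dog
print(string_pair(upper, lower, spaces=True))  # c a t : d o g
print(pair_string(upper, lower))               # c:da:ot:g
print(pair_string(upper, lower, spaces=True))  # c:d a:o t:g
```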

Text formats

| format | option | syntax |
| AT&T | default | Tab-separated list of transitions (5 fields) and final states (2 fields). |
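
A reader for this layout might look like the Python sketch below. It is not the HFST parser. The field order (source state, target state, input symbol, output symbol, weight for transitions; state, weight for final states) is an assumption based on the usual AT&T convention, and `parse_att` and `SAMPLE` are hypothetical names.

```python
# Hypothetical reader for illustration only; not part of HFST.
def parse_att(text):
    """Parse tab-separated AT&T text into (transitions, final_states)."""
    transitions = []
    finals = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == 5:
            # Assumed order: source, target, input symbol, output symbol, weight.
            src, dst, isym, osym, weight = fields
            transitions.append((int(src), int(dst), isym, osym, float(weight)))
        elif len(fields) == 2:
            # Assumed order: final state, final weight.
            state, weight = fields
            finals[int(state)] = float(weight)
        else:
            raise ValueError("unexpected number of fields: %r" % line)
    return transitions, finals


# A transducer mapping "cat" to "dog", written by hand for this example.
SAMPLE = "0\t1\tc\td\t0.0\n1\t2\ta\to\t0.0\n2\t3\tt\tg\t0.0\n3\t0.0\n"

transitions, finals = parse_att(SAMPLE)
print(len(transitions), finals)  # 3 {3: 0.0}
```

Because fields are split on tabs, a tab that is itself a symbol has to appear in its escaped form @_TAB_@, which is consistent with the escapes listed under "Other special symbols".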


-- ErikAxelson - 2011-03-09
Topic revision: r2 - 2011-03-10 - TommiPirinen
 