HFST: Symbol Layer API

This document deals with the datatypes related to symbol manipulation and support of symbols at the API level.

Datatypes for Symbols and Symbol Alphabets

Types

NOTE. In the tables below, the hfst-lexc and hfst-twolc columns indicate what is or will be needed in hfst-lexc and hfst-twolc, respectively. Y stands for yes, N stands for no.

Name of the type Comment hfst-lexc hfst-twolc
Symbol A symbol handle. Y Y
SymbolSet A set of symbols aka an alphabet of symbols. Y Y
SymbolIterator Iterator over the symbols in a SymbolSet.    
SymbolPair A pair of symbols representing a transition in a transducer. Y Y
SymbolPairSet A set of symbol pairs aka an alphabet of symbol pairs. Y Y
SymbolPairIterator Iterator over the set of symbol pairs in a SymbolPairSet.    
KeyTable A table for storing symbols associated with keys. Y Y

Functions

Defining and using symbols:

Type Function Comment hfst-lexc hfst-twolc
Symbol define_symbol (char *s) Define a symbol in the symbol table with the symbol name s. Y Y
bool is_symbol (const? char *s) Whether the string s indicates a symbol. Y Y
Symbol get_symbol (const? char *s) Find the symbol for the symbol name s. (s must refer to a symbol name. Use is_symbol to check this if you are not sure.) Y Y
char * get_symbol_name (Symbol s) Find the symbol name for the symbol s. Y Y
bool is_equal (Symbol s1, Symbol s2) Whether the symbol s1 is equal to symbol s2.
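
The functions above behave like an interning table, where each distinct symbol name maps to one handle. The following is a minimal sketch of that idea, not the HFST implementation. It assumes that a Symbol is an unsigned integer handle and that the symbol table is a process-wide global; neither is stated in this document. The sketch also takes and returns const char * where the table above leaves constness open.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <unordered_map>

// Assumption of this sketch: a Symbol is an integer handle into a global table.
typedef unsigned int Symbol;

static std::unordered_map<std::string, Symbol> name_to_symbol;  // name -> handle
static std::deque<std::string> symbol_names;                    // handle -> name

// Define a symbol named s; defining the same name twice yields the same handle.
Symbol define_symbol(const char *s) {
    std::unordered_map<std::string, Symbol>::const_iterator it = name_to_symbol.find(s);
    if (it != name_to_symbol.end()) return it->second;
    Symbol handle = static_cast<Symbol>(symbol_names.size());
    symbol_names.push_back(s);
    name_to_symbol[s] = handle;
    return handle;
}

// Whether the string s names a defined symbol.
bool is_symbol(const char *s) { return name_to_symbol.count(s) != 0; }

// Find the handle for the name s; s must already be defined.
Symbol get_symbol(const char *s) {
    assert(is_symbol(s));
    return name_to_symbol.find(s)->second;
}

// Find the name for the handle s. A deque keeps element addresses stable
// under push_back, so the returned pointer stays valid as the table grows.
const char *get_symbol_name(Symbol s) {
    assert(s < symbol_names.size());
    return symbol_names[s].c_str();
}

// With integer handles, symbol equality is plain integer equality.
bool is_equal(Symbol s1, Symbol s2) { return s1 == s2; }
```

A caller that is unsure whether a name is defined would guard get_symbol with is_symbol, as the comment on get_symbol recommends.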

Defining and using alphabets of symbols:

Type Function Comment hfst-lexc hfst-twolc
SymbolSet* create_empty_symbol_set () Define an empty set of symbols.   Y
SymbolSet* insert_symbol (Symbol s, SymbolSet *Si) Insert s into the set of symbols Si and return the updated set.   Y
bool has_symbol (Symbol s, SymbolSet *Si) Whether symbol s is a member of the set of symbols Si.   Y
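
As an illustration of the three set functions above, here is a small sketch that backs SymbolSet with std::set. The choice of container and the integer Symbol type are assumptions made for this example only.

```cpp
#include <cassert>
#include <set>

typedef unsigned int Symbol;          // assumed handle type
typedef std::set<Symbol> SymbolSet;   // assumed representation of an alphabet

// Create an empty set; the caller owns the returned pointer.
SymbolSet *create_empty_symbol_set() { return new SymbolSet(); }

// Insert s into Si and return the same, updated set.
SymbolSet *insert_symbol(Symbol s, SymbolSet *Si) {
    assert(Si != nullptr);
    Si->insert(s);
    return Si;
}

// Whether s is a member of Si.
bool has_symbol(Symbol s, const SymbolSet *Si) {
    assert(Si != nullptr);
    return Si->count(s) != 0;
}
```

Because insert_symbol returns the updated set, calls can be nested, for example insert_symbol(2, insert_symbol(1, create_empty_symbol_set())).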

Iterators over symbols:

Type Function Comment hfst-lexc hfst-twolc
SymbolIterator begin_sigma_symbol (SymbolSet *Si) Iterator pointing to the beginning of the symbol set Si.    
SymbolIterator end_sigma_symbol (SymbolSet *Si) Iterator pointing to the end of the symbol set Si.    
size_t size_sigma_symbol (SymbolSet *Si) Number of symbols in the symbol set Si.    
Symbol get_sigma_symbol (SymbolIterator si) Get the symbol represented by the symbol iterator si.    
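
The iterator functions follow the usual begin/end pattern. The sketch below shows a traversal under the assumption that SymbolSet is a std::set and SymbolIterator is its const iterator; list_symbols is a hypothetical helper introduced only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

typedef unsigned int Symbol;                       // assumed handle type
typedef std::set<Symbol> SymbolSet;                // assumed representation
typedef SymbolSet::const_iterator SymbolIterator;  // assumed iterator type

SymbolIterator begin_sigma_symbol(SymbolSet *Si) { assert(Si != nullptr); return Si->begin(); }
SymbolIterator end_sigma_symbol(SymbolSet *Si) { assert(Si != nullptr); return Si->end(); }
std::size_t size_sigma_symbol(SymbolSet *Si) { assert(Si != nullptr); return Si->size(); }
Symbol get_sigma_symbol(SymbolIterator si) { return *si; }

// Hypothetical helper: collect all symbols of Si in iteration order.
std::vector<Symbol> list_symbols(SymbolSet *Si) {
    std::vector<Symbol> out;
    out.reserve(size_sigma_symbol(Si));
    for (SymbolIterator it = begin_sigma_symbol(Si); it != end_sigma_symbol(Si); ++it) {
        out.push_back(get_sigma_symbol(it));
    }
    return out;
}
```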

Defining and using symbol pairs:

Type Function Comment hfst-lexc hfst-twolc
SymbolPair* define_symbolpair (Symbol s1, Symbol s2) Define a symbol pair with input symbol s1 and output symbol s2.    
Symbol get_input_symbol (SymbolPair *s) Get the input symbol of SymbolPair s.    
Symbol get_output_symbol (SymbolPair *s) Get the output symbol of SymbolPair s.    

Defining and using alphabets of symbol pairs:

Type Function Comment hfst-lexc hfst-twolc
SymbolPairSet* create_empty_symbolpair_set () Define an empty set of pairs.   Y
SymbolPairSet* insert_symbolpair (SymbolPair *p, SymbolPairSet *Pi) Insert p into the set of symbol pairs Pi and return the updated set.   Y
bool has_symbolpair (SymbolPair *p, SymbolPairSet *Pi) Whether symbol pair p is a member of the set of symbol pairs Pi.   Y
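
Symbol pairs and sets of symbol pairs can be sketched in the same way as symbols and symbol sets. The example below assumes that a SymbolPair is a std::pair of integer handles and that a SymbolPairSet stores pairs by value; both are illustrative choices rather than the documented representation.

```cpp
#include <cassert>
#include <set>
#include <utility>

typedef unsigned int Symbol;                     // assumed handle type
typedef std::pair<Symbol, Symbol> SymbolPair;    // (input, output), assumed
typedef std::set<SymbolPair> SymbolPairSet;      // assumed representation

// Define a pair with input s1 and output s2; the caller owns the pointer.
SymbolPair *define_symbolpair(Symbol s1, Symbol s2) { return new SymbolPair(s1, s2); }

Symbol get_input_symbol(const SymbolPair *p) { assert(p != nullptr); return p->first; }
Symbol get_output_symbol(const SymbolPair *p) { assert(p != nullptr); return p->second; }

SymbolPairSet *create_empty_symbolpair_set() { return new SymbolPairSet(); }

// Insert a copy of *p into Pi and return the same, updated set.
SymbolPairSet *insert_symbolpair(const SymbolPair *p, SymbolPairSet *Pi) {
    assert(p != nullptr && Pi != nullptr);
    Pi->insert(*p);
    return Pi;
}

// Membership is by value: any pair with the same input and output matches.
bool has_symbolpair(const SymbolPair *p, const SymbolPairSet *Pi) {
    assert(p != nullptr && Pi != nullptr);
    return Pi->count(*p) != 0;
}
```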

Iterators over symbol pairs:

Type Function Comment hfst-lexc hfst-twolc
SymbolPairIterator begin_pi_symbol (SymbolPairSet *Pi) Iterator pointing to the beginning of the symbol pair set Pi.    
SymbolPairIterator end_pi_symbol (SymbolPairSet *Pi) Iterator pointing to the end of the symbol pair set Pi.    
size_t size_pi_symbol (SymbolPairSet *Pi) Number of symbol pairs in the symbol pair set Pi.    
SymbolPair* get_pi_symbolpair (SymbolPairIterator pi) Get the symbol pair represented by the symbol pair iterator pi.    

Defining the connection between symbols and keys of transducers:

NOTE. The 1:N relation between keys and symbols is useful for dealing with equivalence classes of symbols.

Type Function Comment hfst-lexc hfst-twolc
KeyTable* create_key_table () Create a key table. Y Y
bool is_key (Key i, KeyTable *T) Whether i indicates an existing key in key table T.    
bool is_symbol (Symbol s, KeyTable *T) Whether s indicates an existing symbol in key table T.    
Key associate_new_key (KeyTable *T, Symbol s) Associate symbol s with the first unused key and return that key. Y  
void associate_key (Key i, KeyTable *T, Symbol s) Associate the key i in the key table T with the symbol s. Y Y
Key get_key (Symbol s, KeyTable *T) Find the key for the symbol s in key table T. Y Y
Symbol get_key_symbol (Key i, KeyTable *T) Find a symbol for the key i in key table T. (The default symbol is a symbol created from the key if no other symbol is associated with the key (???), and an implementation dependent choice if there are several symbols associated with the key.) Y Y
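
The 1:N relation between keys and symbols can be illustrated with two maps, one from each key to its set of symbols and one from each symbol back to its single key. This is only a sketch: the integer types for Key and Symbol are assumptions, get_key_symbol here picks the smallest associated symbol (the text above leaves the choice to the implementation), and the default symbol for a key with no associated symbol is not modelled.

```cpp
#include <cassert>
#include <map>
#include <set>

typedef unsigned int Symbol;   // assumed handle type
typedef unsigned short Key;    // assumed key type

struct KeyTable {
    std::map<Key, std::set<Symbol> > key_to_symbols;  // one key, many symbols
    std::map<Symbol, Key> symbol_to_key;              // each symbol has one key
};

KeyTable *create_key_table() { return new KeyTable(); }

bool is_key(Key i, KeyTable *T) { return T->key_to_symbols.count(i) != 0; }
bool is_symbol(Symbol s, KeyTable *T) { return T->symbol_to_key.count(s) != 0; }

// Associate key i with symbol s; in this sketch s must not have a key yet.
void associate_key(Key i, KeyTable *T, Symbol s) {
    assert(!is_symbol(s, T));
    T->key_to_symbols[i].insert(s);
    T->symbol_to_key[s] = i;
}

// Associate s with the first unused key and return that key.
Key associate_new_key(KeyTable *T, Symbol s) {
    Key i = 0;
    while (is_key(i, T)) ++i;
    associate_key(i, T, s);
    return i;
}

Key get_key(Symbol s, KeyTable *T) {
    assert(is_symbol(s, T));
    return T->symbol_to_key.find(s)->second;
}

// Return one symbol for key i: the smallest handle, a choice made for this sketch.
Symbol get_key_symbol(Key i, KeyTable *T) {
    assert(is_key(i, T));
    return *T->key_to_symbols.find(i)->second.begin();
}
```

Associating two symbols with the same key makes them members of one equivalence class: get_key returns the same key for both.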

Type Function Comment hfst-lexc hfst-twolc
KeyTable* read_key_table (const? char *filename) Read a key table from the file filename. Y Y
void write_key_table (KeyTable *T, const? char *filename) Write the key table T in file filename. Y Y

Reading and Writing Symbol Strings

Read transducers

  • in text format from pair strings and input streams and
  • in binary format from files and input streams so that the keys used in the transducer are harmonized according to a key table.

Type Function Comment hfst-lexc hfst-twolc
TransducerHandle longest_match_tokenizer (TransducerHandle t, Key marker) Create a longest match tokenizer based on paths in transducer t.
TransducerHandle tokenize_pair_string (TransducerHandle t, char * input, Key m, KeyTable *T) Tokenize the input pair string into a synchronic transducer using the longest matching input symbol print names provided in the tokenization transducer t, using the marker with key m. The string input must consist of UTF-8 symbols found in T. Y N
TransducerHandle tokenize_string_pair (TransducerHandle t, char * i1, char * i2, KeyTable *T) Tokenize the input strings i1 and i2 into a synchronic transducer using the longest matching input symbol print names provided in the tokenization transducer t using marker m. The symbol sequences of the two strings are matched and, if the symbol sequences are of unequal length, the final symbols of the longer sequence are matched against the epsilon symbol. The strings i1 and i2 must consist of UTF-8 symbols found in T. Y N
TransducerHandle read_transducer_text (FILE *file, KeyTable *T, bool sfst=false) Make a transducer as defined in text form in file. The parameter sfst defines whether SFST text format is used, otherwise OpenFST (AT&T) format is used. N N
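
The two ideas behind the tokenization functions, longest-match tokenization and padding the shorter symbol sequence with epsilon, can be sketched without any transducer machinery. In the example below a plain set of symbol print names stands in for the tokenization transducer, and the function names longest_match_tokens and align_with_epsilon as well as the epsilon print name passed by the caller are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Split input greedily into the longest known symbol names, left to right.
// Precondition (as for the API functions above): input consists of known symbols.
std::vector<std::string> longest_match_tokens(const std::set<std::string> &names,
                                              const std::string &input) {
    std::size_t max_len = 0;
    for (std::set<std::string>::const_iterator it = names.begin(); it != names.end(); ++it) {
        max_len = std::max(max_len, it->size());
    }
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        // Try the longest candidate first and shrink until a known name matches.
        std::size_t len = std::min(max_len, input.size() - pos);
        while (len > 0 && names.count(input.substr(pos, len)) == 0) --len;
        assert(len > 0);  // no known symbol starts at pos
        out.push_back(input.substr(pos, len));
        pos += len;
    }
    return out;
}

// Pair the two symbol sequences position by position; the trailing symbols
// of the longer sequence are paired with epsilon.
std::vector<std::pair<std::string, std::string> > align_with_epsilon(
        const std::vector<std::string> &a, const std::vector<std::string> &b,
        const std::string &epsilon) {
    std::vector<std::pair<std::string, std::string> > out;
    std::size_t n = std::max(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(std::make_pair(i < a.size() ? a[i] : epsilon,
                                     i < b.size() ? b[i] : epsilon));
    }
    return out;
}
```

With the names a, b, c, ab and +Pl, the string cab+Pl is split into c, ab and +Pl: the multi-character symbol ab wins over the single symbol a because the longer match is tried first.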

Read transducers in binary format from files and input streams.

Type Function Comment hfst-lexc hfst-twolc
int read_format (istream &is=cin) Read the format of the next transducer in the input stream is. Y N
TransducerHandle read_transducer (const? char *filename, KeyTable *symbols) Read a binary transducer from file filename. The transducer is assumed to have an alphabet. N N
TransducerHandle read_transducer (KeyTable *symbols, istream &is=cin) Read transducer in binary form from input stream is. The transducer is assumed to have an alphabet. The transducer is harmonized according to the key table symbols. Y N
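
One plausible reading of harmonization is that the keys stored with a transducer are renumbered so that each symbol name ends up with the key it has in the caller's key table. The sketch below illustrates that reading only; it is an assumption, not the documented algorithm. The simplified name-to-key map, the function harmonize_keys and the policy of giving unknown symbols the next key after the largest one in use are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

typedef unsigned short Key;                    // assumed key type
typedef std::map<std::string, Key> NameToKey;  // simplified stand-in for a KeyTable

// Next key after the largest one in use (0 for an empty table).
static Key next_unused_key(const NameToKey &target) {
    Key next = 0;
    for (NameToKey::const_iterator it = target.begin(); it != target.end(); ++it) {
        if (it->second >= next) next = static_cast<Key>(it->second + 1);
    }
    return next;
}

// Build a map from the keys stored with a transducer to the keys of the
// target table, matching by symbol name. Symbols that the target table
// lacks are added to it under a fresh key.
std::map<Key, Key> harmonize_keys(const std::map<Key, std::string> &stored,
                                  NameToKey &target) {
    std::map<Key, Key> remap;
    for (std::map<Key, std::string>::const_iterator it = stored.begin();
         it != stored.end(); ++it) {
        NameToKey::iterator found = target.find(it->second);
        if (found == target.end()) {
            Key fresh = next_unused_key(target);
            found = target.insert(std::make_pair(it->second, fresh)).first;
        }
        remap[it->first] = found->second;
    }
    return remap;
}
```

Applying the resulting map to every transition label of the transducer that was read would then make its keys agree with the target table.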

Writing transducers

Write transducers

  • in text format into symbol pair strings and output streams and
  • in binary format to output streams and files so that the print names associated to keys are stored with the transducer.

Type Function Comment hfst-lexc hfst-twolc
char* path_to_string (TransducerHandle t, KeyTable *T) Get a pair string representation of the one-path transducer t.    

Type Function Comment hfst-lexc hfst-twolc
void print_transducer (TransducerHandle t, bool print_weights=false, ostream &ostr=cout, bool old=false) Print transducer t in text format. The parameter print_weights indicates whether weights are included. The output stream ostr indicates where printing is directed. The parameter old indicates whether transducer t should be printed in SFST text format. N Y?

Type Function Comment hfst-lexc hfst-twolc
void write_transducer (TransducerHandle t, KeyTable *symbols, ostream &os=cout) Write t in binary form to output stream os. symbols indicates the name of the key table that is stored with the transducer. Y Y
void write_transducer (TransducerHandle t, const? char *filename, KeyTable *symbols) Write transducer t to file filename. N Y

Auxiliary Datatypes for Symbols and Alphabets

A range is a list of symbols used for making the construction of sets of symbol pairs more convenient. A list of range lists makes up the data type Ranges.

NB. The range and ranges datatypes are not considered essential for manipulating transducers and are currently not officially supported in the HFST API.

Note. hfst-twolc uses functions that convert ranges into sets, since symbol sets are used in two different meanings in two-level rules. They can be used as a part of regular expressions (a set-like use) and as realisations of variables (a range-like use).

Note. hfst-lexc uses ranges to build some transducers, e.g. characterSetToDisjunction. This is implemented directly in the HFST API as make_pi_transducer(SymbolPairSet).

Types

Name of the type Comment hfst-lexc hfst-twolc
Range A list of 'Symbol's N Y
Ranges A list of 'Range's N N

Functions

Functions for manipulating a range of symbols:

Type Function Comment hfst-lexc hfst-twolc
Range * create_empty_range () Define an empty range. N Y
Range * insert_value (Symbol c, Range *r) Insert Symbol c into Range r. N Y
Range * append_range (Range *r1, Range *r2) Append Symbols from Range r2 to Range r1. N N
Range * complement_range (Range *r, SymbolPairSet Pi) Complement Range r with regard to a set of symbol pairs Pi. N N
SymbolPairSet define_pair_range (Range *r1, Range *r2, bool final) Make a set of symbol pairs from input Range r1 to output Range r2 matching each symbol in the input and output range. If final is true, the final symbol in the shorter range is matched against the remaining symbols in the longer range, otherwise the epsilon symbol is matched. N Y
Range * insert_values (unsigned int c1, unsigned int c2, Range *r) THIS BELONGS TO THE SYMBOL LAYER: Insert Symbols from c1 to c2, inclusive, into Range r. N N
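
The effect of the final parameter of define_pair_range is easiest to see in code. The sketch below assumes that a Range is a vector of integer symbol handles, that the epsilon symbol has handle 0, and that insert_values treats its integer arguments directly as symbol handles; all three are assumptions made for the example. For an empty range, the sketch falls back to epsilon even when final is true.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

typedef unsigned int Symbol;                               // assumed handle type
typedef std::vector<Symbol> Range;                         // ordered list of symbols
typedef std::set<std::pair<Symbol, Symbol> > SymbolPairSet;

const Symbol EPSILON = 0;  // assumed handle of the epsilon symbol

Range *insert_value(Symbol c, Range *r) { r->push_back(c); return r; }

// Insert the symbols c1..c2, inclusive, into r.
Range *insert_values(unsigned int c1, unsigned int c2, Range *r) {
    assert(c1 <= c2);
    for (unsigned int c = c1; ; ++c) {
        r->push_back(c);
        if (c == c2) break;  // test before incrementing, so c2 == UINT_MAX cannot wrap
    }
    return r;
}

// Symbol at position i of r, or the filler used once r is exhausted.
static Symbol symbol_at(const Range &r, std::size_t i, bool final) {
    if (i < r.size()) return r[i];
    if (final && !r.empty()) return r.back();  // repeat the final symbol
    return EPSILON;                            // otherwise pad with epsilon
}

// Pair the symbols of r1 (input) and r2 (output) position by position.
SymbolPairSet define_pair_range(const Range *r1, const Range *r2, bool final) {
    SymbolPairSet pairs;
    std::size_t n = r1->size() > r2->size() ? r1->size() : r2->size();
    for (std::size_t i = 0; i < n; ++i) {
        pairs.insert(std::make_pair(symbol_at(*r1, i, final), symbol_at(*r2, i, final)));
    }
    return pairs;
}
```

For the input range 1, 2, 3 and the output range 7, final=true gives the pairs 1:7, 2:7 and 3:7, while final=false gives 1:7, 2:epsilon and 3:epsilon.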

Functions for manipulating ranges of symbols:

Type Function Comment hfst-lexc hfst-twolc
Ranges * create_empty_ranges () Define empty ranges. N N
Ranges * insert_range (Range *r, Ranges *rs) Insert Range r at the end of Ranges rs. N N
SymbolPairSet define_pair_ranges (Ranges *rs1, Ranges *rs2) Calls define_pair_range(r1, r2, false) for each matching range in input Ranges rs1 and output Ranges rs2. N N

To assist in constructing and manipulating symbols in a finite-state transducer calculus, some auxiliary operations on variables and agreement variables are defined:

Type Function Comment hfst-lexc hfst-twolc
bool define_symbol_variable (char *name, Range *r) Define a variable name with values in the Range r. N Y?
Range * copy_symbol_variable (char *name) Copy the value of the Range variable name. N Y?
bool define_symbol_agreement_variable (char *name, Range *r) Define an agreement variable name with values in the Range r. N Y?
Range * copy_symbol_agreement_variable (char *name) Copy the value of the Range agreement variable name. N Y?

Other Functions

Unicode conversions: These are basic functions which may also be provided by some standard package.

Type Function Comment hfst-lexc hfst-twolc
unsigned int utf8_to_int (char *s) Integer value for a UTF-8 character s. N  
char * int_to_utf8 (unsigned int c) UTF-8 character for an unsigned integer value c. N  
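
For reference, the two conversions can be written in a few lines using the standard UTF-8 bit layout. This is a sketch rather than the HFST code: it returns a std::string instead of a char * to avoid questions of memory ownership, it decodes only the first character of its argument, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <string>

// Decode the first UTF-8 encoded character of s into its code point.
unsigned int utf8_to_int(const char *s) {
    const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
    unsigned int c = p[0];
    int extra = 0;  // number of continuation bytes that follow the lead byte
    if (c < 0x80) {
        return c;  // single-byte (ASCII) character
    } else if ((c & 0xE0) == 0xC0) {
        c &= 0x1F; extra = 1;
    } else if ((c & 0xF0) == 0xE0) {
        c &= 0x0F; extra = 2;
    } else {
        assert((c & 0xF8) == 0xF0);  // anything else is not a valid lead byte
        c &= 0x07; extra = 3;
    }
    for (int i = 1; i <= extra; ++i) {
        assert((p[i] & 0xC0) == 0x80);  // continuation bytes look like 10xxxxxx
        c = (c << 6) | (p[i] & 0x3Fu);
    }
    return c;
}

// Encode the code point c as a UTF-8 byte sequence.
std::string int_to_utf8(unsigned int c) {
    assert(c <= 0x10FFFF);
    std::string out;
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}
```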

Legacy functions from SFST (not supported in the API):

TransducerHandle insert_pair_string (TransducerHandle t, char *s, KeyTable *T) Insert symbol pair string s without multi-character symbols into transducer t, which is a trie. N N
TransducerHandle read_words (char *filename, KeyTable *T, TransducerHandle tokenizer?) Minimal disjunction of all words listed in file filename. N N
vector <char*> string_paths (TransducerHandle t, bool spaces=false) Retrieve all paths from initial to final state in transducer t in text format. The parameter ‘spaces’ indicates whether the print names of symbol pairs are separated by spaces. N N
void print_transducer_paths (TransducerHandle t, FILE *outfile=stdout) Print all paths from initial to final state in transducer t to FILE outfile. N N


-- KristerLinden - 01 Oct 2008

Topic revision: r20 - 2009-09-30 - ErikAxelson
 