SWIG bindings for HFST

SWIG is a compatibility layer between C/C++ code and higher-level languages. The HFST library has been wrapped with SWIG to produce Python bindings. The bindings work with both Python 2 and Python 3. Python 3 has better support for UTF-8 symbols, so it is probably the better choice for linguistic use.

For installing the bindings, see our download page.

The SWIG bindings aim to duplicate functionality from the C++ API, with certain alterations and extensions documented on this page. Documentation on the Python API is available here.

Changes due to Python conformance

The HFST API defines various C++ operators as methods with names like operator=(). Since operator= is not a legal identifier in Python, and names of this type are cumbersome in any case, these methods are also available under the following names:

  • operator=() is assign()
  • operator<<() is redirect()
  • operator[]() is get() if the function is declared const, __getitem__() otherwise (this is a special Python name that emulates the behaviour of the [] operator)
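
Python cannot overload plain assignment: a = b only rebinds the name a, which is why the wrapper offers assign(). The sketch below uses a hypothetical stand-in class, Symbols, which is not part of libhfst. It shows the difference between rebinding a name and an assign()-style copy, and how __getitem__() backs the [] syntax.

```python
class Symbols:
    """Hypothetical stand-in for a wrapped C++ container (not part of libhfst)."""

    def __init__(self, items):
        self._items = list(items)

    def assign(self, other):
        # Plays the role of C++ operator=: copy other's contents into self.
        self._items = list(other._items)
        return self

    def get(self, index):
        # Plays the role of the const operator[].
        return self._items[index]

    def __getitem__(self, index):
        # Special method that Python calls for obj[index].
        return self._items[index]


a = Symbols(["f", "o", "o"])
b = Symbols(["b", "a", "r"])
alias = a      # a second name for the same object

a.assign(b)    # changes the object itself, so alias sees the new contents
print(alias[0], alias.get(1))

a = b          # only rebinds the name a; alias still refers to the old object
print(a is b, alias is a)
```

After assign() the object that alias refers to has new contents, whereas after a = b the names a and alias refer to different objects.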

HfstBasicTransducer uses iterators that can be problematic in Python, so the following functions are also available (in both SWIG and C++):

  • std::vector < HfstState > states() const
  • const HfstTransitions & transitions(HfstState s) const
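
With states() and transitions(), a transducer can be walked with two plain for loops. The sketch below assumes that each transition offers get_input_symbol(), get_output_symbol(), get_weight() and get_target_state(), as the C++ transition class does; whether the Python wrapper exposes exactly these accessors is an assumption. So that the example runs without libhfst, FakeTransition and FakeTransducer are hypothetical stand-ins; an HfstBasicTransducer would be passed to list_arcs() in their place.

```python
def list_arcs(fsm):
    """Return (source, input, output, weight, target) for every transition."""
    arcs = []
    for state in fsm.states():
        for tr in fsm.transitions(state):
            # Assumption: accessor names follow the C++ transition class.
            arcs.append((state,
                         tr.get_input_symbol(),
                         tr.get_output_symbol(),
                         tr.get_weight(),
                         tr.get_target_state()))
    return arcs


class FakeTransition:
    """Hypothetical stand-in for a transition object (not part of libhfst)."""

    def __init__(self, target, isymbol, osymbol, weight):
        self._target = target
        self._isymbol = isymbol
        self._osymbol = osymbol
        self._weight = weight

    def get_target_state(self):
        return self._target

    def get_input_symbol(self):
        return self._isymbol

    def get_output_symbol(self):
        return self._osymbol

    def get_weight(self):
        return self._weight


class FakeTransducer:
    """Hypothetical stand-in that offers states() and transitions(s)."""

    def __init__(self, arcs_by_state):
        self._arcs = arcs_by_state

    def states(self):
        return sorted(self._arcs)

    def transitions(self, state):
        return self._arcs[state]


fsm = FakeTransducer({0: [FakeTransition(1, "foo", "bar", 0.5)], 1: []})
for arc in list_arcs(fsm):
    print("%d  %s:%s  %f  -> %d" % arc)
```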

SWIG-specific extensions

C++ side

To make HFST's containers easier to use in Python and to ensure that Python users get actual objects instead of pointers, various extension functions and objects have been written on the C++ side of the wrapper. They currently reside in hfst_swig_extensions.h in the swig directory and are as follows:

To get the object pointed to by a pointer:

HfstTransducer ptrvalue(const HfstTransducer * t) returns a copy of the object pointed to by t

To avoid the use of reference parameters:

HfstTwoLevelPaths extract_paths(const HfstTransducer &t, int max_num=-1, int cycles=-1) returns an STL container instead of taking one as a reference parameter
HfstTwoLevelPaths extract_paths_fd(const HfstTransducer &t, int max_num=-1, int cycles=-1, bool filter_fd=false) the same as above

To offer an easier structure for representing paths in a transducer:

class HfstPath represents a path in a transducer. It has the members float weight, string input and string output; the two strings are equal if the path represents only one level.
typedef HfstPathVector represents a collection of paths in a transducer; it is a vector of HfstPath objects.

To transform the detailed and complex structures HfstOneLevelPaths and HfstTwoLevelPaths into HfstPathVectors:

HfstPathVector detokenize_paths(HfstOneLevelPaths * holps) returns a list of paths where the strings are concatenations of all the elements in holps without symbols that represent flag diacritics
HfstPathVector detokenize_paths(HfstTwoLevelPaths holps) the same as above

NOTE: HfstOneLevelPaths contains only one level, so it is converted to an HfstPathVector where the input and output members are the same.

The following examples demonstrate how these functions can be used:


import libhfst

# Print all string pairs recognized by a transducer and their weights.
# 'transducer' is assumed to be an existing HfstTransducer.
paths = libhfst.extract_paths(transducer)  # returns input and output levels
for path in libhfst.detokenize_paths(paths):
    print("%s:%s  %f" % (path.input, path.output, path.weight))

# Print all results for a given input string and the corresponding weights.
# lookup is implemented only for transducers in optimized lookup format
transducer.convert(libhfst.hfst_olw_type())
for path in libhfst.detokenize_paths(transducer.lookup("foo")):
    # lookup returns one level, so we could also use path.input
    print("%s  %f" % (path.output, path.weight))

If you are interested in how paths in a transducer are tokenized or what flag diacritics they contain, the following functions are also available.

OneLevelPathVector vectorize(HfstOneLevelPaths * holps) returns an actual vector object containing "one level paths". This is in practice an STL container of the type vector<pair <float, vector<string> > >
OneLevelPathVector purge_flags(OneLevelPathVector olpv) returns a OneLevelPath vector without symbols that represent flag diacritics
HfstPathVector detokenize_vector(OneLevelPathVector olpv) returns an HfstPathVector where the strings are concatenations of all the elements from a particular OneLevelPath

TwoLevelPathVector vectorize(HfstTwoLevelPaths holps) returns an actual vector object containing "two level paths". This is in practice an STL container of the type vector<pair <float, vector< pair < string, string > > > >
TwoLevelPathVector purge_flags(TwoLevelPathVector tlpv) returns a TwoLevelPath vector without symbols that represent flag diacritics
HfstPathVector detokenize_vector(TwoLevelPathVector tlpv) returns an HfstPathVector where the strings are concatenations of all the elements from a particular TwoLevelPath

The detokenize_paths functions presented above are in fact a composition of these functions, equivalent to detokenize_vector(purge_flags(vectorize(...))).
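
The composition can be mimicked in pure Python to show what each stage does. The sketch below is not the wrapper's implementation. It works on plain Python tuples that stand in for a TwoLevelPathVector, so the vectorize step is skipped, and the function bodies are illustrative only. is_flag() assumes the usual flag diacritic notation @X.FEATURE@ or @X.FEATURE.VALUE@, with X one of P, N, R, D, C, U, and dropping a pair when either side is a flag is a choice made for this sketch.

```python
import re

# Assumed flag diacritic notation: @P.FEATURE.VALUE@, @R.FEATURE@, ...
FLAG = re.compile(r"^@[PNRDCU]\.[^.@]+(\.[^.@]+)?@$")


def is_flag(symbol):
    return FLAG.match(symbol) is not None


def purge_flags(tlpv):
    """Drop symbol pairs in which either side is a flag diacritic."""
    return [(weight, [(i, o) for (i, o) in pairs
                      if not (is_flag(i) or is_flag(o))])
            for (weight, pairs) in tlpv]


def detokenize_vector(tlpv):
    """Concatenate the symbols of each path into input and output strings."""
    return [{"input": "".join(i for (i, _) in pairs),
             "output": "".join(o for (_, o) in pairs),
             "weight": weight}
            for (weight, pairs) in tlpv]


def detokenize_paths(tlpv):
    # Same order of operations as the composition described in the text.
    return detokenize_vector(purge_flags(tlpv))


# One weighted path: a list of input:output symbol pairs with one flag.
paths = [(0.5, [("c", "c"), ("@P.NUM.PL@", "@P.NUM.PL@"),
                ("a", "a"), ("t", "t"), ("+Pl", "s")])]

for path in detokenize_paths(paths):
    print("%s:%s  %f" % (path["input"], path["output"], path["weight"]))
```

Calling purge_flags() and detokenize_vector() separately is useful when the tokenization or the flags themselves are of interest; otherwise the composed function is enough.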

Because of SWIG function shadowing, the following functions

HfstTransducer::substitute(const HfstSymbolSubstitutions&)
HfstTransducer::substitute(const HfstSymbolPairSubstitutions&)

must be called in Python as

HfstTransducer::substitute_symbols(const HfstSymbolSubstitutions&)
HfstTransducer::substitute_symbol_pairs(const HfstSymbolPairSubstitutions&)
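
The two functions differ in the kind of mapping they take. Assuming that HfstSymbolSubstitutions maps symbols to symbols and HfstSymbolPairSubstitutions maps symbol pairs to symbol pairs, the Python arguments would be dictionaries shaped like symbol_map and pair_map below. The helper functions apply_symbol_map() and apply_pair_map() are hypothetical and not part of libhfst; they apply the mappings to a plain list of symbol pairs only to illustrate the difference.

```python
def apply_symbol_map(pairs, substitutions):
    """Replace single symbols on both the input and the output side."""
    return [(substitutions.get(i, i), substitutions.get(o, o))
            for (i, o) in pairs]


def apply_pair_map(pairs, substitutions):
    """Replace whole input:output pairs; other pairs are left untouched."""
    return [substitutions.get(pair, pair) for pair in pairs]


path = [("a", "b"), ("c", "c")]

symbol_map = {"a": "A", "c": "C"}     # shape assumed for substitute_symbols
pair_map = {("a", "b"): ("x", "y")}   # shape assumed for substitute_symbol_pairs

print(apply_symbol_map(path, symbol_map))
print(apply_pair_map(path, pair_map))
```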

Python side

lookup_clean(transducer, string) takes a transducer and a string and returns a Python list of HfstPaths of detokenized, flag-purified results
print(transducer) prints an HfstTransducer or an HfstBasicTransducer

Using the SWIG bindings

Differences between C++ and Python

C++ | Python | Note
#include <HfstTransducer.h> | import libhfst | All library functionality is in one module.
hfst::implementations::HfstBasicTransducer t; | t = libhfst.HfstBasicTransducer() | Namespaces are not used.
hfst::StringPairSet alphabet = t.get_alphabet(); | alphabet = t.get_alphabet() | The type of an object is not declared explicitly by the user.

Automatic type conversions

Each entry gives the C++ type and the Python type it is converted to, followed by a C++ example and a Python example.

string -> string
  C++:     "foobar"
  Python:  'foobar'

pair -> tuple
  C++:     pair<string, string>("foo", "bar")
  Python:  ('foo', 'bar')

set -> tuple
  C++:     HfstTransducer transducer("foo", "bar", SFST_TYPE);
           alphabet = transducer.get_alphabet();
  Python:  transducer = libhfst.HfstTransducer('foo', 'bar', libhfst.SFST_TYPE)
           print(transducer.get_alphabet())
           # prints ('@_EPSILON_SYMBOL_@', '@_IDENTITY_SYMBOL_@', '@_UNKNOWN_SYMBOL_@', 'foo', 'bar')

vector -> tuple
  C++:     HfstTokenizer tok;
           vector<pair<string, string> > tokenization = tok.tokenize("foo");
  Python:  tok = libhfst.HfstTokenizer()
           print(tok.tokenize('foo'))
           # prints (('f','f'),('o','o'),('o','o'))

map -> dictionary
  C++:     map<string, string> substitutions;
           substitutions["a"] = "A";
           substitutions["b"] = "B";
           substitutions["c"] = "C";
           substitutions["d"] = "D";
  Python:  substitutions = { 'a':'A', 'b':'B', 'c':'C', 'd':'D' }

NULL -> None
  C++:     HfstTransducer * transducer = lexc_compiler.compileLexical();
           assert(transducer != NULL);
  Python:  transducer = lexc_compiler.compileLexical()
           assert(transducer is not None)

Caveats

Note that you must use the function assign() instead of Python's = when assigning the result of an operation back to the same variable. For example, if you have defined

>>> import libhfst
>>> tr = libhfst.HfstTransducer('a', 'b', libhfst.foma_type())

and write

>>> tr = tr.invert()
>>> print(tr)

you will get an error message, something like

terminate called after throwing an instance of 'FunctionNotImplementedException'
Aborted

Instead, you must write

>>> tr.assign(tr.invert())
>>> print(tr)

or

>>> tr_inverted = tr.invert()
>>> print(tr_inverted)

Shortcomings

Missing functions

The following functions that take a C++ function as a parameter are not supported in Python:

HfstTransducer &transform_weights(float (*func)(float));
HfstTransducer &substitute(bool (*func)(const StringPair &sp, StringPairSet &sps));

The following functions are not yet available:

// Functions in namespace hfst::rules

The following datatype conversions do not work yet:

StringPairSet

Issues on Windows

Demos

-- SamHardwick - 2011-08-16