HFST: Tokenizers for All

This page investigates the tokenizer technology needed in typical FST applications. We hope to do this as joint work with the HunMorph groups at BUTE, because we have shared interests and potentially complementary strengths.

Requirements Survey

Scenario 1: Implementation of deterministic left-to-right longest-match tokenizer

GNU Flex works this way. In addition, it provides scanner states (start conditions) and lookahead (trailing context).
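The longest-match rule can be illustrated with a minimal Python sketch. It is not how Flex is implemented (Flex compiles its patterns into a DFA). The function name `longest_match_tokenize` and the fallback to single characters are assumptions made for illustration only.

```python
def longest_match_tokenize(text, symbols):
    """Greedy left-to-right longest-match tokenization (illustrative sketch).

    At each position, take the longest string in `symbols` that matches
    there; if none matches, emit a single character.
    """
    max_len = max((len(s) for s in symbols), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        # Try candidate lengths from longest to shortest (down to 2).
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in symbols:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


if __name__ == "__main__":
    print(longest_match_tokenize("cat+Noun+Pl", {"+Noun", "+Pl", "+N"}))
```

Because the choice at each position is final, the tokenizer never backtracks: `+Noun` wins over `+N` whenever both match.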

Scenario 2: Implementation of non-deterministic longest-match tokenizer

LEXC allows for non-deterministic tokenization of multi-character symbols in lexemes and in regular expressions.
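To make the non-determinism concrete, the following Python sketch enumerates every way of splitting a string into multi-character symbols and single characters. It is only an illustration of the ambiguity, not a description of how LEXC works internally. The function name `all_tokenizations` is hypothetical.

```python
def all_tokenizations(text, symbols):
    """Return every segmentation of `text` into tokens (illustrative sketch).

    A token is either a single character or a multi-character symbol
    from `symbols` that occurs at the current position.
    """
    if not text:
        return [[]]
    # The single character is always a candidate; so is every declared
    # symbol that is a prefix of the remaining text.
    heads = {text[0]} | {s for s in symbols if s and text.startswith(s)}
    results = []
    for head in sorted(heads, key=len):
        for rest in all_tokenizations(text[len(head):], symbols):
            results.append([head] + rest)
    return results


if __name__ == "__main__":
    for segmentation in all_tokenizations("+Pl", {"+Pl"}):
        print(segmentation)
```

A deterministic longest-match tokenizer would keep only the segmentation with the longest tokens. A non-deterministic one keeps all of them as alternative paths. The number of segmentations can grow exponentially with input length, so a practical implementation would represent them as a transducer or lattice, not as a list.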

Scenario 3: Implementation of a parser for sets of regular expressions

XFST can parse automata that encode regular expressions.

Scenario 4: User-defined syntax for regular expressions

Scenario 5: Comparison to the Unitex and NooJ systems, which use recursive transition networks (RTNs) or more powerful formalisms

Some Implementation Ideas

Bottom-up recognition of regular expressions using a fixed point semantics

This needs look-ahead that processes multi-character tokens only when ready.

HunMorph tokenizer

A tokenizer transducer that makes long tokens only when sure

-- AnssiYliJyra - 20 Aug 2008

Topic revision: r2 - 2008-08-21 - AnssiYliJyra