Difference: HfstApplicationTutorial (8 vs. 9)

Revision 9 (2017-02-20) - SamHardwick

Line: 451 to 451
 The pattern matching application hfst-pmatch can be used to build a trivial tokenizer by itself:
Added:
set need-separators off
define nonword Whitespace | Punct;
define word LC([nonword | #]) [\nonword]+ RC([nonword | #]);
define token word | Punct;
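As a sketch of how this can be tried out (the file names here are illustrative, and the script would additionally need a top-level entry point such as define TOP token; which this excerpt doesn't show), the rules are compiled with hfst-pmatch2fst and text is fed to hfst-pmatch on standard input:

# Compile the pmatch script into a pmatch archive
hfst-pmatch2fst < tokenize.pmscript > tokenize.pmhfst
# Locate tokens in running text read from standard input
echo "A trivial example." | hfst-pmatch tokenize.pmhfst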
Line: 474 to 475
Changed:
< To directly produce the tokens in some serialized format, there's hfst-proc2 (to be renamed in the near future). By default it outputs one token per line, and nothing else (meaning text not identified as being part of a token is suppressed). There are various output options to support piping the stream of tokens for further processing.
> To directly produce the tokens in some serialized format, there's hfst-tokenize. By default it outputs one token per line, and nothing else (meaning text not identified as being part of a token is suppressed). There are various output options to support piping the stream of tokens for further processing. It also lets you skip writing tokenization rules altogether by running it directly on an HFST transducer, in which case it builds a simple default tokenizer from the strings the transducer accepts, with typical punctuation awareness.
  Of course, tokenization isn't quite as simple as that. In English, for example, applications typically expect contractions (like isn't) comprising multiple words to produce multiple tokens (is and n't). We might redefine our tokenizer to know about the finite number of English contractions:
Added:
set need-separators off
define contraction {'m} | {n't} | {'re} | {'s} | {'ll} | {'d} | {'ve};
define nonword Whitespace | Punct;
define word LC([nonword | #]) [\nonword]+ RC([[nonword - "'"] | # | contraction]);
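Assuming the rest of the script also makes contractions tokens in their own right (e.g. define token word | contraction | Punct; which this excerpt doesn't show) and is compiled as before, a contraction would then come out as two tokens, roughly:

echo "It isn't easy." | hfst-tokenize contractions.pmhfst
# It
# is
# n't
# easy
# .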
Line: 491 to 493
 Besides refinements like this, or allowing hyphens inside words, or phenomena like compounding in other languages, we may want to consider some multiword units to be single tokens. For example, for most purposes it would make sense to consider New York or car park to be single tokens. Issues like this can become complex and require a considerable amount of linguistic data (not to mention decision-making). If we have a satisfactory finite-state dictionary (or morphology) that does what we want, we can largely piggyback our tokenization on top of it:
Added:
set need-separators off
define morphology @bin"/path/to/morphology.hfst";
! If the morphology has morphological tags in the output, we might