HFST: Dictionaries

This page lists dictionaries, other language models and rulesets that can be compiled and used with HFST tools.

Morphological analysers

For morphologically simple languages, a dictionary can be built from a word-form list, which is compiled into an HFST transducer with the simple tool hfst-strings2fst. More complex morphologies require tools that support more elaborate formalisms, e.g. LexC, TwolC, Xfst or SFST-PL.
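The "needs preprocessing" step for word-form lists can be sketched as follows. This is a minimal illustration, not the script used for any dictionary listed here: it assumes a hypothetical tab-separated input (word form, then analysis) and writes one `form:analysis` pair per line, which is the kind of string-pair input hfst-strings2fst reads. Entries containing a colon are skipped, because the separator handling for such entries is not covered on this page.

```python
def wordlist_to_pairs(text):
    """Convert 'form<TAB>analysis' lines into 'form:analysis' lines.

    The two-column input layout is an assumption made for this sketch.
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # ignore lines that do not have exactly two columns
        form, analysis = fields
        if ":" in form or ":" in analysis:
            continue  # colon is the pair separator; skipped in this sketch
        pairs.append(f"{form}:{analysis}")
    return pairs


# Hypothetical sample data, not taken from any real dictionary.
SAMPLE = "cat\tcat+N+Sg\ncats\tcat+N+Pl\n\nbroken line\n"
PAIRS = wordlist_to_pairs(SAMPLE)
print("\n".join(PAIRS))
```

The resulting lines could be saved to a file and passed to hfst-strings2fst; the exact options needed (for example, how the individual strings are combined into one transducer) should be checked against the tool's own help output.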

The HFST tools contain clones of all the Xerox finite-state morphology tools: hfst-lexc, hfst-twolc, hfst-xfst and hfst-compose-intersect. This listing shows the freely available language descriptions that have been compiled with these HFST tools.

The HFST tools also contain a parser for the SFST-PL formalism, called hfst-sfstpl2fst, which is copied verbatim from the freely available SFST tools. This list contains all free SFST morphologies that compile with the HFST port.

Ready-compiled versions of the morphologies listed here are (or will soon be) available on the HFST download pages.

| Language | Original source | Licence | HFST download | Compilation notes |
| [it] Italian | Morph-It | LGPL | | needs preprocessing, then hfst-strings2fst |
| [fr] French | Morphalou | Restricted non-free | | needs preprocessing, then hfst-strings2fst |
| [se,smj,sma,smn,sms,sjd] Sámi (Northern, Lule, Southern, Inari, Skolt, Kildin) | Giellatekno svn | GPL | | make hfst works, otherwise Xerox tools will be used |
| [fi] Finnish | Omorfi homepages | GPLv3 | Omorfi downloads | Optimised for HFST |
| [de] German | Morphisto homepages | GNU lesser GPL | | hfst-sfstpl2fst |
| [kl] Greenlandic | University of Tromsø SVN | Unknown | | Replace xfst with foma or hfst-xfst |
| [myv] Erzyan | Giellatekno svn | Unknown | | Work in progress |
| [tr] Turkish | TRMorph | GPL | | hfst-sfstpl2fst |
| [sv] Swedish | see below | CC ShareAlike 1.0 | HFST downloads | needs preprocessing, then hfst-sfstpl2fst |
| [es] Spanish (verbs) | - | GPL | - | hfst-lexc |

Swelex

The Swelex morphological analyser for Swedish is based on Den stora svenska ordlistan (copyright (c) 2003 Tom Westerberg) and is distributed under the Creative Commons ShareAlike 1.0 license.

The Swelex sources and ready-compiled transducers can be fetched from the HFST download pages. The Swelex file is named hfst-swedish.tar.gz.

The file swedish.hfst contains the morphology in the standard HFST format, and swedish.hfst.ol contains it in the optimized lookup format. The directory src contains Krister Lindén's scripts for building Swelex.
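Once a transducer such as swedish.hfst.ol is run through a lookup tool, the results usually have to be read back into a program. The sketch below groups analyses by input word. It assumes, without confirmation from this page, that the lookup output is tab-separated with the input word first, the analysis second and an optional weight third, with blank lines between words. The sample strings and tags are invented for illustration and are not real Swelex output.

```python
def parse_lookup_output(text):
    """Group tab-separated lookup lines into {word: [(analysis, weight), ...]}.

    The column layout is an assumption; weight is None when the column is absent.
    """
    analyses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate the results of different words
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a result line
        word, analysis = fields[0], fields[1]
        weight = None
        if len(fields) > 2:
            try:
                weight = float(fields[2])
            except ValueError:
                weight = None  # leave unparseable weights unset
        analyses.setdefault(word, []).append((analysis, weight))
    return analyses


# Invented sample in the assumed layout.
SAMPLE_OUTPUT = "hund\thund+N\t0.0\nhund\thund+X\t1.5\n\nkatt\tkatt+N\n"
RESULT = parse_lookup_output(SAMPLE_OUTPUT)
print(RESULT)
```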

Dictionaries (for spell-checking)

There are preliminary scripts for converting hunspell dictionaries to HFST automata. Although the hunspell dictionary format was later extended to support morphological analyses, the majority of dictionaries do not take advantage of this, and it is not yet implemented for HFST.

Language
[pt-BR] Portuguese (Brazil)
[pl] Polish
[cs] Czech
[hu] Hungarian
[se] Northern Sámi
[sk] Slovak
[nl] Dutch
[gsc] Gascon
[af] Afrikaans
[is] Icelandic
[el] Greek
[it] Italian
[gu] Gujarati
[lt] Lithuanian
[en-GB] English (Great Britain)
[de] German
[] Croatian
[es] Spanish
[ca] Catalan
[sl] Slovenian
[] Faeroese
[fr] French
[sv] Swedish
[en-US] English (U.S.)
[et] Estonian
[pt] Portuguese (Portugal)
[] Irish
[] Friulian
[] Nepalese
[th] Thai
[] Esperanto
[he] Hebrew
[bn] Bengali
[] Frisian
[ia] Interlingua
[] Persian
[] Indonesian
[] Azerbaijani
[] Hindi
[] Amharic
[] Chichewa
[] Kashubian
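As a rough picture of the first step in a hunspell conversion, the sketch below reads the stems and affix flags from the text of a hunspell .dic file. It is not the preliminary HFST conversion script mentioned above: it only assumes the common .dic layout (an entry count on the first line, then one `stem/FLAGS` entry per line, optionally followed by whitespace-separated morphological fields) and ignores escaped slashes. A real conversion would also have to expand the flags using the rules in the matching .aff file, which is not attempted here.

```python
def read_dic_stems(text):
    """Return (stem, flags) pairs from the text of a hunspell .dic file."""
    lines = text.splitlines()
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]  # the first line holds the approximate entry count
    stems = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        entry = line.split()[0]  # drop any morphological fields after whitespace
        stem, _, flags = entry.partition("/")
        stems.append((stem, flags))
    return stems


# Hypothetical dictionary content; the flags are made up for this example.
SAMPLE_DIC = "3\nwork/AB\nhello\nwalk/B\tpo:verb\n"
STEMS = read_dic_stems(SAMPLE_DIC)
print(STEMS)
```

The bare stems could then be written out as a plain word list, which is the simplest input from which a spell-checking automaton can be compiled.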

Hyphenation patterns

The hyphenation patterns of the TeX system can be compiled into regular finite-state automata and used with HFST tools. The available patterns are listed below.

Language
[no] Norwegian
[de-1996] German (Germany, 1996)
[de-1901] German (Germany, 1901)
[nl] Dutch
[en-GB] English (Great Britain)
[] Irish
[en-US] English (U.S.)
[hu] Hungarian
[sv] Swedish
[is] Icelandic
[et] Estonian
[ru] Russian
[cs] Czech
[] Ancient Greek
[uk] Ukrainian
[da] Danish
[sk] Slovak
[sl] Slovenian
[es] Spanish
[fr] French
[ia] Interlingua
[el-polyton] Greek (Polytonic)
[] Upper Sorbian
[gl] Galician
[ro] Romanian
[] Mongolian
[fi] Finnish
[ca] Catalan
[el-monoton] Greek (Monotonic)
[] Serbian
[] Serbo-Croatian
[] Sanskrit
[] Croatian
[] Coptic
[] Latin
[bg] Bulgarian
[pt] Portuguese
[] Basque
[id] Indonesian
[tr] Turkish
[zh-Latn] Chinese (Pinyin)
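To show what the TeX hyphenation patterns encode before they are compiled into automata, here is a small plain-Python sketch of how such patterns are applied to a word. It is not the finite-state compilation used with HFST. Digits in a pattern give a priority to the gap at that position, the highest priority at each gap wins, and odd values allow a break. The patterns in the example are toy patterns invented for illustration, and the `left_min` and `right_min` defaults are assumptions rather than values taken from any pattern file.

```python
def parse_pattern(pattern):
    """Split a pattern like 'a1b' into its letters 'ab' and gap levels [0, 1, 0]."""
    letters = []
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)  # digit applies to the gap before the next letter
        else:
            letters.append(ch)
            levels.append(0)
    return "".join(letters), levels


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return the pieces of word after splitting at the allowed break points."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i]: level of the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels:
                for offset, level in enumerate(levels):
                    scores[start + offset] = max(scores[start + offset], level)
    pieces = []
    last = 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:  # odd level: a break is allowed before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


# Toy patterns, not taken from any real TeX pattern file.
print("-".join(hyphenate("winter", ["n1t"])))
print("-".join(hyphenate("winter", ["n1t", "in2t"])))
```

In the second call the even level in the longer pattern overrides the odd level of the shorter one, so the word is left unbroken.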
Topic revision: r5 - 2017-11-18 - ErikAxelson
 