HFST: Ideas for Software Projects

If you are interested in participating, please come to the IRC channel #hfst on the Freenode network (link: irc://Freenode/#hfst , if your browser and operating system support IRC links). If you cannot use IRC for some reason, use hfst-development@lists.sourceforge.net to ask questions. If you can sign in to this wiki, you may use the wiki page given for each task to ask questions and document your project details.

| *Task* | *Difficulty* | *Description* | *Rationale* | *Requirements* | *Mentors* | *Questions & Discussion* |
| Convert existing flex+yacc parsers to spirit or another proper object-oriented system | Hard | The old parsers for Xerox's regular expressions, twolc rulesets, etc. should be rewritten with boost::spirit. | The flex+yacc parsers are hard to use properly on different data sources (files, strings, interactive input) or multiple times within one tool or program. Parsers based on spirit should provide slightly better ways to support this. | C++, boost::spirit | TommiPirinen | HfstParserCollection |
| Unify parser interfaces converting various dictionaries to HFST automata | Medium | Create parsers that convert other systems' dictionary and/or rule formalisms to FST format. Suitable candidates are e.g. Apertium (XML), Hunspell (partially done), Aspell, Ispell, etc. Ultimately this would work towards a single (modular) utility for creating language models from varying combinations of source language descriptions, rules, etc. | Support for existing language models and dictionaries is crucial for HFST to be useful. | C++, (XML) | TommiPirinen | HfstParserCollection |
| Graphical installer and manager for dictionaries and language models | Entry-Level | Create a graphical system for easy installation and management of dictionaries for average end users. | Installing and managing dictionaries is currently unnecessarily hard, and the only way for an average end user to use them is through application-specific packages, such as an OpenOffice.org plugin with a pre-installed dictionary. | Any scripting language with support for Linux, Mac OS X and Windows and a neat GUI (possibly some C++ hacking to start language pack support in the command-line tools) | TommiPirinen | HfstLanguagePackages |
| Create and improve test utilities for FST dictionaries and rules | Medium | Make better tools for regression testing language models during development and for locating the actual errors. Another possible project would be to help develop the current language modeling formalisms towards Knuth's literate programming. | Current tools for testing that language models created with FST tools actually work are based on simple corpus coverage tests and simple path and traversal checks in the automaton. More easily automated tests would help in developing and using language models rapidly. | C++ | TommiPirinen | HfstTestTools |
| Create web demos for HFST's language models | Medium | Create web applications for using HFST-based language models and dictionaries online. This may require building a service that keeps FST dictionaries loaded in memory for web use. | Current web demos are relatively simple CGI scripts that call command-line tools and display simple, mostly unformatted results. They also load all language models on every page view. The end result is slow, not very neat or modern, and hard to update. | CGI, Ajax, and other web technologies | TommiPirinen | HfstWebDemosGsoc |
| Create word games using HFST-based language models | Medium | Create various simple word games that show HFST-based dictionaries used in an interesting real-world application. | With the capability to convert all Hunspell and Xerox morphologies into dictionaries, we already have a large variety of dictionaries available for nice word game applications. Examples of such word games can be seen in web services such as Mindjolt games or Yahoo. The games can be made as Facebook games, standalone web apps or standard applications. Word games are also a neat way to collect missing words into dictionaries, as well as to show that the dictionaries are useful; using word games to collect out-of-vocabulary words is on my current research agenda, so you'll get a lot of support for this one ;-) | C++, CGI or even Flash(?) | TommiPirinen | |
| Improve corpus handling tools | Medium | Create better corpus analysis tools by extending the current systems. Unrecognised words can often be analysed using spelling correction or by falling back to other analysers or guessers. All of the separate functionality exists, so it is a matter of integrating the additional systems into the analysis. The work also includes investigating the best way to use the different fallbacks. The optional post-blank/pre-blank formula of Apertium's corpus tools needs to be included in HFST. Better handling of Unicode (e.g. via ICU) is also needed. | Current HFST corpus tools are still not on par with the features needed by Apertium or other users of HFST language models. | C++ | TommiPirinen | HfstCorpusTools |
| Improve bindings for scripting languages | Medium | Create libraries and bindings for HFST for scripting languages, such as Python or Ruby. One possible project would be integration with popular natural language toolkits, such as NLTK (Natural Language Toolkit). | The main programming language that most language technology students learn is not C++, Java or even Unix shell, but Python. HFST needs to be available to them as well. | Python or Ruby, C++ | TommiPirinen | HfstScriptLanguageBindings |
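
The fallback idea in the "Improve corpus handling tools" task (try the main analyser first, then spelling correction, then a guesser) could be sketched roughly as below. This is only an illustration of the control flow, not HFST code: the toy lexicon, the correction table, the suffix guesser and all function names are hypothetical stand-ins for real transducer lookups, and the ordering of the fallbacks is exactly the part the task asks to investigate.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# An analyser maps a token to a list of analysis strings
# (an empty list means "unrecognised").
Analyser = Callable[[str], List[str]]


def analyse_with_fallbacks(
    token: str, analysers: Iterable[Tuple[str, Analyser]]
) -> Tuple[Optional[str], List[str]]:
    """Try each (name, analyser) pair in order; return the first non-empty result."""
    for name, analyse in analysers:
        analyses = analyse(token)
        if analyses:
            return name, analyses
    return None, []


# Hypothetical toy data standing in for real language models.
LEXICON = {"cat": ["cat+N+Sg"], "cats": ["cat+N+Pl"]}
CORRECTIONS = {"catz": "cats"}


def lexicon_analyser(token: str) -> List[str]:
    # Stand-in for a lookup in the main morphological analyser.
    return list(LEXICON.get(token, []))


def spelling_fallback(token: str) -> List[str]:
    # Stand-in for a spelling corrector: analyse the corrected form, if any.
    corrected = CORRECTIONS.get(token)
    return lexicon_analyser(corrected) if corrected else []


def suffix_guesser(token: str) -> List[str]:
    # Stand-in for a guesser: treat unknown words ending in "s" as plural nouns.
    if len(token) > 1 and token.endswith("s"):
        return [token[:-1] + "+N+Pl+Guessed"]
    return []


# Assumed ordering: main analyser, then speller, then guesser.
CHAIN = [
    ("lexicon", lexicon_analyser),
    ("speller", spelling_fallback),
    ("guesser", suffix_guesser),
]


def analyse_corpus(tokens: Iterable[str]) -> List[Tuple[str, Optional[str], List[str]]]:
    """Analyse every token, recording which analyser in the chain produced the result."""
    results = []
    for token in tokens:
        source, analyses = analyse_with_fallbacks(token, CHAIN)
        results.append((token, source, analyses))
    return results
```

Recording which analyser produced each result makes it possible to compare different fallback orderings on the same corpus, for example by counting how many tokens each stage handles.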
Topic revision: r6 - 2012-03-19 - TommiPirinen