Additional Linguistics Resources

The purpose of this page is to list additional resources that might need to be added to the Language Bank services.


  1. Treebanks:
  2. The Gothenburg Project
  3. LDC membership, ELRA/DA,...
  4. Word-aligned Bible corpus (Aika Oy)
  5. Melammu project (IAAS)
  6. Paikannimirekisteri, the place name register (KOTUS: Toni Suutarinen)
  7. Translation examples (KOTUS: Alexander Paile)
  8. Kielitoimiston sanakirja (KOTUS, Toni Nykänen)
  9. University of Helsinki Language Corpus Server (Pirkko Suihkonen)
  10. Corpus of Early English Correspondence (CEEC) is an electronic text collection, i.e. a corpus, of early English private letters. The material covers 778 writers and about 6,000 letters (2.7 million words in total) from the years 1410-1681. In addition to the text, the material includes annotations and metadata. There are two versions of the corpus: a plain raw-text version with some text-level coding, and a linguistically annotated version in which, among other things, each word is tagged with its part of speech. We have coded identifying information at the beginning of each letter in COCOA format. In addition, we have two Excel databases, a letter database and a sender database, which contain more detailed information on the letters and their writers. Collection Editor: VARIENG research unit. Point Of Contact: Terttu Nevalainen.
  11. utairc, University of Tampere Information Retrieval Corpus. Copyright holders: Aamulehti, Kauppalehti, Keskisuomalainen. Collection Editor: Department of Information Studies, University of Tampere. Points of contact: Aamulehti - Kari Hurtola (uta fi), Irmeli Toivanen (keskisuomalainen fi).

More Content in Lemmie

  1. Gutenberg material for Lemmie
  2. Additions to the Finland-Swedish corpus
  3. Add the kirjat (books) corpus to Lemmie.

Statistics, Statistical Machine Translation and Language Models

  2. GIZA++
  3. SRILM
  4. Morfessor
  5. alignment tools
  6. MATLAB, S

Finite-State Methods

  1. Edinburgh Speech Tools Library
  2. BETA
  3. SFST
  5. Daciuk's FSA
  6. FSA Utilities
  7. WFST
  8. ASTL
  9. MIT FST
  10. Carmel primer
  11. Carmel
  12. UNITEX
  13. REGI
  14. The following Xerox Finite-State Tools are no longer available at CSC:
    • xtokenize-2.3.3
    • xtwolc-3.2.4
    • xlexc-3.5.4
    • xlookup-2.3.5
    • xfst-8.3.1
    • tokenizer.fsto

UNICODE-Related Software

Publication tools

  1. GhostView, GSView, GhostScript, dvips, dvipdf
  2. AccessPdf
  3. Troff, Groff, Gpic, Graphviz/dot
  4. LaTeX and TeX: Using LaTeX on the server, Using IPA with LaTeX (i.e. International Phonetic Alphabet)
    • pdftex, pdflatex, makeindex
  5. fonts for TeX: IPA..
  6. TeX macros: ITF (Interlinear Text Formatter), a set of TeX macros for typesetting interlinear texts. For TeX 2.9 or 3.
  7. latex2html

Speech Technology Tools

  1. WaveSurfer
  2. Festival
  3. FreeTTS

Linguistic Analysers and Databases

  1. WordNet - a newer version needed
  2. EuroWordNet
  3. Brill's tagger

Parsing systems

  1. OpenCCG
  2. VISL Constraint Grammar
  3. Link Grammar
  4. CG

General Purpose Programming Languages

  1. Scheme
  2. Franz Lisp - LISP would definitely be needed. Allegro Common Lisp was used at the University of Helsinki, but there may now be free alternatives available, including CMUCL, CLISP and GCL (GNU Common Lisp). Somebody ought to study these alternatives. See the end of the article in for a list of Lisp implementations. -- KimmoKoskenniemi - 16 May 2006 - 9:20
  3. Oz

Further Links and Information

Comments and wishes:

Non-Linux Software (not to be used in the Language Bank)

Some XP software of interest but not available for Linux

  • AdaptIt - Adapt It provides tools for translating text from one language to a related language. No linguistic analysis is performed.
  • CarlaStudio - CarlaStudio is a program that you use both to model languages and then to put the model to work parsing texts and adapting texts to another language. There are three tasks that you can perform with this version of CarlaStudio: Language Modeling, Text Processing, and Parse Problem Fixing.
  • Dictionary Development Process - a set of tools and techniques designed to facilitate the development of dictionaries for minority languages
  • Document Preparation Aids
    • WRDCHG makes changes to the words of a text, while preserving capitalization, punctuation, and formatting.
    • SYLCHK identifies potential spelling errors in text, using decomposition into syllables as the method for identifying possible errors, and returns these as a list.
    • SYLCOR is an interactive editor for correcting potential errors, using the same method as SYLCHK for finding possible errors.
    • SPLCOR is an interactive editor for correcting potential errors, based on word lists of known correct words.
    • HYPHEN introduces a user-determined character at syllable boundaries. It uses a different mechanism for identifying syllables than SYLCHK and SYLCOR.
    • DELIM checks text to see that delimiters (characters like quote marks, brackets, braces, parentheses, and so on) are paired and properly nested.
  • IPA Help - Use this program to learn the sounds of the IPA by clicking on a character in the chart
  • IT - a set of software tools for developing a corpus of annotated interlinear texts
  • PhoneBox - Utility for phonological analysis using language data structured by Standard Format Markers.
  • RTF2SFM - RTF2SFM converts a styled Word .RTF file to UTF8-encoded SFM (standard format markers).
  • Toolbox - Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data.
  • WinCecil - Use this program to view speech recordings, automatic pitch contours, and spectrograms. Recording limit is 3 seconds.
  • SpeechTools
  • a multimedia vocabulary learning tool; like an electronic flash card system with audio
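
For readers on Linux, the kind of check that DELIM is described as performing in the list above (delimiters paired and properly nested) can be approximated with a small stack-based script. The following is only a minimal sketch and not DELIM's actual method: the function name check_delimiters and the set of delimiter pairs are assumptions made for illustration, and symmetric quote marks such as " are not handled because they have no distinct opening and closing forms.

```python
# Hypothetical sketch of a paired-and-nested delimiter check.
# The delimiter set below is an assumption; extend it as needed.
PAIRS = {")": "(", "]": "[", "}": "{", "\u201d": "\u201c"}  # closer -> opener
OPENERS = set(PAIRS.values())


def check_delimiters(text):
    """Return a list of (position, message) problems; empty if all is well."""
    stack = []     # openers seen so far, as (character, position)
    problems = []
    for pos, ch in enumerate(text):
        if ch in OPENERS:
            stack.append((ch, pos))
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append((pos, f"unexpected {ch!r}"))
    # Anything left on the stack was never closed.
    for opener, pos in stack:
        problems.append((pos, f"unclosed {opener!r}"))
    return problems


if __name__ == "__main__":
    for sample in ["a (b [c] d)", "(a]"]:
        print(repr(sample), check_delimiters(sample))
```

A mismatched closer is reported and skipped rather than popped, so one stray character does not hide later problems behind it.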

Some Mac/Win software not available for Linux

  • Consistent Changes - CC is like the find-and-replace feature in a text editor, but much more powerful because it allows you to make changes which take context into consideration and to make a whole set of changes at once. a PDF course
  • TECkit - TECkit is a low-level toolkit intended to be used by other applications that need to perform encoding conversions
  • Cluster Analysis
  • Conc - Conc is a program for the Macintosh that produces keyword-in-context concordances of words in a text. It can handle both ordinary flat text and multiple-line interlinear text.
  • HyperBIBTEX - a HyperCard application for managing bibliographic databases in a format compatible with BIBTEX
  • IT - a set of software tools for developing a corpus of annotated interlinear texts
  • Rook - System for authoring descriptive grammars in HyperCard.
  • Word Format - Change MS Word documents
  • WordSurv - an aid for determining linguistic relationships through the comparison of word lists
  • WordCorr - goes beyond WordSurv in that it forms and organizes all the correspondence sets implied by the linguist's judgments of comparability
  • WorldPad - an editor with the ability to display text in complex scripts using Graphite

Topic revision: r10 - 2006-12-19 - AnssiYliJyra