salama


The Swahili Language Manager, a system for analysing Swahili texts

Description

SALAMA (Swahili Language Manager) is a multi-purpose language management environment, developed at the University of Helsinki by Arvi Hurskainen, Professor of African languages. It has been used, for example, to annotate the Helsinki Corpus of Swahili.

SALAMA is the result of 19 years of work on describing Swahili computationally. Morphology is described with two-level morphology, and disambiguation is carried out mostly with a Constraint Grammar parser.

Users interested in compiling Swahili dictionaries from corpus texts may find SALAMA particularly useful. Because such activity is usually commercial, the use of SALAMA is not free of charge. Those interested in the service should contact Arvi Hurskainen (arvi hurskainen (at)helsinki fi).

Version and Copyright Information

version:

copyright:

Usage

Commands
  • pre-process - to make the text optimal for linguistic analysis
  • analyze - the basic linguistic analysis of the text
    • analyze-simple-lemma - for extended verbs it gives the non-extended base form as lemma
    • analyze-no-glosses - leaves the English glosses out
    • analyze-no-glosses-simple-lemma - leaves English glosses out; simple verb base as lemma for extended verbs
    • analyze-only - morphological analysis only; functions like analyze-snt
    • analyze-only-simple-lemma - gives as output a simple verb base as lemma for extended verbs
  • disambiguate - tries to resolve ambiguous analyses of words
    • disambiguate-only - expects that the text has already been analyzed. You should not try to run this for raw text!
    • disambiguate-simple-lemma - gives as output a simple verb base as lemma for extended verbs
  • one-line-format - moves the word-form and its analysis to the same line
    • one-line-format-only - expects that the text has already been analyzed and disambiguated
    • one-line-format-simple-lemma - gives as output a simple verb base as lemma for extended verbs
  • list-count-lemmas - gives lemma frequencies. A word with different interpretations is counted as a single word
    • list-count-lemmas-analyze - also gives some morphological information, as well as English glosses
    • list-count-lemmas-simple - gives the frequencies, with a simple base form as lemma for extended verbs
    • list-count-lemmas-analyze-simple - combination of the above. This format is well suited to dictionary compilation
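
As a rough illustration of what a lemma frequency list involves, the following Python sketch counts lemmas in text that has already been analyzed and disambiguated. It is not SALAMA's implementation: the (word-form, lemma) pair input, the sample data and the function name count_lemmas are assumptions made for this example only.

```python
from collections import Counter


def count_lemmas(tokens):
    """Return (lemma, frequency) pairs, most frequent first.

    `tokens` is a hypothetical list of (word_form, lemma) pairs, one per
    running word, with exactly one lemma per word (i.e. already
    disambiguated). Ties are ordered alphabetically by lemma.
    """
    counts = Counter(lemma for _word_form, lemma in tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative input; the pairs are made up for this sketch and are not
# SALAMA output.
sample = [
    ("anasoma", "soma"),
    ("kitabu", "kitabu"),
    ("walisoma", "soma"),
    ("vitabu", "kitabu"),
    ("alisomea", "somea"),
]

lemma_counts = count_lemmas(sample)
for lemma, frequency in lemma_counts:
    print(frequency, lemma)
```

In this sketch the extended verb form is counted under its own lemma. A "simple lemma" variant would presumably count it under the non-extended base instead, which would change the frequencies.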

  • analyze-snt - each program that expects pre-processed text as input has the suffix -snt
  • analyze-simple-lemma-snt
  • analyze-no-glosses-snt
  • analyze-no-glosses-simple-lemma-snt
  • disambiguate-snt
  • disambiguate-simple-lemma-snt
  • one-line-format-snt
  • one-line-format-simple-lemma-snt
  • list-count-lemmas-snt
  • list-count-lemmas-analyze-snt
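
The naming convention above can be captured in a small helper that picks the right program name depending on whether the input has already been pre-processed. This is only a sketch of the convention as listed on this page: the function name command_name and the idea of selecting names in a script are assumptions, and nothing here actually invokes SALAMA.

```python
# Base commands for which this page lists a -snt variant.
SNT_VARIANTS = {
    "analyze",
    "analyze-simple-lemma",
    "analyze-no-glosses",
    "analyze-no-glosses-simple-lemma",
    "disambiguate",
    "disambiguate-simple-lemma",
    "one-line-format",
    "one-line-format-simple-lemma",
    "list-count-lemmas",
    "list-count-lemmas-analyze",
}


def command_name(base, preprocessed):
    """Return the program name to use for `base` (hypothetical helper).

    If the input text is already pre-processed, the -snt variant is
    returned; a ValueError is raised when this page lists no such variant.
    """
    if not preprocessed:
        return base
    if base not in SNT_VARIANTS:
        raise ValueError(f"no -snt variant listed for {base!r}")
    return base + "-snt"


print(command_name("disambiguate", preprocessed=True))
```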

  • prune-tags - removes tags that are not needed in compiling a dictionary
  • remove-num - removes numbers found in the text
  • remove-propname - removes proper names found in the text
  • remove-heur - removes words for which the heuristic guesser has given an interpretation
  • remove-lemma - removes the lemma form from the analysis
  • remove-token - removes the word-form token from the analysis

  • translate - analyzes the text and produces a vertical form of it, in which each word is provided with the lexical information normally found in dictionaries, together with an English gloss. To improve readability, many morphological tags are removed. The lemma form is deleted in the output of this program.

  • vocabulary - analyzes the text and produces an alphabetical list of the lemmas found in it, with the lexical information normally included in dictionaries. English glosses and etymological tags are also included. This format is the closest approximation of a final dictionary that the system can produce automatically.
  • vocabulary-count - same as vocabulary, but it adds the frequency number in front of each lemma
  • vocabulary-less-top500 - this and the following programs work like vocabulary, but they cut out the most common words
  • vocabulary-less-top1000
  • vocabulary-less-top1500
  • vocabulary-less-top2000
  • vocabulary-less-top2500
  • vocabulary-less-top3000
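
The idea behind the vocabulary-less-top programs can be sketched as a filter over a lemma list. This page does not say where the ranking of common words comes from, so the sketch below assumes a precomputed list ordered from most to least frequent. The function name vocabulary_less_top, its parameters and the sample data are hypothetical, and the code is not SALAMA's implementation.

```python
def vocabulary_less_top(lemmas, ranked_common, n):
    """Return an alphabetical list of unique lemmas, minus the n most common.

    `lemmas` holds the lemmas found in a text; `ranked_common` is an
    assumed precomputed ranking of common lemmas, most frequent first.
    """
    excluded = set(ranked_common[:n])
    return sorted({lemma for lemma in lemmas if lemma not in excluded})


# Illustrative data, made up for this sketch.
text_lemmas = ["na", "kitabu", "soma", "na", "ya", "kitabu"]
common_ranking = ["na", "ya", "wa", "kwa"]

less_top2 = vocabulary_less_top(text_lemmas, common_ranking, 2)
print(less_top2)
```

A larger cut-off removes more of the high-frequency words and leaves a shorter list of rarer lemmas.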

  • twol-r - the two-level run-time morphological analyzer
  • cg2run

Help, Manuals and Documentation

help commands:

further information:
More information on SALAMA can be found off-site at http://www.aakkl.helsinki.fi/cameel/corpus/salamainfo.htm.

Bugs

License Text

Other Information

Field of science: Linguistics

Available:
corpus

License: LicenseTypeAASitePaysTheCopy

To be copied to: https://wwwk.csc.fi/english/research/software/salama
To be seen at: http://www.csc.fi/english/research/software/salama
See also: KitWiki.SuomenKielipankki:Dev:Linguistics_Software, Old.ToolResources
The users may add their own comments to: ToolResource_salama_Comments

Topic revision: r10 - 2008-11-21 - HennaRiikkaLaitinen