

Helsinki Corpus of Swahili


The Helsinki Corpus of Swahili (HCS) is an annotated corpus of Standard Swahili text. It contains news texts from several current Swahili newspapers as well as from the news site of Deutsche Welle. It also contains extracts from a number of prose books, covering fiction, education and the sciences.

HCS has been annotated with SALAMA (Swahili Language Manager), a multi-purpose language management environment developed at the University of Helsinki by Arvi Hurskainen, Professor of African languages. The annotation covers features such as the base form of each word (lemma), part of speech, and morphology, including noun class affiliation and verb morphology. It also gives the etymology of loanwords and glosses in English.

Home Page:

Version and Size

Version: The corpus has no version information.

Size: The total size of the corpus is 12.5 million words.

Content and Structure

Subcollection title   Directory                Documents      Tokens
Alasiri               articles/alasiri/             2779     1125958
An-nuur               articles/annuur/               659      837990
Books                 books/                          72     1055425
Dwelle                articles/dwelle/              9831     2479606
Kasheshe              articles/kasheshe/               7       15388
Kiongozi              articles/kiongozi/              70      828256
Komesha               articles/komesha/                1        8948
Lengo                 articles/lengo/                  6        2347
Majira                articles/majira/              6992     3309197
Mfanyakazi            articles/mfanyakazi/            19        7503
Mzalendo              articles/mzalendo/             451      366307
Nipashe               articles/nipashe/             4086     2019471
Rai                   articles/rai/                   33      242281
Uhuru                 articles/uhuru/                816      311581
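
The per-subcollection figures above can be totalled with a short script. The following is only a minimal sketch: the figures are copied by hand from the table, and the names SUBCOLLECTIONS, totals and largest_by_tokens are made up for illustration and are not part of the corpus or of SALAMA. Note that the table counts tokens, which need not match the word count given under Size.

```python
# Figures copied by hand from the subcollection table:
# (title, directory, documents, tokens)
SUBCOLLECTIONS = [
    ("Alasiri", "articles/alasiri/", 2779, 1125958),
    ("An-nuur", "articles/annuur/", 659, 837990),
    ("Books", "books/", 72, 1055425),
    ("Dwelle", "articles/dwelle/", 9831, 2479606),
    ("Kasheshe", "articles/kasheshe/", 7, 15388),
    ("Kiongozi", "articles/kiongozi/", 70, 828256),
    ("Komesha", "articles/komesha/", 1, 8948),
    ("Lengo", "articles/lengo/", 6, 2347),
    ("Majira", "articles/majira/", 6992, 3309197),
    ("Mfanyakazi", "articles/mfanyakazi/", 19, 7503),
    ("Mzalendo", "articles/mzalendo/", 451, 366307),
    ("Nipashe", "articles/nipashe/", 4086, 2019471),
    ("Rai", "articles/rai/", 33, 242281),
    ("Uhuru", "articles/uhuru/", 816, 311581),
]


def totals(rows):
    """Return (total documents, total tokens) over all rows."""
    documents = sum(row[2] for row in rows)
    tokens = sum(row[3] for row in rows)
    return documents, tokens


def largest_by_tokens(rows):
    """Return the title of the subcollection with the most tokens."""
    return max(rows, key=lambda row: row[3])[0]


if __name__ == "__main__":
    documents, tokens = totals(SUBCOLLECTIONS)
    print(f"documents: {documents}")
    print(f"tokens:    {tokens}")
    print(f"largest subcollection by tokens: {largest_by_tokens(SUBCOLLECTIONS)}")
```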

Directory in the Corpus Server


Directory Listing



Access Rights and Conditions


The Group of Unix Users Having Access to the Resource: swahili


Making Bibliographical Reference to the Material:

Corpus texts may be cited as needed in research reporting. When HCS is used in research, due reference to the corpus must be made, e.g.:

HCS 2004. Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC – Scientific Computing Ltd.

Other References

Release Notes and Details

All texts have been manually edited, and typing errors in the original texts have been corrected. However, a small number of mistakes remain in the texts. In addition, because the annotation was carried out without human intervention or checking, some mistakes inevitably remain in the annotation as well.

Sending Bug Reports

To be copied to:
To be seen at:
*See also other resources: in KitWiki, in
All users may add their comments to Resource__Comments


Topic revision: r13 - 2008-11-10 - HennaRiikkaLaitinen
Copyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.