Helsinki Corpus of Swahili


The Helsinki Corpus of Swahili (HCS) is an annotated corpus of Standard Swahili text. It contains news texts from several current Swahili newspapers as well as from the news site of Deutsche Welle. It also contains extracts from a number of prose books, covering fiction, education and the sciences.

HCS has been annotated with SALAMA (Swahili Language Manager), a multi-purpose language management environment developed at the University of Helsinki by Arvi Hurskainen, Professor of African languages. The annotation records features such as the base form of each word (lemma), part of speech, and morphology, including noun class affiliation and verb morphology. It also gives the etymology of loan words and glosses in English.

Version and Size

Version: The corpus has no version information.

Size: The total size of the corpus is 12.5 million words.

Content and Structure

Subcollection title   Directory                Documents     Tokens
Alasiri               articles/alasiri/             2779    1125958
An-nuur               articles/annuur/               659     837990
Books                 books/                          72    1055425
Dwelle                articles/dwelle/              9831    2479606
Kasheshe              articles/kasheshe/               7      15388
Kiongozi              articles/kiongozi/              70     828256
Komesha               articles/komesha/                1       8948
Lengo                 articles/lengo/                  6       2347
Majira                articles/majira/              6992    3309197
Mfanyakazi            articles/mfanyakazi/            19       7503
Mzalendo              articles/mzalendo/             451     366307
Nipashe               articles/nipashe/             4086    2019471
Rai                   articles/rai/                   33     242281
Uhuru                 articles/uhuru/                816     311581
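
The per-subcollection figures in the table above can be tallied to cross-check the stated corpus size. The following is a minimal Python sketch; the dictionary and function names are illustrative only and are not part of the corpus distribution. Summing the token column gives roughly 12.6 million tokens, close to the 12.5 million words stated above; the small difference may reflect how tokens and words are counted, which this description does not specify.

```python
# Figures copied from the subcollection table above: (documents, tokens).
# The names SUBCOLLECTIONS and corpus_totals are illustrative, not part of HCS.
SUBCOLLECTIONS = {
    "Alasiri": (2779, 1125958),
    "An-nuur": (659, 837990),
    "Books": (72, 1055425),
    "Dwelle": (9831, 2479606),
    "Kasheshe": (7, 15388),
    "Kiongozi": (70, 828256),
    "Komesha": (1, 8948),
    "Lengo": (6, 2347),
    "Majira": (6992, 3309197),
    "Mfanyakazi": (19, 7503),
    "Mzalendo": (451, 366307),
    "Nipashe": (4086, 2019471),
    "Rai": (33, 242281),
    "Uhuru": (816, 311581),
}


def corpus_totals(subcollections):
    """Return (total_documents, total_tokens) summed over all subcollections."""
    total_docs = sum(docs for docs, _ in subcollections.values())
    total_tokens = sum(tokens for _, tokens in subcollections.values())
    return total_docs, total_tokens


if __name__ == "__main__":
    docs, tokens = corpus_totals(SUBCOLLECTIONS)
    print(f"{docs} documents, {tokens} tokens")
    # Share of the whole corpus per subcollection, largest first.
    ranked = sorted(SUBCOLLECTIONS.items(), key=lambda kv: kv[1][1], reverse=True)
    for name, (_, sub_tokens) in ranked:
        print(f"{name:12s} {sub_tokens / tokens:6.1%}")
```

The ranking shows that the bulk of the material comes from a few news sources (Majira, Dwelle and Nipashe), which is worth bearing in mind when drawing frequency-based conclusions from the corpus as a whole.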

Access Rights and Conditions

The Group of Unix Users Having Access to the Resource: swahili
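
Whether the current account belongs to this group can be checked with a short script. The sketch below uses only the Python standard library (the os and grp modules) and assumes it is run on the Unix host where the corpus is installed; the function names are illustrative and not part of any HCS tooling.

```python
import grp
import os

# Group name taken from the access statement above.
REQUIRED_GROUP = "swahili"


def current_group_names():
    """Return the names of the Unix groups the current process belongs to."""
    names = set()
    for gid in set(os.getgroups()) | {os.getgid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This gid has no entry in the group database; skip it.
            continue
    return names


def has_group(group_name, group_names):
    """Return True if group_name is among the given group names."""
    return group_name in group_names


if __name__ == "__main__":
    if has_group(REQUIRED_GROUP, current_group_names()):
        print(f"This account is a member of '{REQUIRED_GROUP}'.")
    else:
        print(f"This account is not a member of '{REQUIRED_GROUP}'.")
```

Group membership is only a precondition checked here; how membership is granted is not covered by this description.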


Making Bibliographical Reference to the Material:

Corpus texts may be cited as needed in research reporting. When HCS is used in research, due reference to the corpus must be made, e.g.:

HCS 2004. Compilers: Institute for Asian and African Studies (University of Helsinki) and CSC – Scientific Computing Ltd.

Other References

Release Notes and Details

All texts have been manually edited, and typing errors in the original texts have been corrected. However, a small number of mistakes remain in the texts. In addition, because the annotation was carried out without human intervention or checking, some mistakes inevitably remain in the annotation as well.

Sending Bug Reports

