Character Sets of the Corpora at the UHLCS

The character sets used in the corpora at the UHLCS reflect the history of corpus linguistics and of machine-readable linguistic data. The original texts, typically books and newspapers, were prepared with a variety of text-editing and typesetting programs and character sets. The original texts have been, or are being, scanned into machine-readable form. When the data were adapted to the UNIX operating system, the information in the original documents was preserved.
In the first years, the texts were adapted to the UNIX operating system in seven-bit ASCII. If the original texts contained characters that were not available in the ASCII character set, these characters were replaced with a combination of two or more characters. Later, when the eight-bit Latin-1 character set with various extensions became available in the UNIX operating system, the corpora were also converted to Latin-1.
As soon as the Unicode character sets for different writing systems became publicly available, work began on converting the electronic data in the language archive to Unicode. At the UHLCS this concerns in particular the corpora that were originally prepared in the Cyrillic alphabet. When data originally written in the Cyrillic alphabet were adapted to the UNIX operating system, they were converted to the Latin-1 character set. The first attempts to convert these corpora to UTF-8 were made with the financial support of the ECHO project. The work is still in progress (Dec. 2007): in this phase, the data directories also contain a subdirectory with basic scripts that can be used to convert the data to UTF-8. These directories are named XXX-in–preparation (XXX = abbreviation of the name of the language).
The system used for manually converting the corpora of the Uralic languages originally written in the Cyrillic alphabet into the Latin-1 character set is described in the following document: (1997) Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki. Technical Reports, No. TR-2. Helsinki: Department of General Linguistics, University of Helsinki. Pp. 10–15.
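As an illustration only, the kind of conversion described above (a Latin-1 transliteration turned back into Cyrillic text stored as UTF-8) might be sketched as follows. This is not one of the scripts in the archive, and the mapping table is purely hypothetical: the actual transliteration scheme is the one documented in TR-2. The names TRANSLIT, to_cyrillic, convert_bytes and convert_file are likewise assumptions made for this sketch.

```python
# Minimal sketch: Latin-1 transliteration -> Cyrillic text in UTF-8.
# ASSUMPTION: the table below is a made-up example, not the UHLCS scheme
# (the real scheme is described in Technical Report TR-2).
TRANSLIT = {
    "zh": "ж",   # two-character combination standing for one Cyrillic letter
    "sh": "ш",
    "ö": "ӧ",    # a Latin-1 character standing for a Cyrillic letter
    "a": "а",
    "r": "р",
    "z": "з",
    "s": "с",
    "h": "х",
}


def to_cyrillic(text, table=TRANSLIT):
    """Replace transliterated sequences, always trying the longest key first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping for this character: copy it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_bytes(raw):
    """Decode Latin-1 bytes, map to Cyrillic, and encode the result as UTF-8."""
    return to_cyrillic(raw.decode("latin-1")).encode("utf-8")


def convert_file(src_path, dst_path):
    """Convert one Latin-1 file into a UTF-8 file (paths are hypothetical)."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(convert_bytes(raw))
```

Trying the longest keys first matters because multi-character combinations such as "zh" would otherwise be split into their single-character parts. Characters without an entry in the table, such as punctuation and digits, are passed through unchanged.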
(Pirkko, 2007-2008)

