Difference: CharacterSetsEng (3 vs. 4)

Revision 4, 2008-01-13 - PirkkoSuihkonen

Line: 1 to 1
 
META TOPICPARENT name="CorpusMigrationMarkup"

Character Sets of the Corpora at the UHLCS

Added:
>
>
The use of character sets in the corpora located at the UHLCS reflects the history of corpus linguistics and machine-readable linguistic data. The original texts, typically books and newspapers, were prepared with different kinds of text editing and typesetting programs and character sets. The original texts were either already in machine-readable form or were scanned into it. When the data were adapted to the UNIX operating system, the information in the original documents was preserved. In the first years, the texts were adapted to the UNIX operating system in the seven-bit ASCII code. If the original texts contained characters which were not available in the ASCII character set, these characters were replaced with a combination of two or more characters. Later, when the eight-bit Latin-1 character set with various extensions became available in the UNIX operating system, the corpora were also adapted to the Latin-1 form. As soon as the UNICODE character sets for different alphabet systems became publicly available, conversion of the electronic data in the language archive into the UNICODE form was begun. At the UHLCS this concerns in particular corpora which were originally prepared with the Cyrillic alphabet system. When data originally written with the Cyrillic alphabet system were adapted to the UNIX operating system, they were converted into the Latin-1 character set. The first attempts to convert these corpora into the UTF-8 form were made with the financial support of the ECHO project. The work is still in progress (Dec. 2007): in this phase of the work, the data directories also contain a sub-directory with basic scripts which can be used in converting the data into the UTF-8 form (the name of the directories: XXX-in–preparation, where XXX = abbreviation of the name of the language).
The system used in manually converting the corpora of the Uralic languages originally written with the Cyrillic alphabet system into the Latin-1 alphabet is described in the following document: Pirkko Suihkonen. 1997. Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki. Technical Reports, No. TR-2. Helsinki: Department of General Linguistics, University of Helsinki. Pp. 10–15.
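The re-encoding step mentioned above, from Latin-1 to UTF-8, can be sketched roughly as follows. This is only an illustration in Python and not the scripts kept in the XXX-in–preparation directories, which are not reproduced on this page; the function name latin1_to_utf8 is hypothetical. The sketch changes the byte representation only. Restoring Cyrillic letters from the Latin-1 transliteration would additionally need the mapping described in the technical report, which is not given here.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 (ISO 8859-1) bytes as UTF-8 (hypothetical helper)."""
    # Every byte value maps to a Latin-1 character, so decoding cannot fail.
    text = data.decode("latin-1")
    return text.encode("utf-8")


# A Finnish word with two non-ASCII letters: 5 bytes in Latin-1.
sample = "päivä".encode("latin-1")
converted = latin1_to_utf8(sample)

# Each "ä" takes two bytes in UTF-8, so the result is 7 bytes long.
print(len(sample), len(converted))
print(converted.decode("utf-8"))
```

Because the conversion works on bytes, it can be applied to a whole file read in binary mode. The result is meaningful only if the input really is Latin-1.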
 
Deleted:
<
<
The character sets used in corpora located at the UHLCS reflect the history of deceloping machine-readable corpora. The original texts which typically are books and newspapers are prepared with different kinds of text editing and type-setting programs and font systems. The original texts were in the machine-readable form, or they are scanned into the electronic form. When the data were adapted into the UNIX operating system, they were edited in the way that all the information in the texts were saved. In the case that the original texts were prepared in the way that there had been several levels in preparing the document (text editing, type setting), these levels were opened separately. After that the data were adapted into the UNIX operating system. In the first years, the texts were adapted into the seven-bit ASCII-code. If in the original texts contained characters which were not available in the ASCII character set, these characters were replaced with a combination of two or several characters. Later, when the eight-bit Latin-1 character set with various extensions was available in the UNIX operating system, also the corpora were adapted into the Laten-1 form. As soon as the UNICODE character sets for different alphabet systems became publicly available, there were several efforts in order to convert the corpora into the UNICODE form. This concerns in particular corpora which originally were prepared with the Cyrillic alphabet system. The data written with the Cyrillic alphabet system and adapted into the Latin-1 character set was planned to be converted back into the Cyrillic alphabet system marked with the UNICODE with the help of small perl-programs, scripts. This work was started with the financial support of the ECHO project (http://www.ling.lu.se/projects/echo/). The work is still in progress (Dec. 2007): in this phase of work, in the data directories there is a sub-directrory which contains basic scripts which can be used in converting the data (the name of the directories: XXX-in–preparation (XXX = abbreviation of the name of the language).
 
Deleted:
<
<
The description of the system used in converting the corpora of the Uralic languages originally were written with the Cyrillic alphabet system into the Latin-1 alphabet is given in the following document:
 
Changed:
<
<
Pirkko Suihkonen. 1997. Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki. Technical Reports, No. TR-2. Helsinki: Department of General Linguistics, University of Helsinki. Pp. 10–15.
>
>
The system used in converting the corpora of the Uralic languages originally written with the Cyrillic alphabet system into the Latin-1 alphabet is described in the following document: Pirkko Suihkonen. 1997. Documentation of the Computer Corpora of Uralic Languages at the University of Helsinki. Technical Reports, No. TR-2. Helsinki: Department of General Linguistics, University of Helsinki. Pp. 10–15.
 
Changed:
<
<
(Pirkko, November 2007) In preparation...
>
>
(Pirkko, 2007-2008)
 

Character sets in the corpora

A. general-linguistics
Line: 48 to 16
 
    1. cushitic-lgs
      1. somali
        • forthcoming
Changed:
<
<
          • The character set: Scandinavian alphabet in which '', '', '' and corresponding capital letters are marked with numeral codes: \202, \224, \216.
>
>
          • Character encoding: Scandinavian alphabet in which '', '', '' and the corresponding capital letters are marked with numeric codes: \202, \224, \216, etc.
 
          • Structural encoding: the corpus is organized as follows: the Finnish data are translated into Somali sentence by sentence. The Finnish and Somali sentences are marked with different kinds of tags.
        • metadata-descriptions
    1. semitic-lgs
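The numeric codes listed for the Somali corpus above (\202, \224, \216) look like backslash-octal byte escapes. A minimal sketch of how such escapes could be replaced by real characters is given below in Python. The function name decode_octal_escapes and the example table are assumptions made for illustration. The actual mapping between codes and letters depends on the code page of the original files, which this page does not state.

```python
import re

# A backslash followed by exactly three octal digits, e.g. \224.
OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def decode_octal_escapes(text, table):
    """Replace backslash-octal escapes using a byte-value-to-character table.

    Escapes whose byte value is not in the table are left unchanged.
    """
    def substitute(match):
        byte_value = int(match.group(1), 8)  # "224" (octal) -> 148
        return table.get(byte_value, match.group(0))

    return OCTAL_ESCAPE.sub(substitute, text)


# Assumed one-entry table for illustration only; not the corpus mapping.
example_table = {0o224: "ö"}
print(decode_octal_escapes(r"ty\224 \216", example_table))
```

Leaving unknown escapes untouched keeps the conversion reversible and makes unmapped codes easy to find afterwards.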
 