Difference: CharacterSetsEng (2 vs. 3)

Revision 3 - 2007-12-22 - PirkkoSuihkonen

Line: 1 to 1
 
META TOPICPARENT name="CorpusMigrationMarkup"

Character Sets of the Corpora at the UHLCS

Changed:
<
<
The character sets used in corpora located at the UHLCS reflect the history of machine-readable corpora. The original texts which typically are books and newspapers are prepared with different kinds of text editing and type-setting programs and font systems. The original texts have been in the machine-readable form, or they are scanned into the electronic form. When the data have been adapted into the UNIX operating system, they have been edited in the way that all the information in the texts has been saved. In the case that the original texts have been prepared in the way that there have been several levels in preparing the document, all the levels have been opened separately. After that the data have been adapted into the UNIX operating system. In the first phase, the texts were adapted into the seven-bit ASCII-code. If in the original texts contained characters which were not available in the ASCII character set, that character was replaced with a combination of two or several characters. Later, when the eight-bit Latin-1 character set was available in the UNIX operating system, also the corpora were adapted into the Latin-1 form. As soon as UNICODE character sets for different alphabet systems became publicly available, also the corpora were started to convert into the UNICODE form. This concerns in particular corpora which originally were prepared with the Cyrillic alphabet system. In this process, the goal has been that the data can be converted into the UNICODE form automatically. For that reason there is a sub-directory in the data directories which contains scripts which can be used in converting the data (the name of the directories: XXX-in–preparation (XXX = abbreviation of the name of the language)).
>
>
The character sets used in the corpora located at the UHLCS reflect the history of developing machine-readable corpora. The original texts, which are typically books and newspapers, were prepared with different kinds of text-editing and typesetting programs and font systems. The original texts were either already in machine-readable form or were scanned into electronic form. When the data were adapted to the UNIX operating system, they were edited in such a way that all the information in the texts was preserved. Where the original texts had been prepared in several levels (text editing, typesetting), these levels were opened separately. After that, the data were adapted to the UNIX operating system. In the first years, the texts were adapted to the seven-bit ASCII code. If the original texts contained characters that were not available in the ASCII character set, these characters were replaced with a combination of two or more characters. Later, when the eight-bit Latin-1 character set with various extensions became available in the UNIX operating system, the corpora were also adapted to the Latin-1 form. As soon as the UNICODE character sets for different alphabet systems became publicly available, several efforts were made to convert the corpora into the UNICODE form. This concerns in particular the corpora that were originally prepared in the Cyrillic alphabet system. The plan was to convert the data that had been written in the Cyrillic alphabet system and adapted to the Latin-1 character set back into the Cyrillic alphabet system, encoded in UNICODE, with the help of small Perl programs (scripts). This work was started with the financial support of the ECHO project (http://www.ling.lu.se/projects/echo/). The work is still in progress (Dec. 2007): at this stage, each data directory contains a sub-directory with basic scripts that can be used in converting the data. These directories are named XXX-in–preparation, where XXX is an abbreviation of the name of the language.
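The back-conversion described above, from a Latin-1 transliteration to Cyrillic text encoded in UNICODE, can be illustrated with a minimal sketch. It is written in Python rather than Perl, and it is not one of the UHLCS scripts: the mapping table, the function names `to_cyrillic` and `convert_latin1_bytes`, and the sample words are all hypothetical. The actual multi-character combinations used in the corpora are those defined in the conversion document, not the ones shown here. The sketch assumes that each Cyrillic letter was written as one or more Latin-1 characters and that the longest matching combination should win.

```python
# Hypothetical transliteration table for illustration only.
# It is NOT the scheme used for the UHLCS corpora.
TRANSLIT_TO_CYRILLIC = {
    "shch": "щ",
    "zh": "ж",
    "ch": "ч",
    "sh": "ш",
    "a": "а",
    "k": "к",
    "l": "л",
    "o": "о",
    "u": "у",
}


def to_cyrillic(text, table=TRANSLIT_TO_CYRILLIC):
    """Replace transliterated sequences with Cyrillic letters.

    Longer keys are tried first, so "shch" is matched before "sh".
    Characters with no entry in the table are copied unchanged.
    """
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping applies at this position: keep the character.
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_latin1_bytes(data):
    """Decode Latin-1 bytes, map them to Cyrillic, and return UTF-8 bytes."""
    return to_cyrillic(data.decode("latin-1")).encode("utf-8")


if __name__ == "__main__":
    print(to_cyrillic("shkola"))
    print(convert_latin1_bytes(b"zhuk"))
```

The table covers lower-case letters only. A real conversion would also need upper-case forms and a way to mark passages that must stay in the Latin alphabet.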
  A description of the system used in converting the corpora of the Uralic languages that were originally written with the Cyrillic alphabet system into the Latin-1 character set is given in the following document:
Line: 147 to 179
 
  1. uralic-lgs
    1. baltic-finnic-lgs
      1. finnish
Changed:
<
<
        • a-contractL Latin-1
>
>
        • a-contract: Latin-1
 
        • b-contract: Latin-1
        • ftc: Latin-1
        • originals-copies-memos
 