Using Unicode (UTF-8) on Hippus

Unicode is becoming the preferred character encoding for linguistic data. For a general discussion of moving over to Unicode and the various aspects to be considered, see When to Convert to Unicode at the SIL site.

Using Unicode instead of ISO-8859-1 (Latin1) on Hippu servers may require adjustments, choices and settings in order to work properly. Some of these steps are listed below; where the recipe is not trivial, it is dealt with on a separate page.

Fonts at the workstation

The workstation should have fonts and facilities for displaying (sufficient subsets of) Unicode characters while editing and processing Unicode materials on Hippu servers.

Q: Which fonts are capable of displaying all or sufficient subsets of Unicode characters on a Windows workstation? Can they be used on particular terminal programs such as PuTTY or using WinAxe?

Keyboard driver of the workstation

Some keyboard drivers can produce useful subsets of Unicode characters in a uniform way on a Windows workstation, such as the letters of all European Latin-based scripts including letters with diacritics. See the Kotoistus keyboard at http://kotoistus.fi/nappaimisto.htm for a keyboard layout and a driver for entering such characters. A driver like this makes the larger character set available across applications on the workstation.

Q: Is there a way to enter European Latin-based characters through a terminal program into applications (such as an editor) running on the Hippu servers?

A: On a Windows XP machine, one can load a test driver (Microsoft Suomalaisen monikielisen näppäimistön arviointiversio) from the Microsoft pages, see the link at http://www.kotoistus.fi. The same Microsoft page has instructions for installing the keyboard driver. A similar driver is available for Linux, see the pointers at the same Kotoistus page. If PuTTY is in UTF-8 mode and the Unix/Linux server has been set up with e.g. export LC_CTYPE=fi_FI.utf8, then the keyed characters should reach the server as Unicode UTF-8 characters, and the UTF-8 characters sent by the server should appear as correct glyphs in the PuTTY window. (Please check that the keyboard driver stays in effect, and reselect it if necessary.) Using the Emacs editor in UTF-8 mode should succeed if the keyboard, file and display settings are all for UTF-8 processing; see below for instructions for Emacs and Unicode.

Lacking a suitable keyboard driver, there are other ways to enter Unicode characters. On workstations, there may be ways to enter Unicode characters by their numeric values. In Emacs and other editors, there are ways to define key combinations (and keymaps) for producing the characters of particular alphabets.

Q: How can one enter any Unicode character on a Windows workstation (i.e. even those which do not have a specific key combination on a driver)?

A: Open the Windows Start menu at the lower left of your screen and select All Programs » Accessories » System Tools » Character Map. (Cf. http://tlt.its.psu.edu/suggestions/international/bylanguage/ipa.html)

A: In an Emacs document operating over PuTTY in UTF-8 mode on Angarak, use the command M-x ucs-insert to insert Unicode characters. When writing combining diacritics, the Emacs window is hard to follow with the naked eye, so you might have to start counting keystrokes when you want to insert further glyphs on the same line. One workaround is to enter a line break after any word with a combining diacritic, provided you are set up to parse the line breaks.

Q: How do I get the Kotoistus keyboard to work in a PuTTY session in UTF-8 mode on Angarak for the glyphs that do not occur in Finnish, French or German? Most consonant glyphs are not available. There are problems with the Kotoistus keyboard. If I write all the special glyphs in Notepad and then copy and paste them into an open Emacs session over PuTTY in UTF-8 mode on Angarak, the only problem is that my PuTTY font fails to show some combinations with caron, dot above, macron, ogonek, and stroke. If, however, I attempt to type the glyphs directly in the same session, I have two problems: (1) the base letter is sent, but no diacritic is combined with it; (2) some glyphs show up in my PuTTY window as question marks "?", which must indicate that they have not been transferred over PuTTY. See results at http://www.ling.helsinki.fi/~rueter/TESTKotoistusKeyboard.xhtml -- JackRueter - 2009-04-10

Terminal program at the workstation of the user

If one uses Microsoft Windows and PuTTY as the terminal program, there seems to be a way to display UTF-8 characters sent by the server: select UTF-8 under Window » Translation in Change Settings. This seems to work for version 0.60 of PuTTY but probably not for version 0.58.

Sending UTF-8 characters from the workstation to the server seems more problematic with PuTTY, which does not seem to promise support for keying in Unicode characters in character mode (is this true?).

By using e.g. WinAxe for running an X-Window program on the server with its window on the workstation, one might succeed better. (But the Emacs on WinAxe does not display ¸ correctly whereas PuTTY does in UTF-8 mode.)

On a Macintosh, you can try this hint from http://www.macosxhints.com.

Converting ISO Latin-1 file into UTF-8

Use the GNU/Linux program iconv. Run iconv --help to see the parameters and usage, and iconv --list to see the available encodings (LATIN1 and UTF8 are the commonly used ones).
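As a minimal sketch of such a conversion (the file names and the scratch directory are made up for illustration, and ISO-8859-1 is assumed to be the true encoding of the input), the following creates a small Latin-1 file and converts it to UTF-8:

```shell
# Work in a scratch directory so that no existing files are overwritten.
workdir=$(mktemp -d)

# Create a small Latin-1 sample: "päivä" with "ä" as the single byte 0xE4 (octal 344).
printf 'p\344iv\344\n' > "$workdir/sample-latin1.txt"

# Convert from ISO-8859-1 to UTF-8; the original file is left untouched.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/sample-latin1.txt" > "$workdir/sample-utf8.txt"

# Each "ä" now takes two bytes (0xC3 0xA4), so the file grows from 6 to 8 bytes.
wc -c < "$workdir/sample-latin1.txt"
wc -c < "$workdir/sample-utf8.txt"
```

Note that iconv trusts the encoding named with -f. Running it on a file that is already UTF-8 produces doubly encoded text rather than an error, so it is worth checking the input first.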

Editor at the server

Emacs version 23, which is now the default Emacs on Hippus, handles Unicode (UTF-8) encoding nicely. Earlier versions were trickier and required more tuning. Emacs has very good capabilities for working with corpora. See the separate page for Emacs and Unicode UTF-8.

Q: Are there real competitors available?

A: Maybe Yudit could be used, see http://www.yudit.org, or mined http://www.towo.net/mined/.

A: Vim editor (http://www.vim.org) is capable of handling UTF-8 encoded files.

The operating system configuration

On a Linux server (such as Hippu), the character encoding used by default is determined by the settings of environment variables. Linux and Bash are Unicode enabled, but some settings may still be necessary.

Use export LC_CTYPE=fi_FI.utf8 to tell the server to use Unicode UTF-8 encoding.
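A short sketch of how one might check the result in the current shell session. It assumes that the fi_FI.utf8 locale is actually installed on the server; if it is not, the shell and the locale command print warnings instead.

```shell
# List the installed locales whose names mention UTF-8
# (spelled utf8 or UTF-8, depending on the system).
locale -a | grep -i -E 'utf-?8' || echo "no UTF-8 locales found"

# Select UTF-8 character handling for this shell session.
export LC_CTYPE=fi_FI.utf8

# Show all locale settings now in effect, then the character encoding programs will see.
locale
locale charmap
```

If LC_ALL is set, it takes precedence over LC_CTYPE, so an unexpected charmap is often explained by a leftover LC_ALL value.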

Set the UTF-8 mode at the PuTTY terminal program in order to enter and display Unicode characters correctly (or as correctly as possible) in the PuTTY window.

To use less with Unicode UTF-8 files, set export LESSCHARSET=utf-8.

The following lines in ~/.bashrc will tune the terminal and software for UTF-8 (unless the default settings are already correct):

export LESSCHARSET=utf-8
export LC_ALL=fi_FI.utf-8
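To see whether the whole chain (server settings, terminal program, font) handles UTF-8, one can print a few known characters from the server. This is only a sketch: the choice of sample characters is arbitrary, and the variable name sample is made up for illustration.

```shell
# Build "ä ö å š ž" from explicit UTF-8 byte sequences (octal escapes),
# so the check does not depend on how this script itself is encoded.
sample=$(printf '\303\244 \303\266 \303\245 \305\241 \305\276')

# Print the sample; in a correctly configured terminal it shows five letters.
printf '%s\n' "$sample"
```

If the letters appear as pairs of odd characters, the terminal is most likely still interpreting the bytes as Latin-1. If they appear as question marks or boxes, the font probably lacks the glyphs.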

If one has to revert to eight-bit coding on the command line, one needs to set these environment variables accordingly, e.g. to fi_FI.iso88591. (Setting other variables such as TERM, LC_ALL, LANG, LANGUAGE or LC_CTYPE may sometimes help.)
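A sketch of what reverting might look like for the two variables used above. It assumes that the fi_FI.iso88591 locale is installed and that latin1 is the appropriate character set name for less (it is one of the names listed in the less manual). The terminal program must be switched back from UTF-8 as well.

```shell
# Switch the locale of this shell session back to eight-bit Latin-1.
export LC_ALL=fi_FI.iso88591

# Tell less to treat files as Latin-1 instead of UTF-8.
export LESSCHARSET=latin1
```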

Topic revision: r18 - 2012-03-13 - KimmoKoskenniemi
 