Emacs and UTF-8

Q: How do I display, edit and save UTF-8 files in Emacs?

Answer 1: If you consistently use UTF-8 encoding in files, proceed as follows:

  • The current Emacs 23.1 uses Unicode as the internal representation of characters, and is quite capable of handling UTF-8 encoded data. You can tell Emacs to use UTF-8 by adding the following three lines to your ~/.emacs file. The first line tells Emacs to prefer UTF-8 when reading and saving files, the second line informs Emacs that the characters from your keyboard arrive in UTF-8, and the third line tells Emacs to send the text it displays to the terminal in UTF-8:
    (prefer-coding-system 'utf-8-unix)
    (set-keyboard-coding-system 'utf-8-unix)
    (set-terminal-coding-system 'utf-8-unix)
  • You should see uuu at the left end of the Emacs buffer status line if the three facets of UTF-8 encoding are in effect in the buffer.
  • In addition, your terminal emulator program (e.g. Xterm, PuTTY) must be in UTF-8 mode, i.e. it must send characters to the server in UTF-8 encoding and correctly display the UTF-8 characters sent by Emacs. In Ubuntu Linux, for example, the terminal program is in UTF-8 mode by default. On Windows, you probably use PuTTY as the terminal program, and you must set Window > Translation to UTF-8. The characters you type on the keyboard will then be sent to Emacs in UTF-8, and the UTF-8 characters Emacs sends to your screen will be displayed correctly. If your "ä" characters look something like "Ã¤", the translation setting is probably ISO-8859-1 (Latin-1), and changing the translation to UTF-8 will correct the problem. (Note that PuTTY may accept any UTF-8 codes when you copy and paste, but your keyboard may still not pass all key combinations properly through PuTTY, even if you can produce the characters in other applications.)
  • Nowadays, it is probably a good idea to use Unicode consistently as the character set in your terminal program and UTF-8 as the default character encoding for files on Hippu. With these settings, any new files you create with Emacs are in UTF-8. Emacs 23.1.1 is quite good at detecting the coding used in files with Finnish text, and it adjusts the coding accordingly. In addition, you may control the coding of a file by adding
    -*- coding: utf-8-unix -*- or -*- coding: latin-1-unix -*-
    on the first line of the file.
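
To check that the three settings above have taken effect without relying on the status line, you can ask Emacs directly. The following is only a sketch: the command name my-show-coding-systems is hypothetical (it is not part of Emacs), while buffer-file-coding-system, keyboard-coding-system and terminal-coding-system are standard Emacs names.

```elisp
;; Hypothetical helper (not part of Emacs): report the three coding
;; systems discussed above in the echo area.
(defun my-show-coding-systems ()
  "Show the file, keyboard and terminal coding systems in use."
  (interactive)
  (message "file: %s  keyboard: %s  terminal: %s"
           buffer-file-coding-system    ; coding used when saving this buffer
           (keyboard-coding-system)     ; coding assumed for keyboard input
           (terminal-coding-system)))   ; coding used for terminal output
```

After adding this to ~/.emacs, run M-x my-show-coding-systems in the buffer you are editing. In a session set up as in Answer 1, all three values should name a UTF-8 coding system.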

Answer 2: If you want to use both Latin-1 coded and UTF-8 coded files in a mixture, then you should try the following.

  • Use two saved terminal connections in PuTTY, e.g. one called Hippu-Latin1 (with ISO-8859-1 encoding) and the other Corpus-UTF-8 (with UTF-8 encoding).
  • Insert the following lines in your ~/.emacs file:
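    (defun utf8 ()
      "Prefer UTF-8 for files and use it for terminal output and keyboard input."
      (interactive)
      (prefer-coding-system 'utf-8-unix)
      (set-terminal-coding-system 'utf-8-unix)
      (set-keyboard-coding-system 'utf-8-unix))

    (defun latin1 ()
      "Prefer Latin-1 for files and use it for terminal output and keyboard input."
      (interactive)
      (prefer-coding-system 'latin-1-unix)
      (set-terminal-coding-system 'latin-1-unix)
      (set-keyboard-coding-system 'latin-1-unix))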
  • These definitions do nothing until you activate one of them with the command M-x utf8 or M-x latin1. Depending on the type of connection and the coding you want to use in files, you can thus set all three modes with a single command.
  • In order to tell Emacs that the file encoding of a particular file is UTF-8, you can insert a comment line, e.g. like the following one, at the top of every UTF-8 encoded file:
   !                       -*- coding: utf-8-unix -*-
  • Note that the way comments are marked depends on the type of the file. When you first create a new file, its coding still follows the default (i.e. Latin-1). Thus, you have to save (C-x C-s) the file consisting of the comment line and then reload it (e.g. with C-x C-v). You should then see uuu at the lower left corner of the status line. Later on, when you load the file for further editing, the file encoding should be set to UTF-8 automatically. In subsequent sessions you must, of course, remember to use the UTF-8 version of the terminal connection and to re-execute the M-x utf8 command.
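
As a possible alternative to the save-and-reload step above, the coding of the current buffer can also be set explicitly before the first save. This is a sketch and not part of the recipe above: the command name my-save-as-utf8 is hypothetical, while set-buffer-file-coding-system is the standard Emacs command normally bound to C-x RET f.

```elisp
;; Hypothetical helper (not part of Emacs): make the current buffer
;; save as UTF-8 with Unix line endings, whatever the session default is.
(defun my-save-as-utf8 ()
  "Set the coding system used when saving the current buffer to UTF-8."
  (interactive)
  (set-buffer-file-coding-system 'utf-8-unix))
```

Typing C-x RET f utf-8-unix RET by hand has the same effect. The coding comment on the first line is still useful, because it tells Emacs how to decode the file the next time it is loaded.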


The above reasoning and instructions are based on the following understanding, which you are urged to comment on and correct if you think it is wrong:

  • If one wants to see the Unicode characters (beyond Latin-1) on the PuTTY terminal, the UTF-8 translation has to be chosen. It is not convenient to change it during sessions.
  • After setting (set-keyboard-coding-system 'utf-8-unix), Emacs will receive UTF-8 encoded characters and translate them to whatever character encoding a buffer happens to use.
  • After setting (set-terminal-coding-system 'utf-8-unix), Emacs will use UTF-8 encoding when sending the characters to be displayed on the terminal, and this does not depend on the file encoding of the data.
  • Thus, the terminal coding system of Emacs can and should always be UTF-8, if the terminal program is in UTF-8 mode.
  • The keyboard coding system of Emacs can and should always be UTF-8 if the terminal program is in UTF-8 mode.
  • There are just two consistent combinations of terminal translation, Emacs keyboard coding and Emacs terminal coding: all UTF-8 or all Latin-1. Otherwise, typing an "ä" might enter the two characters "Ã¤" into the Emacs buffer, or characters in the Emacs buffer could be displayed incorrectly.
  • Any file which we want to keep in Latin1 encoding should have as its first line a comment including -*- coding: latin-1-unix -*-. This is never harmful, and helps when the default mode is different.
  • Any file which we want to keep in UTF-8 encoding should have as its first line a comment including -*- coding: utf-8-unix -*-. This is never harmful, and helps when the default mode is different.
  • We can handle both UTF-8 and Latin1 files during a single session of Emacs using the recipe in Answer 2.
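
The two-character effect mentioned above can be reproduced inside Emacs itself. The expressions below are a small illustration that can be evaluated in the *scratch* buffer (e.g. with C-x C-e after each closing parenthesis); they use only the standard functions encode-coding-string and decode-coding-string, and the character ä is written as the escape \u00e4 so that the example does not depend on the terminal settings.

```elisp
;; Encode "ä" (U+00E4) as UTF-8, which gives the two bytes C3 A4,
;; then misread those bytes as Latin-1: each byte becomes one character.
(decode-coding-string (encode-coding-string "\u00e4" 'utf-8) 'latin-1)
;; evaluates to "Ã¤"

;; Reading the same two bytes as UTF-8 restores the original character.
(decode-coding-string (encode-coding-string "\u00e4" 'utf-8) 'utf-8)
;; evaluates to "ä"
```

This is the mismatch that occurs when the terminal sends UTF-8 but the Emacs keyboard coding system is Latin-1.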

(Who can find a better solution?)

-- KristerLinden - 11 Jan 2008, -- KimmoKoskenniemi - 2008-12-01, -- JackRueter - 2009-04-10

Topic revision: r8 - 2010-02-26 - KimmoKoskenniemi
 