Notes on UTF-8 and locales

Paul Heinlein
First published on July 19, 2004
Last updated on March 23, 2005

For some time now, the default shell environments shipped with many Linux distributions have used UTF-8 locales (UTF-8 being one of the standard encodings of Unicode). This can be a bit confusing, especially for those accustomed to the old-style ASCII sorting order.

The starting point for documentation on the issue is the locale(1) man page. The locale command issued without arguments will provide a summary of your current environment.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

For my purposes, two of these environment variables have more impact on my day-to-day work than the others. LC_CTYPE specifies “character classification and case conversion”; in practice, it tells programs which character encoding to expect, so it determines whether non-ASCII text is interpreted and displayed correctly. LC_COLLATE influences sorting order.
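
The listing above also hints at how these variables interact: values printed in double quotes are implied by another variable rather than set explicitly. As I understand the POSIX rules, a non-empty LC_ALL overrides everything, then the category's own variable applies, then LANG, and finally the “C” locale. Here's a minimal sketch of that lookup. The name effective_locale is my own invention for illustration, not a standard command.

```shell
#!/bin/sh
# effective_locale CATEGORY
# Hypothetical helper: prints the locale name that a category such as
# LC_COLLATE resolves to, checking in order:
#   LC_ALL, the category's own variable, LANG, and finally "C".
# CATEGORY must be a plain variable name, since it is passed to eval.
effective_locale() {
  category=$1
  # Indirect lookup of the variable whose name is stored in $category
  eval "own_value=\${$category:-}"
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "$own_value" ]; then
    printf '%s\n' "$own_value"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf '%s\n' C
  fi
}

effective_locale LC_COLLATE
effective_locale LC_CTYPE
```

With the environment shown in the listing, the first call should report C and the second en_US.UTF-8.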

Old-time sorting

If you’re accustomed to ASCII sorting, then the results of ls or sort might initially be confusing in a modern locale. Take a look at the difference the locale makes in these two directory listings. The first uses the old-time raw ASCII sort order. In the second, however, the locale “knows” that ‘C’ and ‘c’ are the same letter and that leading dots shouldn’t influence the sorting order.

$ LC_COLLATE="C" ls -a
.  ..  .CCC  .ccc  AAA  BBB  aaa  bbb
$ LC_COLLATE="en_US" ls -a
.  ..  aaa  AAA  bbb  BBB  .ccc  .CCC

Since I prefer the old sorting order, the first item of business was to alter LC_COLLATE in my shell environment. Setting it to either “POSIX” or “C” gives the desired result, since the two names refer to the same locale. A null value won’t do: an empty LC_COLLATE simply falls back to LANG. I use “C” because that’s what the Fedora init scripts use.

# in my .profile script
LC_COLLATE="C"
export LC_COLLATE
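
To confirm the setting took effect, it helps to run a sort whose result is known in advance. Here's a small sketch using the same names as the directory listings above; sample_names is just a throwaway helper for this example. Keep in mind that a non-empty LC_ALL overrides LC_COLLATE, which is why the reference run sets LC_ALL for that one command.

```shell
#!/bin/sh
# Hypothetical helper: emits the sample file names, one per line
sample_names() {
  printf '%s\n' bbb AAA .ccc aaa BBB .CCC
}

# Reference ordering: force the C locale for this one command
sample_names | LC_ALL=C sort

echo "---"

# Ordering under whatever the current shell environment specifies
sample_names | sort
```

The reference run prints .CCC, .ccc, AAA, BBB, aaa, bbb, matching the first ls listing. If the second run prints the same order, the environment is collating by raw byte value.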

UTF-8 fonts in xterms

The full explanation of getting Unicode characters to display correctly in xterm windows is somewhat lengthy, but a quick-start recipe is pretty easy.

  • Download and save to disk Markus Kuhn’s UTF-8 demo text file.

    wget http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
    
  • Make sure your LC_CTYPE environment variable is set to use UTF-8 locale-specific characters.

    LC_CTYPE="en_US.UTF-8"
    export LC_CTYPE
    
  • Invoke xterm using an ISO 10646-1 typeface.

    xterm -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
    
  • Take a peek at the test file in a Unicode-capable application like less.

    less UTF-8-demo.txt
    
  • Test your local man installation if you’d like to see larger blocks of text.

    # Greek's always fun
    LANG=el_GR.UTF-8 man man
    
    # German is widely supported
    LANG=de_DE.UTF-8 man man
    
    # As is Spanish
    LANG=es_ES.UTF-8 man man
    
    # How about something more exotic, like Hebrew or Korean?
    LANG=he_IL.UTF-8 man man
    LANG=ko_KR.UTF-8 man man
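
If the demo file shows up as gibberish, one thing worth checking is whether the shell really ended up in a UTF-8 locale. The sketch below relies on locale charmap, which prints the encoding name for the current LC_CTYPE setting; describe_charmap is a made-up helper name, and I'm assuming the name is spelled UTF-8, as glibc systems report it.

```shell
#!/bin/sh
# describe_charmap NAME
# Hypothetical helper: reports whether a charmap name denotes UTF-8.
describe_charmap() {
  case "$1" in
    UTF-8) echo "UTF-8 is active" ;;
    "")    echo "could not determine the charmap" ;;
    *)     echo "not UTF-8 (charmap is $1)" ;;
  esac
}

# Ask the system which encoding the current locale settings resolve to
describe_charmap "$(locale charmap 2>/dev/null)"
```

A locale that is named in LC_CTYPE but not installed typically falls back to the C locale, so this check can also catch a misspelled or missing locale.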
    

Testing UTF-8 support with GNU date

Here’s a little script that’ll print the locale-specific names of all the days and months for all the UTF-8 locales available on your system. It’ll allow you to see the locales for which you do and don’t have local font support.

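#!/bin/bash
# Print localized day and month names for every installed UTF-8 locale.
# Requires GNU date for the -d option.
# Some systems list these locales as "utf8", others as "UTF-8".
for loc in $(locale -a | grep -iE 'utf-?8' | LC_ALL=C sort); do
  echo "Locale: $loc"
  # Aug 1, 2004 was a Sunday, Aug 7 a Saturday
  for n in $(seq 1 7); do
    # LC_ALL takes precedence over LANG and over any LC_TIME already set
    LC_ALL="$loc" date -d "2004/8/${n}" +"%A (%a)"
  done
  for n in $(seq 1 12); do
    LC_ALL="$loc" date -d "2004/${n}/1" +"%B (%b)"
  done
  echo
done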

You might also try saving the script’s output to a file and then viewing that file with a web browser. On many of my systems, the browsers have better UTF-8 support than xterm and its system font.
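
A browser may guess the wrong encoding for a bare text file, so it can help to wrap the output in a minimal HTML page that declares UTF-8. Here's one way that might look. Both to_html_page and the utf8-dates.sh file name are assumptions made for this example.

```shell
#!/bin/sh
# to_html_page
# Hypothetical helper: reads plain text on stdin and writes a minimal HTML
# page that declares UTF-8, so the browser does not have to guess.
to_html_page() {
  printf '%s\n' '<!DOCTYPE html>' \
    '<html><head><meta charset="utf-8">' \
    '<title>Locale date names</title></head><body><pre>'
  # Escape the three characters that are special in HTML text
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
  printf '%s\n' '</pre></body></html>'
}

# Usage, assuming the date script was saved as ./utf8-dates.sh:
#   ./utf8-dates.sh | to_html_page > utf8-dates.html
```

The pre element keeps the one-name-per-line layout of the script's output intact.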

Useful links

A great starting place for UTF-8/Linux information is Markus Kuhn’s UTF-8 and Unicode FAQ for Unix/Linux. Markus is also the author of the helpful unicode(7) and utf-8(7) man pages that are found on many Linux systems.

Other helpful pages include Using UTF-8 with Gentoo, The Unicode HOWTO at the Linux Documentation Project, and Jan Stumpel’s UTF-8 on Linux.
