Notes on UTF-8 and locales
Paul Heinlein
First published on July 19, 2004
Last updated on March 23, 2005
For some time now, the default shell environments shipped with many Linux distributions have used UTF-8 (an encoding of Unicode) locale information. This can be a bit confusing, especially for those accustomed to the old-style ASCII sorting order.
The starting point for documentation on the issue is the locale(1) man page. The locale command issued without arguments will provide a summary of your current environment.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
For my purposes, two of these environment variables have more impact on my day-to-day work than the others. LC_CTYPE specifies “character classification and case conversion,” in other words, how bytes are interpreted as characters, and so whether programs treat text as UTF-8. LC_COLLATE influences sorting order.
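The quoted values in the locale output above are inherited rather than set explicitly: each category takes its value from LC_ALL if that is set, otherwise from its own variable, otherwise from LANG. A minimal sketch of that lookup follows. The function name effective_locale is hypothetical, and the final POSIX fallback is an assumption about the implementation default.

```shell
# effective_locale CATEGORY
# Print the locale name that applies to CATEGORY (e.g. LC_COLLATE),
# following the precedence LC_ALL, then the category variable, then LANG.
# (Hypothetical helper, for illustration only.)
effective_locale() {
    category=$1
    # Read the variable whose name is stored in $category.
    eval "value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: nothing set means the implementation default.
        printf 'POSIX\n'
    fi
}

effective_locale LC_COLLATE
effective_locale LC_CTYPE
```

With the environment shown above, this should report C for LC_COLLATE and en_US.UTF-8 for LC_CTYPE.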
Old-time sorting
If you’re accustomed to ASCII sorting, then the results of ls or sort might initially be confusing in a modern locale. Take a look at the difference the locale makes in these two directory listings. The first uses the old-time raw ASCII sort order. In the second, however, the locale “knows” that ‘C’ and ‘c’ are the same letter and that leading dots shouldn’t influence the sorting order.
$ LC_COLLATE="C" ls -a
. .. .CCC .ccc AAA BBB aaa bbb
$ LC_COLLATE="en_US" ls -a
. .. aaa AAA bbb BBB .ccc .CCC
Since I prefer the old sorting order, my first item of business was to alter LC_COLLATE in my shell environment. Setting it to either “POSIX” or “C” gives the result I want; an empty value does not, because an unset or empty LC_COLLATE simply falls back to LANG. I use “C” because that’s what the Fedora init scripts use.
# in my .profile script
LC_COLLATE="C"
export LC_COLLATE
UTF-8 fonts in xterms
The full explanation of getting Unicode characters to display correctly in xterm windows is somewhat lengthy, but a quick-start recipe is pretty easy.
- Download and save to disk Markus Kuhn’s UTF-8 demo text file.
  wget http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
- Make sure your LC_CTYPE environment variable is set to use UTF-8 locale-specific characters.
  LC_CTYPE="en_US.UTF-8"
  export LC_CTYPE
- Invoke xterm using an ISO 10646-1 typeface.
  xterm -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
- Take a peek at the test file in a Unicode-capable application like less.
  less UTF-8-demo.txt
- Test your local man installation if you’d like to see larger blocks of text.
  # Greek's always fun
  LANG=el_GR.UTF-8 man man
  # German is widely supported
  LANG=de_DE.UTF-8 man man
  # As is Spanish
  LANG=es_ES.UTF-8 man man
  # How about something more exotic, like Hebrew or Korean?
  LANG=he_IL.UTF-8 man man
  LANG=ko_KR.UTF-8 man man
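Before starting xterm it can help to confirm that the shell really is in a UTF-8 locale. Here is a small sketch using locale charmap, which reports the character encoding of the current locale. The is_utf8_charmap helper is hypothetical, and the list of spellings it accepts is an assumption about what different systems report.

```shell
# is_utf8_charmap NAME: succeed if NAME looks like a UTF-8 charmap name.
# (Hypothetical helper; the accepted spellings are an assumption.)
is_utf8_charmap() {
    case $1 in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# "locale charmap" prints the encoding of the current locale;
# treat a missing or failing locale command as "unknown".
charmap=$(locale charmap 2>/dev/null) || charmap=""
if is_utf8_charmap "$charmap"; then
    echo "UTF-8 locale active"
else
    echo "not a UTF-8 locale (charmap: ${charmap:-unknown})"
fi
```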
Testing UTF-8 support with GNU date
Here’s a little script that’ll print the locale-specific names of all the days and months for all the UTF-8 locales available on your system. It’ll allow you to see the locales for which you do and don’t have local font support.
#!/bin/bash
LANG=C
for loc in $(locale -a | grep utf8 | sort); do
    echo "Locale: $loc"
    # Aug 1, 2004 was a Sunday, Aug 7 a Saturday
    for n in $(seq 1 7); do
        LANG="$loc" date +"%A (%a)" -d 2004/8/${n}
    done
    for n in $(seq 1 12); do
        LANG="$loc" date +"%B (%b)" -d 2004/${n}/1
    done
    echo
done
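The grep utf8 filter matches the spelling that glibc uses in locale -a output (for example en_US.utf8); some other systems list the same locales as en_US.UTF-8. A slightly more tolerant filter might look like this sketch, in which filter_utf8_locales is a hypothetical name:

```shell
# filter_utf8_locales: keep only UTF-8 locale names from stdin,
# accepting both the "utf8" and "UTF-8" spellings.
filter_utf8_locales() {
    grep -i -E 'utf-?8'
}

# Prints nothing if "locale" is unavailable or no names match.
locale -a 2>/dev/null | filter_utf8_locales | sort || true
```

It could stand in for the grep utf8 stage of the loop above without changing anything else.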
You might also try saving the script’s output to a file and then viewing that file with a web browser. On many of my systems, the browsers have better UTF-8 support than xterm and its system font.
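One way to do that, as a sketch: wrap the output in a minimal HTML page that declares its encoding, so the browser does not have to guess. The wrap_html helper and the file names days.sh and days.html are hypothetical.

```shell
# wrap_html: read plain text on stdin, write a minimal UTF-8 HTML page.
# (Hypothetical helper, for illustration only.)
wrap_html() {
    printf '%s\n' '<!DOCTYPE html>' '<html><head><meta charset="utf-8">' \
        '<title>Locale test</title></head><body><pre>'
    # Escape the characters that are special in HTML (& must come first).
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
    printf '%s\n' '</pre></body></html>'
}

# Hypothetical usage, assuming the script above is saved as days.sh:
#   bash days.sh | wrap_html > days.html
```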
Useful links
A great starting place for UTF-8/Linux information is Markus Kuhn’s UTF-8 and Unicode FAQ for Unix/Linux. Markus is also the author of the helpful unicode(7) and utf-8(7) man pages that are found on many Linux systems.
Other helpful pages include Using UTF-8 with Gentoo, The Unicode HOWTO at the Linux Documentation Project, and Jan Stumpel’s UTF-8 on Linux.