Mi blog lah! Το ιστολόγιό μου

14May/0816

Should UI strings in source code have non-ASCII characters?

There is a discussion going on at desktop-devel about whether the UI strings in the source code should also have non-ASCII characters. For example, should typical strings with double-quotes have those fancy Unicode double quotes?

printf(_("Could not find file “%s”n"));

instead of

printf(_("Could not find file "%s"n"));

The general view from the replies is to go ahead and add those nice Unicode characters.

Actually, there are UI messages already with non-ASCII characters (the ellipsis character, …) in GNOME 2.22:

  1. glade3
  2. epiphany

In GNOME 2.24, there are even more (with ellipsis):

  1. gucharmap
  2. epiphany
  3. gnome-terminal
  4. gedit
  5. glade3

Regarding the fancy Unicode double quotes, there are UI strings in GNOME 2.22 (same list for 2.24) in the following packages:

  1. evince
  2. cheese
  3. epiphany
  4. eog
  5. gnome-doc-utils

What are the arguments against having non-ASCII characters in UI strings?

  1. There might be systems that still use 8-bit legacy encodings. In this case, the UTF-8 encoded may not be displayed properly. However, when I tried to demonstrate this on my system (Ubuntu 8.04), I failed miserably. I downloaded a small GTK2 text editor (called tea), I changed a source UI string to include “” and ellipsis, compiled and installed. I then opened a shell, set LANG to POSIX (or C), and ran the text editor. The UI message was proper Unicode and I could even type non-ASCII in the text editor. I resorted to changing a system locale (I picked en_IN) to ISO-8859-1, then logged out. In the login screen it did not show the 8-bit encoding. If someone has a proper legacy 8-bit encoding system with GNOME (OpenBSD, FreeBSD, etc), could you please try it out?
  2. As Alan Cox mentioned in the thread, the canonical way to deal with UI strings in the source code should be to keep as ASCII, and put any fancy Unicode characters in the translation files (even for en_US, get an en_US translation file).

Is GNOME (or components) used in a legacy 7-bit/8-bit environment?

If there is any reason to keep UI strings in the source code as plain ASCII, speak now, or the Unicode flood gates are about to open.

Update 16 May 2008:There is a document at the ISO/IEC 9899 website (C programming language), that mentions the issue of character sets in C. It is http://www.open-std.org/jtc1/sc22/wg14/www/docs/C99RationaleV5.10.pdf.

On page 26, section 5.2.1, it says

The C89 Committee ultimately came to remarkable unanimity on the subject of character set requirements. There was strong sentiment that C should not be tied to ASCII, despite its heritage and despite the precedent of Ada being defined in terms of ASCII. Rather, an implementation is required to provide a unique character code for each of the printable graphics used by C, and for each of the control codes representable by an escape sequence. (No particular graphic representation for any character is prescribed; thus the common Japanese practice of using the glyph “¥” for the C character “” is perfectly legitimate.) Translation and execution environments may have different character sets, but each must meet this requirement in its own way. The goal is to ensure that a conforming implementation can translate a C translator written in C.

For this reason, and for economy of description, source code is described as if it undergoes the same translation as text that is input by the standard library I/O routines: each line is terminated by some newline character regardless of its external representation.

With the concept of multibyte characters, “native” characters could be used in string literals and character constants, but this use was very dependent on the implementation and did not usually work in heterogenous environments. Also, this did not encompass identifiers.

It then goes on with an addition to C99:

A new feature of C99: C99 adds the concept of universal character name (UCN) (see §6.4.3) in order to allow the use of any character in a C source, not just English characters. The primary goal of the Committee was to enable the use of any “native” character in identifiers, string literals and character constants, while retaining the portability objective of C.

Both the C and C++ committees studied this situation, and the adopted solution was to introduce a new notation for UCNs. Its general forms are unnnn and Unnnnnnnn, to designate a given character according to its short name as described by ISO/IEC 10646. Thus, unnnn can be used to designate a Unicode character. This way, programs that must be fully portable may use virtually any character from any script used in the world and still be portable, provided of course that if it prints the character, the execution character set has representation for it.

Of course the notation unnnn, like trigraphs, is not very easy to use in everyday programming; so there is a mapping that links UCN and multibyte characters to enable source programs to stay readable by users while maintaining portability. Given the current state of multibyte encodings,
10 this mapping is specified to be implementation-defined; but an implementation can provide the users with utility programs that do the conversion from UCNs to “native” multibytes or vice versa, thus providing a way to exchange source files between implementations using the UCN notation.

Update 7 Aug 2008: According to PEP 8, Style Guide for Python Code, under Encodings, says

    For Python 3.0 and beyond, the following policy is prescribed for
    the standard library (see PEP 3131): All identifiers in the Python
    standard library MUST use ASCII-only identifiers, and SHOULD use
    English words wherever feasible (in many cases, abbreviations and
    technical terms are used which aren't English). In addition,
    string literals and comments must also be in ASCII. The only
    exceptions are (a) test cases testing the non-ASCII features, and
    (b) names of authors. Authors whose names are not based on the
    latin alphabet MUST provide a latin transliteration of their
    names.

    Open source projects with a global audience are encouraged to
    adopt a similar policy.

(Emphasis mine)