Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

Why Not Unicode in Identifiers?

Earlier today, Graham Lee linked on Twitter to a piece by Poul-Henning Kamp about the “tyranny of ASCII” in programming language syntax.

Kamp’s contention is that we should be free to use (for instance) “Dentistry symbol light down and horizontal with wave” (U+23C7, ‘⏇’ if your browser has it) as an identifier in a program if we so choose. Or, perhaps more reasonably, Ω₀.

It’s certainly an appealing idea, especially to anyone who has ever attempted to implement a mathematical algorithm, or even an otherwise non-mathematical algorithm that has come from an academic paper (papers of that kind tend to use mathematical notation).

It is, perhaps, less than obvious what dangers await the unwary in this area; I think by now many people are familiar with the confusability of various glyphs, but perhaps it is not so obvious that e.g. ‘a’ and ‘а’ are in fact entirely different characters (the second is Cyrillic). Nor is it obvious to the uninitiated that ‘é’ differs in any way from ‘é’ (one is the single precomposed character U+00E9, the other a plain ‘e’ followed by a combining acute accent, U+0301), or that care must be taken when comparing strings as a result, let alone more problematic character equivalences like ‘ß’ and “sz”/“ss”.
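
To make this concrete, here is a minimal sketch (in Python, purely for illustration; only the standard unicodedata module is used) showing that these look-alikes really are distinct to the machine, and that even “the same” accented letter can be spelled in more than one way:

```python
import unicodedata

# Latin 'a' (U+0061) versus Cyrillic 'а' (U+0430): near-identical glyphs in
# many fonts, but entirely different characters.
latin_a, cyrillic_a = "a", "\u0430"
print(latin_a == cyrillic_a)             # False
print(unicodedata.name(cyrillic_a))      # CYRILLIC SMALL LETTER A

# 'é' can be the single precomposed character U+00E9, or a plain 'e' followed
# by a combining acute accent (U+0301); the two only compare equal after
# Unicode normalisation (NFC here).
precomposed, decomposed = "\u00e9", "e\u0301"
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# 'ß' has no one-to-one case mapping: full case folding turns it into "ss".
print("ß".casefold())                                # ss
print("straße".casefold() == "strasse".casefold())   # True
```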

If we are going to propose allowing Unicode in identifiers, then, we need to specify:

  • Which code points are and are not going to be allowed? e.g. Do we allow combining marks? Spacing modifier letters? Private use characters?

  • Do we allow bi-directional text? What about embedded bidi? If so, we can support Hebrew and Arabic, but matching identifiers is going to get complicated really quickly (the characters might be in either order, depending on the Unicode bidi rules).

  • Do we allow identifiers to consist of characters from different scripts? For instance, is “аnd” a valid identifier? It isn’t the same as “and”…

  • How are the compiler and linker going to determine a symbol table match? Does “æ” match “ae”? Does “ï” match “i”? What about “ß” and “ss”? (See the sketch after this list.)

  • What about the system linker (probably most important on systems with dynamic linking support)? Name mangling might be a solution, but we get exactly one chance to get that right before creating ABI compatibility problems. C++ didn’t do so well at that, as I recall.
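
To give a feel for what answering these questions involves, here is a rough sketch of the kinds of checks a compiler front end would need. It is Python again, purely illustrative: this is not the rule set of any real language, and the script detection below is a crude approximation, since unicodedata exposes no script property.

```python
import unicodedata

def classify(ch):
    """Return the Unicode general category and bidi class of a code point."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)

# A few of the cases asked about above:
print(classify("\u0301"))   # ('Mn', 'NSM')  combining acute accent
print(classify("\u02b0"))   # ('Lm', 'L')    a spacing modifier letter
print(classify("\ue000"))   # ('Co', 'L')    a private use character
print(classify("\u05d0"))   # ('Lo', 'R')    Hebrew alef, right-to-left

def rough_script(ch):
    """Crude script guess taken from the character name; illustration only."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def mixes_scripts(identifier):
    """True if an identifier mixes letters from more than one script."""
    return len({rough_script(c) for c in identifier if c.isalpha()}) > 1

print(mixes_scripts("and"))         # False: all Latin
print(mixes_scripts("\u0430nd"))    # True: Cyrillic 'а' plus Latin 'nd'

def symbol_key(identifier):
    """One possible canonical form for symbol-table lookups: compatibility
    decomposition plus case folding.  With this choice 'ß' matches 'ss',
    but 'ï' does not match 'i' and 'æ' does not match 'ae'; a different
    choice gives different answers, and it has to be fixed once, for ever."""
    return unicodedata.normalize("NFKD", identifier).casefold()

print(symbol_key("ß") == symbol_key("ss"))   # True
print(symbol_key("ï") == symbol_key("i"))    # False
print(symbol_key("æ") == symbol_key("ae"))   # False
```

Whatever answers a language picks, they all have to be written down precisely in the specification, and the compiler, linker and every other tool in the chain have to agree on them.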

Additionally, because different people have different fonts on their systems, and not all code points necessarily have glyphs in all of (or even any of) those fonts, we perhaps need to think about what will happen if a developer opens a source file on a machine that is lacking some of the glyphs needed to render an identifier. How is that displayed to them? Is it just a matter for their text editor? Maybe so, but until text editors have a good way to deal with this situation, it remains a real problem for the developer rather than just for their tools.

Another concern is that, particularly in the case of Arabic, Indic and Far-Eastern languages, just typing “characters” on your keyboard can be quite an involved process. Certainly speakers of those languages may be familiar with the input methods they need to use, but that is not true generally, and if we permit identifiers containing such characters we risk fragmenting the developer community along natural language boundaries: people will write code that can only easily be used by a native user of their particular script. ASCII may be a poor subset of Unicode, but it’s very unlikely that any computer user anywhere on the planet is unable to type the majority of its characters. Contrast that with (say) allowing CJK characters in code, which would relegate anyone unfamiliar with the script to cut and paste (it isn’t even feasible in that particular case to locate the character by looking through a Unicode character palette, since there are many thousands of different CJK characters).

Even if we restrict ourselves to (say) Latin characters plus accents, there’s still plenty of potential for confusion; just look at ‘ë’ and ‘e̎’ or ‘ç’, ‘c̡’, ‘c̦’.
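
A short illustration (Python once more, for what it’s worth) of why normalisation doesn’t make this go away: it unifies sequences that Unicode defines as canonically equivalent, but mere look-alikes stay distinct.

```python
import unicodedata

# 'ç' (U+00E7) and 'c' + combining cedilla (U+0327) are canonically
# equivalent, so NFC composes the pair into the single character.
print(unicodedata.normalize("NFC", "c\u0327") == "\u00e7")   # True

# 'c' + combining comma below (U+0326) merely *looks* similar; no
# normalisation form will ever identify it with 'ç'.
print(unicodedata.normalize("NFC", "c\u0326") == "\u00e7")   # False

# Likewise 'ë' (e + combining diaeresis) versus 'e' + combining double
# vertical line above (U+030E): visually close, never equivalent.
print(unicodedata.normalize("NFD", "\u00eb") ==
      unicodedata.normalize("NFD", "e\u030e"))               # False
```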

For my money, then, allowing arbitrary Unicode identifiers is a mistake. I have nothing against the use of arbitrary Unicode in comments and string constants, but in the core syntax of a programming language more care is necessary in order to avoid creating problems.

Of course, some people retort that the issues I raise could be addressed by means of individual projects’ own policies. That’s true to a degree, though it still invites fragmentation of the developer community (which, I contend, is highly undesirable), and it doesn’t really address problems such as deciding whether two different characters or identifiers should be treated as equivalent.

Personally, while I’m in favour, in principle, of expanding the set of characters we allow in programming languages, I think it needs to be done carefully and with considerable thought.