Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

Code-points Are a Red Herring

Having just read Matt Galloway’s article about Swift from an Objective-C developer’s perspective, I have a few things to say, but the most important of them is really nothing to do with Swift, but rather has to do with a common misunderstanding.

Let me summarise my conclusion first, and then explain why I came to it a long time ago, and why it’s relevant to Swift.

If you are using Unicode strings, they should be (or at least look like they are) encoded in UTF-16.

“But code-points!”, I hear you cry.

Sure. If you use UTF-16, you can’t straightforwardly index into the string on a code-point basis. But why would you want to do that? The only justification I’ve ever heard is based around the notion that code-points somehow correspond to characters in a useful way. Which they don’t.

Now, someone is going to object that UTF-16 means that all their English language strings are twice as large as they need to be. But if you do what Apple did in Core Foundation and allow strings to be represented in ASCII (or more particularly in ISO Latin-1 or any subset thereof), converting to UTF-16 on the fly at the API level is trivial.

What about UTF-8? Why not use that? Well, if you stick to ASCII, UTF-8 is compact. If you include ISO Latin-1, UTF-8 is never larger than UTF-16. The problem comes with code-points that are inside the BMP but have values of U+0800 and above. Those code-points take three bytes to encode in UTF-8, but only two in UTF-16. For the most part this affects East Asian and Indic languages, though Eastern European languages and Greek are affected to some degree, as are mathematical symbols and various shape and dingbat characters.

So, first off, UTF-8 is not necessarily any smaller than UTF-16.
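To make the size comparison concrete, here's a quick sketch in Python (the sample strings are my own, chosen only to span the relevant ranges):

```python
# Compare encoded byte counts per script. Characters from U+0800
# upwards cost three bytes in UTF-8 but only two in UTF-16.
# ("utf-16-le" just suppresses the byte-order mark.)
samples = {
    "English":  "Hello, world",
    "Greek":    "Καλημέρα κόσμε",
    "Japanese": "こんにちは世界",
}

for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))
    print(f"{name}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

English comes out half the size in UTF-8, Greek is roughly a wash (its letters are two bytes in both encodings), and the Japanese text is half as large again in UTF-8 as in UTF-16.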

Second, and this is an important one too, UTF-8 permits a variety of invalid encodings that can create security holes or cause other problems if not dealt with. For instance, you can encode NUL (code-point 0) in any of the following ways:

c0 80
e0 80 80
f0 80 80 80

Some older decoders may also accept

f8 80 80 80 80
fc 80 80 80 80 80

Officially, only the single-byte encoding (00) is valid, and you as a developer need to check for and reject the overlong encodings above. Additionally, any encoding of the code-points d800 through dfff is invalid and should be rejected; a lot of software fails to spot these and lets them through.
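A strict decoder does all of this checking for you; Python's UTF-8 codec, for instance, rejects both the overlong forms and encoded surrogates, as this sketch shows:

```python
# Overlong encodings of NUL, plus an encoded surrogate (U+D800).
# A conforming UTF-8 decoder must reject every one of these.
bad_inputs = [
    b"\xc0\x80",          # two-byte overlong encoding of NUL
    b"\xe0\x80\x80",      # three-byte overlong encoding of NUL
    b"\xf0\x80\x80\x80",  # four-byte overlong encoding of NUL
    b"\xed\xa0\x80",      # encoded surrogate U+D800
]

for raw in bad_inputs:
    try:
        raw.decode("utf-8")
        print(raw, "accepted (a security bug!)")
    except UnicodeDecodeError:
        print(raw, "rejected")

# Only the single-byte form of NUL is valid:
assert b"\x00".decode("utf-8") == "\x00"
```

If you are writing your own decoder, you have to implement these checks yourself; getting them wrong is exactly how the classic overlong-encoding security holes happened.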

Finally, if you start in the middle of a UTF-8 string, you may need to move a variable number of bytes to find the character you’re in, and you can’t tell in advance how many that will be.

For UTF-16, the story is much simpler. Once you’ve settled on the byte order, you really only need to watch out for broken surrogate pairs (i.e. use of d800 through dfff that doesn’t comply with the rules). Otherwise, you’re in pretty much the same boat as you would be if you’d picked UCS-4, except that in the majority of cases you’re using 2 bytes per code-point, and at most you’re using 4, so you never use more than UCS-4 would to encode the same string.

If you have a pointer into a UTF-16 string, you may at most need to move one code unit back, and that only happens if the code unit you’re looking at is between dc00 and dfff. That’s a much simpler rule than the one for UTF-8.
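The two resynchronisation rules can be sketched in a few lines; the function names here are mine, purely for illustration:

```python
def utf8_sync_back(buf: bytes, i: int) -> int:
    """Step back to the start of the UTF-8 sequence containing buf[i].
    Continuation bytes look like 10xxxxxx; there can be up to three."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

def utf16_sync_back(units, i: int) -> int:
    """Step back to the start of the UTF-16 sequence containing units[i].
    At most one step, and only if we landed on a low surrogate."""
    if i > 0 and 0xDC00 <= units[i] <= 0xDFFF:
        i -= 1
    return i

buf = "a𝄞".encode("utf-8")        # U+1D11E is four bytes in UTF-8
units = [0x0061, 0xD834, 0xDD1E]  # the same string as UTF-16 code units

print(utf8_sync_back(buf, 4))     # mid-sequence: steps back three bytes
print(utf16_sync_back(units, 2))  # low surrogate: steps back one unit
```

Note the asymmetry: the UTF-8 loop has to keep testing bytes, while the UTF-16 version is a single comparison.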

I can hear someone at the back still going “but code-points…”. So let’s compare code-points with what the end user thinks of as characters and see how we get on, shall we?

Let’s start with some easy cases:

0 - U+0030
A - U+0041
e - U+0065

OK, they’re straightforward. How about

é - U+00E9

Seems OK, doesn’t it? But it could also be encoded as

é - U+0065 U+0301

Someone is now muttering about how “you could deal with that with normalisation”. And they’re right. But you can’t deal with this with normalisation:

ē̦ - U+0065 U+0304 U+0326

because there isn’t a precomposed variant of that character.
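You can check this with any Unicode normalisation library; here's a sketch using Python's unicodedata module:

```python
import unicodedata

# "é" as base letter plus combining acute normalises to a single
# code-point under NFC...
decomposed = "e\u0301"
assert len(unicodedata.normalize("NFC", decomposed)) == 1

# ...but e + combining macron + combining comma below has no fully
# precomposed form, so NFC still leaves a combining mark behind.
stacked = "e\u0304\u0326"
composed = unicodedata.normalize("NFC", stacked)
print([hex(ord(c)) for c in composed])
assert len(composed) > 1
```

So even after normalisation, one user-visible character can remain a multi-code-point sequence.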

“Yeah”, you say, “but nobody would ever need that”. Really? It’s a valid encoding, and someone somewhere probably would like to be able to use it. Nevertheless, to deal with that objection, consider this:

בְּ - U+05D1 U+05B0 U+05BC

That character is in use in Hebrew. And there are other examples, too:

कू - U+0915 U+0942
कष - U+0915 U+0937

The latter case is especially interesting, because whether you see a single glyph or two depends on the font and on the text renderer that your browser is using(!)

The fact is that code-points don’t buy you much. The end user is going to expect all of these examples to count as a single “character” (except, possibly for the last one, depending on how it’s displayed to them on screen). They are not interested in the underlying representation you have to deal with, and they will not accept that you have any right to define the meaning of the word “character” to mean “Unicode code-point”. The latter simply does not mean anything to a normal person.

Now, sadly, the word “character” has been misused so widely that the Unicode consortium came up with a new name for the-thing-that-end-users-might-regard-as-a-unit-of-text. They call these things grapheme clusters, and in general they consist of a sequence of code-points of essentially arbitrary length.

Note that the reason people think using code-points will help them is that they are under the impression that a code-point maps one-to-one with some kind of “character”. It does not. As a result, you already have to deal with the fact that one “character” does not take up one code unit, even if you chose to use the Unicode code-point itself as your code unit. So you might as well use UTF-16: it’s no more complicated for you to implement, and it’s never larger than UCS-4.
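A quick illustration of the point, using Python, whose len() counts code-points, and counting UTF-16 code units by encoding (the labels are mine):

```python
# Each of these is a single "character" to the end user, yet neither
# the code-point count nor the UTF-16 code-unit count is reliably 1.
examples = {
    "precomposed e-acute": "\u00e9",
    "decomposed e-acute":  "e\u0301",
    "e with stacked marks": "e\u0304\u0326",
    "musical clef, outside the BMP": "\U0001d11e",
}

for label, s in examples.items():
    code_points = len(s)                           # code-points
    utf16_units = len(s.encode("utf-16-le")) // 2  # UTF-16 code units
    print(f"{label}: {code_points} code-points, {utf16_units} UTF-16 units")
```

The counts disagree with the user's notion of "one character" in both directions: combining sequences make the code-point count too big, and (for UTF-16) astral characters make the code-unit count too big as well.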

It’s worth pointing out at this point that this is the exact choice that the developers of ICU (the Unicode reference implementation) and Java (whose string implementation derives from the same place) made. It’s also the choice that was made in Objective-C and Core Foundation. And it’s the right choice. UTF-8 is more complicated to process and is not, actually, smaller for many languages. If you want compatibility with ASCII, you can always allow some strings to be Latin-1 underneath and expand them to UTF-16 on the fly. UCS-4 is always larger and actually no easier to process because of combining character sequences and other non-spacing code-points.

Why is this relevant to Swift? Because in Matt Galloway’s article, it says:

Another nugget of good news is there is now a builtin way to calculate the true length of a string.

Only what Matt Galloway means by this is that it can calculate the number of code-points, which is a figure that is almost completely useless for any practical purpose I can think of. The only time you might care about that is if you were converting to UCS-4 and wanted to allocate a buffer of the correct size.
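That one use is easy to demonstrate: in a UCS-4 (UTF-32) encoding, a buffer needs exactly four bytes per code-point, no more and no less. A sketch in Python, reusing the Hebrew example from above:

```python
# The one practical use of the code-point count: sizing a UCS-4 buffer.
s = "\u05d1\u05b0\u05bc"   # one user-visible character, three code-points
ucs4 = s.encode("utf-32-le")   # "-le" suppresses the byte-order mark
assert len(ucs4) == 4 * len(s)  # four bytes per code-point, exactly
print(len(s), "code-points ->", len(ucs4), "bytes of UCS-4")
```

For anything user-facing (cursor movement, truncation, counting "characters"), you want grapheme clusters, not this number.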