Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

Code-points Are a Red Herring

Having just read Matt Galloway’s article about Swift from an Objective-C developer’s perspective, I have a few things to say, but the most important of them is really nothing to do with Swift, but rather has to do with a common misunderstanding.

Let me summarise my conclusion first, and then explain why I came to it a long time ago, and why it’s relevant to Swift.

If you are using Unicode strings, they should be (or at least look like they are) encoded in UTF-16.

“But code-points!”, I hear you cry.

Sure. If you use UTF-16, you can’t straightforwardly index into the string on a code-point basis. But why would you want to do that? The only justification I’ve ever heard is based around the notion that code-points somehow correspond to characters in a useful way. Which they don’t.

Now, someone is going to object that UTF-16 means that all their English language strings are twice as large as they need to be. But if you do what Apple did in Core Foundation and allow strings to be represented in ASCII (or more particularly in ISO Latin-1 or any subset thereof), converting to UTF-16 on the fly at the API level is trivial.

What about UTF-8? Why not use that? Well, if you stick to ASCII, UTF-8 is compact. If you include ISO Latin-1, UTF-8 is never larger than UTF-16. The problem comes with code-points that are inside the BMP but have values of 0x800 and above. Those code-points take three bytes to encode in UTF-8, but only two in UTF-16. For the most part this affects East Asian and Indic languages, though Eastern European languages and Greek are affected to some degree, as are mathematical symbols and various shape and dingbat characters.
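To make that concrete, here’s a small C illustration; the byte values are simply the standard encodings of U+0915 (DEVANAGARI LETTER KA):

#include <stdio.h>

int main(void)
{
  /* U+0915 DEVANAGARI LETTER KA */
  const unsigned char utf8[]  = { 0xe0, 0xa4, 0x95 };  /* 3 bytes in UTF-8 */
  const unsigned char utf16[] = { 0x09, 0x15 };        /* 2 bytes in UTF-16BE */

  printf("UTF-8: %zu bytes, UTF-16: %zu bytes\n", sizeof(utf8), sizeof(utf16));
  return 0;
}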

So, first off, UTF-8 is not necessarily any smaller than UTF-16.

Second, and this is an important one too, UTF-8 permits a variety of invalid encodings that can create security holes or cause other problems if not dealt with. For instance, you can encode NUL (code-point 0) in any of the following ways:

00
c0 80
e0 80 80
f0 80 80 80

Some older decoders may also accept

f8 80 80 80 80
fc 80 80 80 80 80

Officially, only the first encoding (00) is valid, but you as a developer need to check for and reject the other encodings. Additionally, any encoding of the code-points d800 through dfff is invalid and should be rejected — a lot of software fails to spot these and lets them through.
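In case it isn’t obvious what “checking” involves, here’s a minimal sketch in C of a decoder for a single code-point that rejects overlong forms and surrogates. This is not production code; a real decoder also has to report how many bytes it consumed and resynchronise after an error:

#include <stddef.h>

/* Decode one UTF-8 code-point; returns its value, or -1 if the sequence
   is invalid (truncated, overlong, a surrogate, or out of range). */
long utf8_decode_one(const unsigned char *p, size_t len)
{
  if (len < 1) return -1;
  if (p[0] < 0x80) return p[0];                         /* 1-byte form */
  if ((p[0] & 0xe0) == 0xc0) {                          /* 2-byte form */
    if (len < 2 || (p[1] & 0xc0) != 0x80) return -1;
    long cp = ((p[0] & 0x1f) << 6) | (p[1] & 0x3f);
    return cp < 0x80 ? -1 : cp;                         /* reject overlong */
  }
  if ((p[0] & 0xf0) == 0xe0) {                          /* 3-byte form */
    if (len < 3 || (p[1] & 0xc0) != 0x80 || (p[2] & 0xc0) != 0x80) return -1;
    long cp = ((p[0] & 0x0f) << 12) | ((p[1] & 0x3f) << 6) | (p[2] & 0x3f);
    if (cp < 0x800) return -1;                          /* reject overlong */
    if (cp >= 0xd800 && cp <= 0xdfff) return -1;        /* reject surrogates */
    return cp;
  }
  if ((p[0] & 0xf8) == 0xf0) {                          /* 4-byte form */
    if (len < 4 || (p[1] & 0xc0) != 0x80
                || (p[2] & 0xc0) != 0x80 || (p[3] & 0xc0) != 0x80) return -1;
    long cp = ((p[0] & 0x07) << 18) | ((p[1] & 0x3f) << 12)
            | ((p[2] & 0x3f) << 6) | (p[3] & 0x3f);
    if (cp < 0x10000 || cp > 0x10ffff) return -1;       /* overlong/out of range */
    return cp;
  }
  return -1;       /* 0xf8-0xff lead bytes (the 5- and 6-byte forms) */
}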

Finally, if you start in the middle of a UTF-8 string, you may need to move backwards by as many as three bytes to find the start of the code-point you’re in, and you can’t tell in advance how many without examining each byte.

For UTF-16, the story is much simpler. Once you’ve settled on the byte order, you really only need to watch out for broken surrogate pairs (i.e. use of d800 through dfff that doesn’t comply with the rules). Otherwise, you’re in pretty much the same boat as you would be if you’d picked UCS-4, except that in the majority of cases you’re using 2 bytes per code-point, and at most you’re using 4, so you never use more than UCS-4 would to encode the same string.

If you have a pointer into a UTF-16 string, you may at most need to move one code unit back, and that only happens if the code unit you’re looking at is between dc00 and dfff. That’s a much simpler rule than the one for UTF-8.
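In code, the entire resynchronisation rule looks like this (a sketch that assumes the string is valid UTF-16; a broken pair at index 0 would need extra handling):

#include <stddef.h>
#include <stdint.h>

/* Back up to the start of the code-point containing index i. */
size_t utf16_start_of_code_point(const uint16_t *s, size_t i)
{
  if (s[i] >= 0xdc00 && s[i] <= 0xdfff)   /* trailing surrogate? */
    return i - 1;                         /* the lead surrogate is just before */
  return i;
}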

I can hear someone at the back still going “but code-points…”. So let’s compare code-points with what the end user thinks of as characters and see how we get on, shall we?

Let’s start with some easy cases:

0 - U+0030
A - U+0041
e - U+0065

OK, they’re straightforward. How about

é - U+00E9

Seems OK, doesn’t it? But it could also be encoded as

é - U+0065 U+0301

Someone is now muttering about how “you could deal with that with normalisation”. And they’re right. But you can’t deal with this with normalisation:

ē̦ - U+0065 U+0304 U+0326

because there isn’t a precomposed variant of that character.

“Yeah”, you say, “but nobody would ever need that”. Really? It’s a valid encoding, and someone somewhere probably would like to be able to use it. Nevertheless, to deal with that objection, consider this:

בְּ - U+05D1 U+05B0 U+05BC

That character is in use in Hebrew. And there are other examples, too:

कू - U+0915 U+0942
क्ष - U+0915 U+094D U+0937

The latter case is especially interesting, because whether you see a single glyph or two depends on the font and on the text renderer that your browser is using(!)

The fact is that code-points don’t buy you much. The end user is going to expect all of these examples to count as a single “character” (except, possibly for the last one, depending on how it’s displayed to them on screen). They are not interested in the underlying representation you have to deal with, and they will not accept that you have any right to define the meaning of the word “character” to mean “Unicode code-point”. The latter simply does not mean anything to a normal person.

Now, sadly, the word “character” has been misused so widely that the Unicode consortium came up with a new name for the-thing-that-end-users-might-regard-as-a-unit-of-text. They call these things grapheme clusters, and in general they consist of a sequence of code-points of essentially arbitrary length.

Note that the reason people think using code-points will help them is that they are under the impression that a code-point maps one-to-one with some kind of “character”. It does not. As a result, you already have to deal with the fact that one “character” does not take up one code unit, even if you chose to use the Unicode code-point itself as your code unit. So you might as well use UTF-16: it’s no more complicated for you to implement, and it’s never larger than UCS-4.
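And counting code-points over UTF-16, should you ever actually need that figure, is trivial anyway; a quick sketch in C:

#include <stdint.h>
#include <stdio.h>

/* Count the code-points in a UTF-16 string by counting every unit
   that is not a trailing surrogate (0xdc00-0xdfff). */
size_t count_code_points(const uint16_t *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; ++i)
    if (s[i] < 0xdc00 || s[i] > 0xdfff)
      ++count;
  return count;
}

int main(void)
{
  const uint16_t e_acute[] = { 0x0065, 0x0301 };   /* "é", decomposed */
  printf("code units: 2, code-points: %zu, what the user sees: 1 character\n",
         count_code_points(e_acute, 2));
  return 0;
}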

It’s worth pointing out that this is exactly the choice the developers of ICU (the Unicode reference implementation) and Java (whose string implementation derives from the same place) made. It’s also the choice that was made in Objective-C and Core Foundation. And it’s the right choice. UTF-8 is more complicated to process and is not, in fact, smaller for many languages. If you want compatibility with ASCII, you can always allow some strings to be Latin-1 underneath and expand them to UTF-16 on the fly. UCS-4 is always larger, and actually no easier to process, because of combining character sequences and other non-spacing code-points.

Why is this relevant to Swift? Because in Matt Galloway’s article, it says:

Another nugget of good news is there is now a builtin way to calculate the true length of a string.

Only what Matt Galloway means by this is that Swift can calculate the number of code-points, which is a figure that is almost completely useless for any practical purpose I can think of. The only time you might care about it is if you were converting to UCS-4 and wanted to allocate a buffer of the correct size.

Async in Swift

You may have seen this piece I wrote about implementing something like C#’s async/await in Swift. While that code did work, it suffers from a couple of problems relative to what’s available in C#. The first problem is that it only supports a single return type, Int, because of a problem with the current version of the Swift compiler.

The second problem is that you can’t use it from the main thread in a Cocoa or Cocoa Touch program, because await blocks.

As I mentioned previously on Twitter, to make it work really well involves some shenanigans with the stack. Anyway, I’m pleased to announce that I’ve been merrily hacking away and as a result you can download a small framework project that implements async/await from BitBucket.

I’m quite pleased with the syntax I’ve managed to construct for this as well; it looks almost as if it’s a native language feature:

let task = async { () -> () in
  let fetch = async { (t: Task<NSData>) -> NSData in
    let req = NSURLRequest(URL: NSURL.URLWithString("http://www.google.com"))
    let queue = NSOperationQueue.mainQueue()
    var data: NSData! = nil
    NSURLConnection.sendAsynchronousRequest(req,
                                            queue:queue,
      completionHandler:{ (r: NSURLResponse!, d: NSData!, error: NSError!) -> Void in
        data = d
        Async.wake(t)
      })
    Async.suspend()
    return data!
  }

  let data = await(fetch)
  let str = NSString(bytes: data.bytes, length: data.length,
                     encoding: NSUTF8StringEncoding)

  println(str)
}

Now, to date I haven’t actually tried it on iOS; I think it should work, but it’s possible that it will crash horribly. It is certainly working on OS X, though.

How does it work? Well, behind the scenes, when you use the async function, a new (very small) stack is created for your code to run in. The C code then uses _setjmp() and _longjmp() to switch between different contexts when necessary. If you want to cringe slightly now, be my guest :–)
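To give a flavour of the mechanism, here’s a minimal sketch of the jump half of the trick. This is not the framework’s actual code; in particular, it’s the framework’s separately allocated per-task stacks that make it safe to later jump back *into* a suspended context, which plain single-stack C doesn’t allow:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf scheduler;

static void task_body(void)
{
  printf("task: about to suspend\n");
  longjmp(scheduler, 1);   /* "suspend": abandon this context and
                              resume at the saved jump point */
}

int main(void)
{
  if (setjmp(scheduler) == 0)
    task_body();           /* run the task; it never returns normally */
  printf("back in the scheduler\n");
  return 0;
}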

Possible improvements when I get the time:

  • Reduce the cost of async invocation by caching async context stacks
  • Once Swift is fixed, remove the T[] hack that we’re using instead of declaring the result type in the Task<T> object as T?. The latter presently doesn’t work because of a compiler limitation.

C#-like Async in Swift

Justin Williams was wishing for C#-like async support in Swift. I think it’s possible to come up with a fairly straightforward implementation in Swift, without any changes to the compiler, and actually without any hacking either. (If it weren’t for compiler bugs, the code below would be more than just a toy implementation too…)

Anyway, here goes:

import Dispatch

var async_q : dispatch_queue_t = dispatch_queue_create("Async queue",
  DISPATCH_QUEUE_CONCURRENT)

/* If generics worked, we'd use Task<T> here and result would be of type T? */
class Task {
  var result : Int?
  var sem : dispatch_semaphore_t = dispatch_semaphore_create(0)
    
  func await() -> Int {
    dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER)
    return result!
  }
}

func await(task: Task) -> Int {
  return task.await()
}

func async(b: () -> Int) -> Task {
  var r = Task()
  
  dispatch_async(async_q, {
    r.result = b()
    dispatch_semaphore_signal(r.sem)
    })
  
  return r
}

/* Now use it */
func Test2(var a : Int) -> Task { return async {
  sleep(1)
  return a * 7
  }
}

func Test(var a : Int) -> Task { return async {
  var t2 = Test2(a)
  var b = await(t2)
  
  return a + b
  }
}

var t = Test(100)

println("Waiting for result")

for n in 1..10 {
  println("I can do work here while the function works.")
}

var result = await(t)

println("Result is available")

Now, obviously if Swift supported continuations, this might be done more efficiently (i.e. without any background threads or semaphores), but that’s an implementation detail.

There are also some syntax changes that would make it cleaner, notably if it were permissible to remove the { return and } from the async function declarations. I did briefly try to see whether I was allowed to assign to a function, à la

func Test(var a : Int) -> Task = async { }

but that syntax isn’t allowed (if it were, async would obviously need to return a block).

1st January 2015 VAT Changes

On the 1st of January 2015, some changes to European Union law come into force that significantly affect the way that VAT works for “electronic services” delivered to consumers. The laws in question were actually changed back in 2008, but because of obstruction from some member states that benefit from the status quo, the date at which they came into effect was pushed back by six years.

If you are a software developer selling software in the European Union, these changes matter to you. There has been very little publicity thus far about these changes (that will change as we get closer to the end of the year), but given that you may need to make changes to your website, it seems like a good idea to tell you about them now.

So, what’s changing? Currently, if you are established in the European Union and you sell downloadable software to a customer who is also in the European Union, you always charge VAT in your country, following the rules in your country, and you pay it to the tax authority in your country. This is simple, because there is only one set of rules to follow, and it’s the one for your country.

As of the 1st of January, the VAT will instead be due in the customer’s country. If there were no other changes to the rules, you would therefore be obliged to register for VAT in other member states, according to their rules, and submit multiple returns every quarter (or at whatever period they specify). That means you might have to register with up to 28 member states, apply 28 different rates and 28 different sets of rules, make 28 times as many VAT returns, and send 28 separate payments in different currencies (with currency conversions and rounding following different rules in different jurisdictions). For a small software company or an independent developer, this is clearly not going to work.

There are two other changes that are also coming in at the same time that mitigate this problem. The first is that app stores will be responsible for charging and remitting consumer VAT. Apple already does this, but some other app stores may not. Under the new rules, they will have to, so you will only have to deal with VAT as it applies to transactions between you and the app store provider.

If you sell direct to consumers, that doesn’t really help, though. What will help is that EU member states are going to operate a system known as the Mini One Stop Shop (or MOSS for short). This is similar to the scheme that has been operating for businesses outside of the EU selling to EU customers, whereby you can register with a single tax authority, submit a single return to that tax authority, and pay all of the tax due to that one place. You are still required to charge VAT at the rate applicable in the customer’s country, and in various respects the rules in that country will still apply — with some simplifications. Registration for this new scheme starts in October, and, unless you plan on only selling via an app store, you will probably want to register for it.

The other slight complication is that after the 1st of January, you will need to keep two non-conflicting pieces of evidence to identify the location of your customer. HMRC has indicated, at least in the case of the U.K., that they will be fairly relaxed about this evidence — so, for instance, they realise that IP geolocation may not be 100% accurate, and that some customers may lie and give you false details. It also does not matter if you have more data that conflicts with your two non-conflicting pieces of evidence; all you need is those two. However, this affects all of your sales, not just those to customers in the EU, since it applies equally to your decision not to charge VAT to customers because they are not in any EU member state.

Why am I telling you about this? Because I’m a member of H.M. Revenue and Customs’ MOSS Joint SME Business/HMRC Working Group. Those of you who are in the UK, if you have queries about the scheme, or issues you would like to raise with HMRC, please do get in touch and I’ll try to help out. (If you are a member of TIGA, they have a couple of representatives on the working group also, so you can talk to them too.)

Finally, I will add that the law changes are already made — back in 2008 — so the scope for changing the rules at this stage is very limited. What we can influence to some extent is how they’re enforced and whether HMRC is aware of problems the new rules may cause us.

I’ll be posting some more on this topic over the coming weeks and months.

Dmgbuild - Build ‘.dmg’ Files From the Command Line

I’ve just released a new command line tool, dmgbuild, that automates the creation of (nice looking) disk images from the command line. No GUI tools are necessary, there’s no AppleScript involved, and it doesn’t rely on the Finder or on any deprecated APIs.

Why use this approach? Well, because everything about your disk image is defined in a plain text file, you’ll get the same results every time; not only that, but the resulting image will be the same no matter what version of Mac OS X you build it on.

If you’re interested, the Python package is up on PyPI, so you can just do

pip install dmgbuild

to get the program (if you don’t have pip, do easy_install pip first; or download it from PyPI, extract it, then run python setup.py install). You can also read the documentation, or see the code.

It’s really easy to use; all you need do is make a settings file (see the documentation for an example) then from the command line enter something like

dmgbuild -s my-settings.py "My Disk Image" output.dmg

The code for editing .DS_Store files and for generating Mac aliases has been split out into two other modules, ds_store and mac_alias, for those who are interested in such things. The ds_store module should be fully portable to other platforms; the mac_alias module relies on some OS X specific functions to fill out a proper alias record, and on other systems those would need to be replaced somehow. The dmgbuild tool itself relies on hdiutil and SetFile, so will only work on Mac OS X.

Bit-rot and RAID

There’s an interesting article on Ars Technica about next-generation filesystems, which mentions something it calls “bit rot” — allegedly the “silent corruption of data on disk or tape”.

Is this a thing? Really? Well, no, not really.

Very early on, disks and tapes were relatively unreliable and so there have basically always been checksums of some description to let you know if data you read is corrupted. Historically, we’re talking about some kind of per-block cyclic redundancy check, which is why one of the error codes you can receive at a disk hardware interface is “CRC error”.

Modern disks actually use error correcting codes such as Reed-Solomon Encoding or Low-Density Parity Check codes. A single random bit error under such schemes can be corrected, end of story. They may be able to correct multiple bit errors too, and these codes can detect more errors than they are able to correct.

The upshot is that a single bit flip on a disk surface won’t cause a read error; in fact, the software in your computer won’t even notice it because the hard disk will correct it and rewrite the data on its own.

It takes multiple flipped bits to cause a problem, and in most cases this will result in the drive reporting a failure to the operating system when it tries to read the block in question. The probability of a multi-bit failure that can get past Reed-Solomon or LDPC codes is tiny.

The author then goes on to make a ludicrous claim that RAID won’t be able to deal with this kind of event, and “demonstrates” by flipping “a single bit” on one of his disks to make his point. Unfortunately, this is a completely bogus test. He has, in fact, flipped many more bits than just the one, and he’s done so by writing to the disk, which will encode his data using its error correcting code, resulting in a block that reads correctly because he has actually stored the wrong data there deliberately.

The fact is that, in practice, when an unrecoverable data corruption occurs on a disk surface, the disk returns an error when something tries to read that block. If a RAID controller gets such an error, it will attempt to rebuild the data using parity (or whatever other redundancy mechanism it’s using).

So RAID really does protect you from changes that occur on the disk itself.

Where RAID does not protect you is on the computer side of the equation. It doesn’t prevent random bit flips in RAM, or in the logic inside your machine. Some components in some computers have their own built-in protection against these events — for instance, ECC memory uses error correcting codes to prevent random bit errors from corrupting data, while some data busses themselves use error correction. If you are seeing random bit flips in files that otherwise read OK, it’s much more likely they were introduced in the electronics or even via software bugs and written in their corrupted form to your storage device.

An aside: programmers generally use the term “bit rot” to refer to the fact that unmaintained code will often at some point stop working because of apparently unrelated changes in other parts of a large program. Such modules are said to be suffering from “bit rot”. I’ve never heard it used in the context of data storage before.

Sexual Harassment and Developers

Today I read yet another awful tale from a female attendee at a conference. This one was worse than average because it includes an actual sexual assault and involved the lady in question’s own boss, co-workers and friends. What happened to this lady, and numerous others, is awful.

So what’s this post about? Well, two things really.

The first has been annoying me for some time; when this kind of incident happens, certain commentators come out and tell us that there is some kind of widespread problem of sexism in the developer community and that that is the cause of these incidents. This is, in my experience (which we’ll come to in a bit), untrue. There is a problem of sexism with some people, both male and female, and some of them happen to be developers. It might seem to some female developers as if the problem is widespread, but as this excellent piece from Ian Gent points out, the fact that there are relatively few female developers means that they will (unfortunately) see a hugely disproportionate number of incidents. You might say that means that it’s important to do something about the people causing them, and I’d agree with you. It is, however, a long way from being a “widespread problem”.

On the other hand, and this is important too, sexism can be quite subjective. Not everybody agrees on (for instance) what remarks are and are not acceptable, and sometimes even when two people both think something is inappropriate, they may cite entirely different reasons (for instance, the TechCrunch idiocy is, IMO, stupid, unprofessional and crass, as opposed to sexist), and there have also been incidents where not everybody agrees that the line was even crossed. That last one was presented by many people (including some I follow on Twitter) as an example of sexism, and vigorously defended as such, but actually for many of us, the nasty aspect of it was that someone was fired for making a joke, privately, to his colleague. OK, a mildly blue joke, but sacking them for it was wildly disproportionate, and it backfired horribly on the lady who complained about it too. There’s a good summary on Venturebeat.

The second thing is an observation. Male software developers are a fairly socially inept group, and a reasonable proportion have had relatively little interaction with the opposite sex. This is not to excuse bad behaviour, but it is perhaps worth being aware, if you are a woman attending a tech conference or working in a tech company, that the people around you have more in common with Sheldon Cooper than with George Clooney, and so you might need to make it quite obvious if someone is doing something you don’t like. The poor lady who wrote the post I linked to above said at one point that her boss was kissing her but she “didn’t reciprocate”. That may very well not be sufficient. Remember, some of the people at a tech conference may never have been kissed. They don’t know what it’s supposed to feel like (and, actually, people do do it differently — some people really don’t kiss back very much, whereas with others it’s really obvious they’re into it). If someone is doing something you don’t want, tell them, CLEARLY. Don’t try to save their feelings, don’t try to be subtle; they may not get it.

Now, I’m sure the same people who feed us the widespread sexism meme will get very annoyed at what I just said, and bang on about how it’s never acceptable and the woman needs to say “yes” and so on. Look, we all know what the law has to say, but unless you’re spectacularly naïve and inexperienced you also know that that isn’t quite how the real world works. It isn’t even quite how most of us want the real world to work… most of the people I’ve kissed or been kissed by didn’t look me in the eye and go “Yes!” — it just happened — and nobody wants to sign a ten-page form before coming within two feet of each other.

The trouble is that implied consent, which actually is the norm, is difficult to convey and difficult to interpret; even more so if you’re socially inept and drunk. What is clear, even to someone totally socially inept, even when drunk, is a clear “No”, coupled if necessary — and hopefully it won’t be — with physical resistance. This also has the benefit of making it obvious to anyone else present that there is a problem. Most people will intervene to help you, if they know there is something wrong. If, on the other hand, something is happening and you aren’t doing anything to stop it or giving any indication of distress they will probably regard it as none of their business.

“But you don’t know what you’re talking about”. I can hear it already. Well, actually:

  1. I’ve had to fend off unwanted advances, from men too (and yes, men are more aggressive sometimes than women). So yes, I’ve been on the receiving end, as it were. Being clear about it works.

  2. When I was at school, I managed to make the life of one of the girls in my class utterly miserable. I didn’t mean to — it was the last thing I wanted — but, partly because I didn’t understand that she was not and would never be interested in me, I upset her.

    To be clear, I never touched her, so we’re certainly not talking about anything like the incident I linked to above, but I needed to be told that I was upsetting her. Looking back, I can see that she tried to be subtle about it, but as a teenager I was about as socially inept as they come and there was really no chance of me “getting it”. Had she clearly told me that I was upsetting her and asked me to stop behaving the way I was, much heartache could have been saved for both of us.

    As it was, she burst into tears on her way home, I heard about it second hand, and I spent the next five days pretty much in tears the whole time about how I’d hurt her.

    I’m ashamed to this day of this episode in my life, and it put me off even trying to have any kind of relationship for quite some time afterwards, just in case I got things wrong again and upset someone.

So I do have some perspective on this issue.

Let me re-iterate: if someone is doing something you don’t like, please TELL THEM. It’s no good letting them carry on doing it and expecting that they will notice that you aren’t kissing them back, or that you’re ignoring them, or some other such thing. Also, go for clarity. “No” is good; it’s simple, to the point and should have the desired effect.

Weev

There are numerous articles floating around the ’Net deploring the fact that the notoriously unpleasant Andrew Auernheimer (aka “weev”) has been locked up for stealing the e-mail addresses of some 114,000 different AT&T iPad 3G customers.

The most recent one to grab my attention was a piece by Errata Security’s Robert Graham which appears to intentionally misinterpret various aspects of the U.S. Government’s case against weev in order to support the notion that the CFAA (the U.S. equivalent of the U.K.’s Computer Misuse Act) could be used to arbitrarily prosecute anybody, and makes various arguments as to why the prosecutors’ case amounts to a “liberal reinterpretation of the rules of the Internet” to find weev in violation.

Unfortunately, as with most such claims I have seen, Graham has confused the technical and legal arguments in order to arrive at the conclusion he wants to reach.

First, he discusses “User-agent rules”, implying that because the User-Agent header is not intended to be used as a means of identification, the fact that Auernheimer’s accomplice, Spitler, spoofed the iPad’s User-Agent string should be disregarded. What he fails to mention is that while the User-Agent header is not supposed to be used as a means of identification, AT&T was in fact using it that way, and both of them knew this to be the case, since they knew that the AT&T server would not respond unless the User-Agent string was set to something matching that provided by an iPad.

Second, he talks about the URL, and in particular the fact that the URL is visible to end-users in ordinary web browsers and that it is also generally modifiable by end-users. Again, the facts surrounding this particular case are omitted in favour of generalisations.

In particular, Graham’s argument fails to note that the mere existence of a web server on a particular machine most certainly does not imply that anyone anywhere is authorised to access any URL at that address. Even if some URLs at a particular address are public, there may well be private URLs that are unintentionally accessible; that does not mean that you are authorised to use them.

Moreover, the URL at issue was not public — it formed part of the carrier settings for AT&T supplied to iPad users in encrypted form and was never presented in the clear to those end users for their use. In order to obtain it in the first place, Spitler had to decrypt the iOS binary (which he was not in any case entitled to be in possession of, as he did not have an iPad), and then had to hunt through the resulting data for likely URLs.

While there is a legitimate worry that CFAA or the Computer Misuse Act could end up being applied to a normal Internet user who had merely replaced ?articleID=12 with ?articleID=13, that is very much not what happened in this case. Rather, the secret URL in question made use of the device’s ICC-ID to allow AT&T to pre-fill the user’s e-mail address in a log-in form. The ICC-ID is a secret number shared between the device and the network operator; it is a bit like a credit card number, both in length and format, and in that the number is very likely visible on the SIM card and/or packaging. It is also like a credit card number in that it is possible to generate valid numbers — and in this particular instance, Spitler and Auernheimer did so based on the knowledge that the ICC-IDs issued to AT&T iPads were sequential in nature.

Note again — the mere fact that it is possible to generate valid numbers that are not yours does not entitle you to use those numbers to authenticate yourself, either to a bank in the case of credit card numbers, or in this instance to AT&T’s network. Nor does the fact that both credit card numbers and ICC-IDs provide relatively poor security guarantees mean that you are entitled to take advantage of that fact.

The prosecution has not, as Graham claims, insisted that mere editing of a URL is illegal. The unlawful act is the unauthorised access, and there is no question that both Spitler and Auernheimer knew that they were not authorised to harvest e-mail addresses from AT&T’s servers in this manner (indeed, as the prosecution points out, Auernheimer referred to it as “the theft” in an e-mail to a journalist).

Graham then goes on to worry that “legitimate” security researchers may in fact be no different to Auernheimer, which is a bit of a stretch. White hat security researchers, who ask permission (or are contracted by the target) before attempting any kind of exploit, could not find themselves in the mess Auernheimer finds himself in. Even grey hats, who may not ask permission first, would at least have told their target about the security hole and given them the opportunity to repair it. Auernheimer did not do that. Instead, he attempted to use the security breach for his own personal gain (in this case to promote himself and his security company), and purposely did not inform AT&T prior to handing the e-mail addresses over to the press.

This is not, as some are suggesting, a case of the U.S. Government making an “arbitrary decision that you’ve violated the CFAA”. That conclusion is simply not supported by the evidence.

Should Auernheimer be locked up? Undoubtedly, if not for this then for any number of other things he’s done in the past. What about for this specific offence? Yes, I think he should. It seems to me unarguable that AT&T did not intend for anyone to be able to garner a list of e-mail addresses from their server. It also seems unarguable that Auernheimer and Spitler knew this — i.e. they knew that their activity would constitute unauthorised misuse. Auernheimer is certainly unpleasant but he is not an idiot, and he must also have been aware that there was a risk that law-enforcement might take issue with his activities. There is simply no way to reconcile these facts with the notion that he is somehow being victimised for doing something that is in any way reasonable.

Sad Times

This is my second attempt at writing about a subject that I find very difficult.

The summary is that my company, Coriolis Systems, is not doing so well. That’s not to say that we’re going anywhere any time soon, but the fact is that as SSDs have become more prevalent, there’s less and less need for our existing products.

This is particularly true for iDefrag, our best-selling product and the one that generates most of the company’s revenue. There’s usually no need to defragment SSDs.

The situation hasn’t been helped by the rise of the Mac App Store, from which all of our current products and even one or two of those we have under development are barred because they need low-level access to the system. New users who want software are very likely to use the App Store to find it, and we can’t be in there. On the other hand, products with misleading names like “Disk Doctor” appear to be very much welcome (it’s in “Top Paid”, and it very definitely is not a disk repair tool, which is what its name implies it should be).

To help illustrate, here’s a graph showing our sales, month on month, from when the company started to the end of the last financial year.

The result of this is that, at the end of this month, I’m having to make two of my staff, Ed Warrick and James Snook, redundant, and we’re also leaving the office premises we’ve been using for the past five years. Ed and James are really good guys and it’s a real blow to lose them; you’d be a fool not to hire either of them, quite frankly.

Encapsulation in C

This comes up again and again, and I’ve seen various bad advice given, even in textbooks, so I thought I’d write a quick post about it. People are increasingly familiar with OO languages and perhaps not so familiar with plain old C, so when faced with having to drop down to pure C for some reason it’s quite common to ask how to achieve some of the design patterns they’re familiar with from OO, such as encapsulation, data hiding and the like.

What do I mean? Well, the canonical example where this kind of thing is useful is data structures, so let’s imagine that we want to make a library of functions that implements a key-value map. We might start with a header file containing something like this:

typedef struct {
  void *key;
  void *value;
} kvmap_entry_t;

typedef struct {
  unsigned size, count;
  kvmap_entry_t *entries;
  void          *userdata;
} kvmap_t;

typedef int (*kvmap_key_comparator_t)(const void *k1, const void *k2,
                                      void *userdata);

kvmap_t *kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata);

void kvmap_destroy (kvmap_t *map);

void kvmap_set (kvmap_t *map, const void *key, const void *value);

void *kvmap_get (kvmap_t *map, void *key);

void *kvmap_remove (kvmap_t *map, void *key);

There are lots of possible objections to this, but perhaps the worst thing is that the actual data structure is visible to the user of these APIs. Looking at the above, it seems likely that the map is currently stored as a sorted list (which is a perfectly reasonable representation for small key-value maps, especially if look-ups dominate), but in the future we might want to use a more sophisticated structure—or even (and this is what Core Foundation does on the Mac) choose a structure based on the data we have. If we expose it to the user, we can’t.

Textbooks and articles you might have read in print often suggest the following “solution”: use void *. Then our header looks like this:

typedef void *kvmap_t;

typedef int (*kvmap_key_comparator_t)(const void *k1, const void *k2,
                                      void *userdata);

kvmap_t kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata);

void kvmap_destroy (kvmap_t map);

void kvmap_set (kvmap_t map, const void *key, const void *value);

void *kvmap_get (kvmap_t map, void *key);

void *kvmap_remove (kvmap_t map, void *key);

Then in your implementation routines you’ll need to write something like

#include "kvmap.h"

typedef struct {
  unsigned size, count;
  kvmap_entry_t *entries;
  void          *userdata;
} kvmap_internal_t;

...

void *
kvmap_get (kvmap_t map, void *key)
{
   kvmap_internal_t *pmap = (kvmap_internal_t *)map;

   ...
}

Looks good, right? I mean, we can’t see the data any more.

Well…

It’s better, in the sense that you indeed cannot now see the implementation quite so obviously. Unfortunately, using void * means that there is no longer any useful type checking. We can pass any pointer at all as the map parameter of these functions, and we can just as easily pass the map pointer to any number of unrelated functions by accident.
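For instance, with the void * version of the header above, this entirely bogus caller compiles without a murmur (a hypothetical example, of course):

#include "kvmap.h"

void oops(void)
{
  int x = 42;
  /* Nonsense, but kvmap_t is just void *, so the compiler accepts it: */
  kvmap_set(&x, "key", "value");
}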

There is, however, a better way.

C supports forward declarations of struct types, and will allow you to use a pointer to a struct whose contents you haven’t yet specified. We can use this feature to write a better version:

typedef struct kvmap *kvmap_t;

typedef int (*kvmap_key_comparator_t)(const void *k1, const void *k2,
                                      void *userdata);

kvmap_t kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata);

void kvmap_destroy (kvmap_t map);

void kvmap_set (kvmap_t map, const void *key, const void *value);

void *kvmap_get (kvmap_t map, void *key);

void *kvmap_remove (kvmap_t map, void *key);

Now in your implementation you can just define struct kvmap. You don’t need to do this in the header file, and you don’t need a typedef:

#include "kvmap.h"

struct kvmap {
   unsigned size, count;
   kvmap_entry_t *entries;
   void          *userdata;
};

then when you want to use it in your functions, you can just treat the map argument as a pointer.

The best part is that the C compiler will now complain if you try to pass anything that is not a kvmap_t as the map argument. And if you have other abstract data type implementations (maybe you have some list routines as well), it will complain if you pass your kvmap_t into one of them accidentally as well.
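To put it all together, here’s a hypothetical caller, assuming the functions declared in the header have been implemented:

#include <stdio.h>
#include <string.h>

#include "kvmap.h"

static int compare_strings(const void *k1, const void *k2, void *userdata)
{
  (void)userdata;               /* unused in this comparator */
  return strcmp(k1, k2);
}

int main(void)
{
  kvmap_t map = kvmap_create(compare_strings, NULL);

  kvmap_set(map, "answer", "forty-two");
  printf("answer = %s\n", (const char *)kvmap_get(map, "answer"));

  /* kvmap_set(&map, "answer", "nonsense");
     ^ now a compile-time diagnostic: 'kvmap_t *' is not 'kvmap_t' */

  kvmap_destroy(map);
  return 0;
}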

There’s loads more I could write on this topic, but for now, just remember that it’s OK in C to define, declare and use a pointer to a struct type that you haven’t defined. You can’t dereference it or do pointer arithmetic on it, but that’s kind of the point :–)