Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

1st January 2015 VAT Changes

On the 1st of January 2015, some changes to European Union law come into force that significantly affect the way that VAT works for “electronic services” delivered to consumers. The laws in question were actually changed back in 2008, but because of obstruction from some member states that benefit from the status quo, the date at which they came into effect was pushed back by six years.

If you are a software developer selling software in the European Union, these changes matter to you. There has been very little publicity thus far about these changes (that will change as we get closer to the end of the year), but given that you may need to make changes to your website, it seems like a good idea to tell you about them now.

So, what’s changing? Currently, if you are established in the European Union and you sell downloadable software to a customer who is also in the European Union, you always charge VAT in your country, following the rules in your country, and you pay it to the tax authority in your country. This is simple, because there is only one set of rules to follow, and it’s the one for your country.

As of the 1st of January, the VAT will instead be due in the customer’s country. If there were no other changes to the rules, you would therefore be obliged to register for VAT in other member states, according to their rules, and submit multiple returns every quarter (or at whatever interval they specify). That means you might have to register with up to 28 member states, apply 28 different rates under 28 different sets of rules, make 28 times as many VAT returns and 28 separate payments in different currencies (with currency conversions and rounding following different rules in different jurisdictions). For a small software company or an independent developer, this is clearly not going to work.

There are two other changes that are also coming in at the same time that mitigate this problem. The first is that app stores will be responsible for charging and remitting consumer VAT. Apple already does this, but some other app stores may not. Under the new rules, they will have to, so you will only have to deal with VAT as it applies to transactions between you and the app store provider.

If you sell direct to consumers, that doesn’t really help, though. What will help is that EU member states are going to operate a system known as the Mini One Stop Shop (or MOSS for short). This is similar to the scheme that has been operating for businesses outside of the EU selling to EU customers, whereby you can register with a single tax authority, submit a single return to that tax authority, and pay all of the tax due to that one place. You are still required to charge VAT at the rate applicable in the customer’s country, and in various respects the rules in that country will still apply — with some simplifications. Registration for this new scheme starts in October, and, unless you plan on only selling via an app store, you will probably want to register for it.

The other slight complication is that after the 1st of January, you will need to keep two non-conflicting pieces of evidence identifying the location of your customer. HMRC has indicated that, at least in the case of the U.K., they will be fairly relaxed about this evidence — so, for instance, they realise that IP geolocation may not be 100% accurate, and that some customers may lie and give you false details. It also does not matter if you have more data that conflicts with your two non-conflicting pieces of evidence; all you need is those two. However, this affects all of your sales, not just those to customers in the EU, since it applies equally to your decision not to charge VAT to customers because they are not in any EU member state.

Why am I telling you about this? Because I’m a member of H.M. Revenue and Customs’ MOSS Joint SME Business/HMRC Working Group. If you’re in the UK and have queries about the scheme, or issues you would like to raise with HMRC, please do get in touch and I’ll try to help out. (If you are a member of TIGA, they have a couple of representatives on the working group also, so you can talk to them too.)

Finally, I will add that the law changes are already made — back in 2008 — so the scope for changing the rules at this stage is very limited. What we can influence to some extent is how they’re enforced and whether HMRC is aware of problems the new rules may cause us.

I’ll be posting some more on this topic over the coming weeks and months.

Dmgbuild - Build ‘.dmg’ Files From the Command Line

I’ve just released a new command line tool, dmgbuild, that automates the creation of (nice looking) disk images from the command line. No GUI tools are necessary, there is no AppleScript involved, and it doesn’t rely on the Finder or on any deprecated APIs.

Why use this approach? Well, because everything about your disk image is defined in a plain text file, you’ll get the same results every time; not only that, but the resulting image will be the same no matter what version of Mac OS X you build it on.

If you’re interested, the Python package is up on PyPI, so you can just do

pip install dmgbuild

to get the program (if you don’t have pip, do easy_install pip first; or download it from PyPI, extract it, then run python setup.py install). You can also read the documentation, or see the code.

It’s really easy to use; all you need do is make a settings file (see the documentation for an example) then from the command line enter something like

dmgbuild -s my-settings.py "My Disk Image" output.dmg

The code for editing .DS_Store files and for generating Mac aliases has been split out into two other modules, ds_store and mac_alias, for those who are interested in such things. The ds_store module should be fully portable to other platforms; the mac_alias module relies on some OS X specific functions to fill out a proper alias record, and on other systems those would need to be replaced somehow. The dmgbuild tool itself relies on hdiutil and SetFile, so will only work on Mac OS X.

Bit-rot and RAID

There’s an interesting article on Ars Technica about next-generation filesystems, which mentions something it calls “bit rot” — allegedly the “silent corruption of data on disk or tape”.

Is this a thing? Really? Well, no, not really.

Very early on, disks and tapes were relatively unreliable and so there have basically always been checksums of some description to let you know if data you read is corrupted. Historically, we’re talking about some kind of per-block cyclic redundancy check, which is why one of the error codes you can receive at a disk hardware interface is “CRC error”.
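To make that concrete, here’s a toy sketch of the per-block checksum idea in C, using the familiar CRC-32 polynomial (the function name crc32_buf and the 512-byte “block” are mine, purely for illustration; real drive firmware is far more sophisticated). The point is simply that a flipped bit changes the checksum, so the corruption cannot pass unnoticed:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC-32 (the polynomial used by Ethernet, zip and friends). */
static uint32_t crc32_buf (const uint8_t *data, size_t len)
{
  uint32_t crc = 0xffffffff;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
  }
  return ~crc;
}

int main (void)
{
  uint8_t block[512];
  memset (block, 0xa5, sizeof (block));

  /* The checksum computed when the block was written... */
  uint32_t stored = crc32_buf (block, sizeof (block));

  /* ...then a single bit flips on the medium... */
  block[100] ^= 0x04;

  /* ...and the mismatch is caught on the next read. */
  uint32_t reread = crc32_buf (block, sizeof (block));
  printf ("stored %08x, re-read %08x: %s\n", stored, reread,
          stored == reread ? "OK" : "corruption detected");
  return 0;
}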

Modern disks actually use error correcting codes such as Reed-Solomon encoding or Low-Density Parity Check (LDPC) codes. A single random bit error under such schemes can be corrected, end of story. They may be able to correct multiple bit errors too, and these codes can detect more errors than they are able to correct.
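If you’d like to see correction (rather than mere detection) in action, here is a minimal sketch of the classic Hamming(7,4) code (far simpler and weaker than the Reed-Solomon or LDPC codes drives really use, and the function names are mine), showing the principle: the decoder computes a syndrome that points at the flipped bit, which it then simply flips back:

#include <stdio.h>

/* Hamming(7,4): 4 data bits become a 7-bit codeword; any single
   flipped bit can be corrected.  Bit positions 1..7 hold
   p1 p2 d1 p4 d2 d3 d4, with the parity bits chosen so that the
   syndrome computed on decode is the 1-based position of the error. */

static unsigned hamming_encode (unsigned d)
{
  unsigned d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
  unsigned p1 = d1 ^ d2 ^ d4;    /* covers positions 3, 5, 7 */
  unsigned p2 = d1 ^ d3 ^ d4;    /* covers positions 3, 6, 7 */
  unsigned p4 = d2 ^ d3 ^ d4;    /* covers positions 5, 6, 7 */
  return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3)
            | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

static unsigned hamming_decode (unsigned c)
{
  unsigned s1 = ((c >> 0) ^ (c >> 2) ^ (c >> 4) ^ (c >> 6)) & 1;
  unsigned s2 = ((c >> 1) ^ (c >> 2) ^ (c >> 5) ^ (c >> 6)) & 1;
  unsigned s4 = ((c >> 3) ^ (c >> 4) ^ (c >> 5) ^ (c >> 6)) & 1;
  unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);
  if (syndrome)
    c ^= 1u << (syndrome - 1);   /* flip the offending bit back */
  return ((c >> 2) & 1) | (((c >> 4) & 1) << 1)
       | (((c >> 5) & 1) << 2) | (((c >> 6) & 1) << 3);
}

int main (void)
{
  unsigned data = 0xb;                       /* the nibble 1011 */
  unsigned word = hamming_encode (data);
  word ^= 1u << 4;                           /* a single stored bit “rots” */
  printf ("sent %x, recovered %x\n", data, hamming_decode (word));
  return 0;
}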

The upshot is that a single bit flip on a disk surface won’t cause a read error; in fact, the software in your computer won’t even notice it because the hard disk will correct it and rewrite the data on its own.

It takes multiple flipped bits to cause a problem, and in most cases this will result in the drive reporting a failure to the operating system when it tries to read the block in question. The probability of a multi-bit failure that can get past Reed-Solomon or LDPC codes is tiny.

The author then goes on to make the ludicrous claim that RAID won’t be able to deal with this kind of event, and “demonstrates” this by flipping “a single bit” on one of his disks. Unfortunately, this is a completely bogus test. He has, in fact, flipped many more bits than just the one, and he has done so by writing to the disk, which will encode his data using its error correcting code, resulting in a block that reads back correctly because he has deliberately stored the wrong data there.

The fact is that, in practice, when an unrecoverable data corruption occurs on a disk surface, the disk returns an error when something tries to read that block. If a RAID controller gets such an error, it will attempt to rebuild the data using parity (or whatever other redundancy mechanism it’s using).

So RAID really does protect you from changes that occur on the disk itself.

Where RAID does not protect you is on the computer side of the equation. It doesn’t prevent random bit flips in RAM, or in the logic inside your machine. Some components in some computers have their own built-in protection against these events — for instance, ECC memory uses error correcting codes to prevent random bit errors from corrupting data, while some data buses also use error correction. If you are seeing random bit flips in files that otherwise read OK, it’s much more likely they were introduced in the electronics or even via software bugs, and written in their corrupted form to your storage device.

An aside: programmers generally use the term “bit rot” to refer to the fact that unmaintained code will often at some point stop working because of apparently unrelated changes in other parts of a large program. Such modules are said to be suffering from “bit rot”. I’ve never heard it used in the context of data storage before.

Sexual Harassment and Developers

Today I read yet another awful tale from a female attendee at a conference. This one was worse than average because it includes an actual sexual assault and involved the lady in question’s own boss, co-workers and friends. What happened to this lady, and numerous others, is awful.

So what’s this post about? Well, two things really.

The first has been annoying me for some time; when this kind of incident happens, certain commentators come out and tell us that there is some kind of widespread problem of sexism in the developer community and that that is the cause of these incidents. This is, in my experience (which we’ll come to in a bit), untrue. There is a problem of sexism with some people, both male and female, and some of them happen to be developers. It might seem to some female developers as if the problem is widespread, but as this excellent piece from Ian Gent points out, the fact that there are relatively few female developers means that they will (unfortunately) see a hugely disproportionate number of incidents. You might say that means that it’s important to do something about the people causing them, and I’d agree with you. It is, however, a long way from being a “widespread problem”.

On the other hand, and this is important too, sexism can be quite subjective. Not everybody agrees on (for instance) what remarks are and are not acceptable, and sometimes even when two people both think something is inappropriate, they may cite entirely different reasons (for instance, the TechCrunch idiocy is, IMO, stupid, unprofessional and crass, as opposed to sexist), and there have also been incidents where not everybody agrees that the line was even crossed. That last one was presented by many people (including some I follow on Twitter) as an example of sexism, and vigorously defended as such, but actually for many of us, the nasty aspect of it was that someone was fired for making a joke, privately, to his colleague. OK, a mildly blue joke, but sacking them for it was wildly disproportionate, and it backfired horribly on the lady who complained about it too. There’s a good summary on Venturebeat.

The second thing is an observation. Male software developers are a fairly socially inept group, and a reasonable proportion have had relatively little interaction with the opposite sex. This is not to excuse bad behaviour, but it is perhaps worth being aware, if you are a woman attending a tech conference or working in a tech company, that the people around you have more in common with Sheldon Cooper than with George Clooney, and so you might need to make it quite obvious if someone is doing something you don’t like. The poor lady who wrote the post I linked to above said at one point that her boss was kissing her but she “didn’t reciprocate”. That may very well not be sufficient. Remember, some of the people at a tech conference may never have been kissed. They don’t know what it’s supposed to feel like (and, actually, people do do it differently — some people really don’t kiss back very much, whereas with others it’s really obvious they’re into it). If someone is doing something you don’t want, tell them, CLEARLY. Don’t try to save their feelings, don’t try to be subtle; they may not get it.

Now, I’m sure the same people who feed us the widespread sexism meme will get very annoyed at what I just said, and bang on about how it’s never acceptable and the woman needs to say “yes” and so on. Look, we all know what the law has to say, but unless you’re spectacularly naïve and inexperienced you also know that that isn’t quite how the real world works. It isn’t even quite how most of us want the real world to work… most of the people I’ve kissed or been kissed by didn’t look me in the eye and go “Yes!” — it just happened — and nobody wants to sign a ten-page form before coming within two feet of each other.

The trouble is that implied consent, which actually is the norm, is difficult to convey and difficult to interpret; even more so if you’re socially inept and drunk. What is clear, even to someone totally socially inept, even when drunk, is a clear “No”, coupled if necessary — and hopefully it won’t be — with physical resistance. This also has the benefit of making it obvious to anyone else present that there is a problem. Most people will intervene to help you, if they know there is something wrong. If, on the other hand, something is happening and you aren’t doing anything to stop it or giving any indication of distress they will probably regard it as none of their business.

“But you don’t know what you’re talking about”. I can hear it already. Well, actually:

  1. I’ve had to fend off unwanted advances, from men too (and yes, men are more aggressive sometimes than women). So yes, I’ve been on the receiving end, as it were. Being clear about it works.

  2. When I was at school, I managed to make the life of one of the girls in my class utterly miserable. I didn’t mean to — it was the last thing I wanted — but, partly because I didn’t understand that she was not and would never be interested in me, I upset her.

    To be clear, I never touched her, so we’re certainly not talking about anything like the incident I linked to above, but I needed to be told that I was upsetting her. Looking back, I can see that she tried to be subtle about it, but as a teenager I was about as socially inept as they come and there was really no chance of me “getting it”. Had she clearly told me that I was upsetting her and asked me to stop behaving the way I was, much heartache could have been saved for both of us.

    As it was, she burst into tears on her way home, I heard about it second hand, and I spent the next five days pretty much in tears the whole time about how I’d hurt her.

    I’m ashamed to this day of this episode in my life, and it put me off even trying to have any kind of relationship for quite some time afterwards, just in case I got things wrong again and upset someone.

So I do have some perspective on this issue.

Let me re-iterate: if someone is doing something you don’t like, please TELL THEM. It’s no good letting them carry on doing it and expecting that they will notice that you aren’t kissing them back, or that you’re ignoring them, or some other such thing. Also, go for clarity. “No” is good; it’s simple, to the point and should have the desired effect.

Weev

There are numerous articles floating around the ’Net deploring the fact that the notoriously unpleasant Andrew Auernheimer (aka “weev”) has been locked up for stealing the e-mail addresses of some 114,000 different AT&T iPad 3G customers.

The most recent one to grab my attention was a piece by Errata Security’s Robert Graham which appears to intentionally misinterpret various aspects of the U.S. Government’s case against weev in order to support the notion that the CFAA (the U.S. equivalent of the U.K.’s Computer Misuse Act) could be used to arbitrarily prosecute anybody, and makes various arguments as to why the prosecutors’ case amounts to a “liberal reinterpretation of the rules of the Internet” to find weev in violation.

Unfortunately, as with most such claims I have seen, Graham has confused the technical and legal arguments in order to arrive at the conclusion he wants to reach.

First, he discusses “User-agent rules”, implying that because the User-Agent header is not intended to be used as a means of identification, the fact that Auernheimer’s accomplice, Spitler, spoofed the iPad’s User-Agent string should be disregarded. What he fails to mention is that while the User-Agent is not supposed to be used as a means of identification, AT&T was in fact using it that way, and both of them knew this to be the case, since they knew that the AT&T server would not respond unless the User-Agent string was set to something matching that provided by an iPad.

Second, he talks about the URL, and in particular the fact that the URL is visible to end-users in ordinary web browsers and that it is also generally modifiable by end-users. Again, the facts surrounding this particular case are omitted in favour of generalisations.

In particular, Graham’s argument fails to note that the mere existence of a web server on a particular machine most certainly does not imply that anyone anywhere is authorised to access any URL at that address. Even if some URLs at a particular address are public, there may well be private URLs that are unintentionally accessible; that does not mean that you are authorised to use them.

Moreover, the URL at issue was not public — it formed part of the carrier settings for AT&T supplied to iPad users in encrypted form and was never presented in the clear to those end users for their use. In order to obtain it in the first place, Spitler had to decrypt the iOS binary (which he was not in any case entitled to be in possession of, as he did not have an iPad), and then had to hunt through the resulting data for likely URLs.

While there is a legitimate worry that the CFAA or the Computer Misuse Act could end up being applied to a normal Internet user who had merely replaced ?articleID=12 with ?articleID=13, that is very much not what happened in this case. Rather, the secret URL in question made use of the device’s ICC-ID to allow AT&T to pre-fill the user’s e-mail address in a log-in form. The ICC-ID is a secret number shared between the device and the network operator; it is a bit like a credit card number, both in length and format, and in that the number is very likely visible on the SIM card and/or packaging. It is also like a credit card number in that it is possible to generate valid numbers — and in this particular instance, Spitler and Auernheimer did so, based on the knowledge that the ICC-IDs issued to AT&T iPads were sequential in nature.

Note again — the mere fact that it is possible to generate valid numbers that are not yours does not entitle you to use those numbers to authenticate yourself, either to a bank in the case of credit card numbers, or in this instance to AT&T’s network. Nor does the fact that both credit card numbers and ICC-IDs provide relatively poor security guarantees mean that you are entitled to take advantage of that fact.

The prosecution has not, as Graham claims, insisted that mere editing of a URL is illegal. The unlawful act is the unauthorised access, and there is no question that both Spitler and Auernheimer knew that they were not authorised to harvest e-mail addresses from AT&T’s servers in this manner (indeed, as the prosecution points out, Auernheimer referred to it as “the theft” in an e-mail to a journalist).

Graham then goes on to worry that “legitimate” security researchers may in fact be no different from Auernheimer, which is a bit of a stretch. White hat security researchers, who ask permission (or are contracted by the target) before attempting any kind of exploit, could not find themselves in the mess Auernheimer finds himself in. Even grey hats, who may not ask permission first, would at least have told their target about the security hole and given them the opportunity to repair it. Auernheimer did not do that. Instead, he attempted to use the security breach for his own personal gain (in this case to promote himself and his security company), and purposely did not inform AT&T before handing the e-mail addresses over to the press.

This is not, as some are suggesting, a case of the U.S. Government making an “arbitrary decision that you’ve violated the CFAA”. That conclusion is simply not supported by the evidence.

Should Auernheimer be locked up? Undoubtedly, if not for this then for any number of other things he’s done in the past. What about for this specific offence? Yes, I think he should. It seems to me unarguable that AT&T did not intend for anyone to be able to garner a list of e-mail addresses from their server. It also seems unarguable that Auernheimer and Spitler knew this — i.e. they knew that their activity would constitute unauthorised misuse. Auernheimer is certainly unpleasant but he is not an idiot, and he must also have been aware that there was a risk that law-enforcement might take issue with his activities. There is simply no way to reconcile these facts with the notion that he is somehow being victimised for doing something that is in any way reasonable.

Sad Times

This is my second attempt at writing about a subject that I find very difficult.

The summary is that my company, Coriolis Systems, is not doing so well. That’s not to say that we’re going anywhere any time soon, but the fact is that as SSDs have become more prevalent, there’s less and less need for our existing products.

This is particularly true for iDefrag, our best-selling product and the one that generates most of the company’s revenue. There’s usually no need to defragment SSDs.

The situation hasn’t been helped by the rise of the Mac App Store, from which all of our current products and even one or two of those we have under development are barred because they need low-level access to the system. New users who want software are very likely to use the App Store to find it, and we can’t be in there. On the other hand, products with misleading names like “Disk Doctor” appear to be very much welcome (it’s in “Top Paid”, and it very definitely is not a disk repair tool, which is what its name implies it should be).

To help illustrate, here’s a graph showing our sales, month on month, from when the company started to the end of the last financial year.

The result of this is that, at the end of this month, I’m having to make two of my staff, Ed Warrick and James Snook, redundant, and we’re also leaving the office premises we’ve been using for the past five years. Ed and James are really good guys and it’s a real blow to lose them; you’d be a fool not to hire either of them, quite frankly.

Encapsulation in C

This comes up again and again, and I’ve seen various bad advice given, even in textbooks, so I thought I’d write a quick post about it. People are increasingly familiar with OO languages and perhaps not so familiar with plain old C, so when faced with having to drop down to pure C for some reason it’s quite common to ask how to achieve some of the design patterns they’re familiar with from OO, such as encapsulation, data hiding and the like.

What do I mean? Well, the canonical example where this kind of thing is useful is data structures, so let’s imagine that we want to make a library of functions that implements a key-value map. We might start with a header file containing something like this:

typedef struct {
  void *key;
  void *value;
} kvmap_entry_t;

typedef struct {
  unsigned size, count;
  kvmap_entry_t *entries;
  void          *userdata;
} kvmap_t;

typedef int (*kvmap_key_comparator_t)(const void *k1, const void *k2,
                                      void *userdata);

kvmap_t *kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata);

void kvmap_destroy (kvmap_t *map);

void kvmap_set (kvmap_t *map, const void *key, const void *value);

void *kvmap_get (kvmap_t *map, void *key);

void *kvmap_remove (kvmap_t *map, void *key);

There are lots of possible objections to this, but perhaps the worst thing is that the actual data structure is visible to the user of these APIs. Looking at the above, it seems likely that the map is currently stored as a sorted list (which is a perfectly reasonable representation for small key-value maps, especially if look-ups dominate), but in the future we might want to use a more sophisticated structure—or even (and this is what Core Foundation does on the Mac) choose a structure based on the data we have. If we expose it to the user, we can’t.
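To see why this matters, consider what any caller can now do, entirely legally as far as the compiler is concerned. This careless_caller is hypothetical, but nothing prevents it:

#include <stdlib.h>
#include "kvmap.h"

void careless_caller (kvmap_t *map)
{
  map->count = 42;        /* corrupt the map's bookkeeping */
  free (map->entries);    /* free storage the map still owns */
  map->entries = NULL;    /* and quietly lose every entry */
}

Worse, any code that pokes at the structure like this will silently break the day we change the representation.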

Textbooks and articles you might have read in print often suggest the following “solution”: use void *. Then our header looks like this:

typedef void *kvmap_t;

typedef int (*kvmap_key_comparator_t)(const void *k1, const void *k2,
                                      void *userdata);

kvmap_t kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata);

void kvmap_destroy (kvmap_t map);

void kvmap_set (kvmap_t map, const void *key, const void *value);

void *kvmap_get (kvmap_t map, void *key);

void *kvmap_remove (kvmap_t map, void *key);

Then in your implementation routines you’ll need to write something like

#include "kvmap.h"

typedef struct {
  void *key;
  void *value;
} kvmap_entry_t;  /* no longer defined in the header, so we need it here */

typedef struct {
  unsigned size, count;
  kvmap_entry_t *entries;
  void          *userdata;
} kvmap_internal_t;

...

void *
kvmap_get (kvmap_t map, void *key)
{
   kvmap_internal_t *pmap = (kvmap_internal_t *)map;

   ...
}

Looks good, right? I mean, we can’t see the data any more.

Well…

It’s better, in the sense that you indeed cannot now see the implementation quite so obviously. Unfortunately, using void * means that there is no longer any useful type checking. We can pass any pointer whatsoever as the map parameter of any of these functions, and we can just as easily pass the map pointer to any number of unrelated functions by accident.

There is, however, a better way.

C supports forward declarations of struct types, and will allow you to use a pointer to a struct whose contents you have not yet specified. We can use this feature to write a better version:

typedef struct kvmap *kvmap_t;

typedef int (*kvmap_key_comparator_t)(const void *k1, const void *k2,
                                      void *userdata);

kvmap_t kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata);

void kvmap_destroy (kvmap_t map);

void kvmap_set (kvmap_t map, const void *key, const void *value);

void *kvmap_get (kvmap_t map, void *key);

void *kvmap_remove (kvmap_t map, void *key);

Now in your implementation you can just define struct kvmap. You don’t need to do this in the header file, and you don’t need a typedef:

#include "kvmap.h"

typedef struct {
   void *key;
   void *value;
} kvmap_entry_t;   /* again, this type is now private to the implementation */

struct kvmap {
   unsigned size, count;
   kvmap_entry_t *entries;
   void          *userdata;
};

then when you want to use it in your functions, you can just treat the map argument as a pointer.
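For instance, a minimal sketch of the constructor might look like this. (Note that the struct kvmap above has nowhere to keep compare_keys; a full implementation would add a field for it.)

#include <stdlib.h>
#include "kvmap.h"

kvmap_t
kvmap_create (kvmap_key_comparator_t compare_keys, void *userdata)
{
  kvmap_t map = malloc (sizeof (struct kvmap));

  if (!map)
    return NULL;

  map->size = 0;
  map->count = 0;
  map->entries = NULL;
  map->userdata = userdata;

  (void)compare_keys;  /* a full version would store this; see note above */

  return map;
}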

The best part is that the C compiler will now complain if you try to pass anything that is not a kvmap_t as the map argument. And if you have other abstract data type implementations (maybe you have some list routines as well), it will complain if you pass your kvmap_t into one of them accidentally as well.

There’s loads more I could write on this topic, but for now, just remember that it’s OK in C to define, declare and use a pointer to a struct type that you haven’t defined. You can’t dereference it or do pointer arithmetic on it, but that’s kind of the point :-)

Yet More Anti-filesystem Rhetoric

Marco Tabini’s article on the Macworld site is the latest to point out the “steadily escalating war against the filesystem”, in this instance waged by Apple, though Microsoft and others have been conducting operations in this arena too.

Marco unfortunately cites packages as an example of an anti-filesystem “thing” invented by Apple, which is wrong on two fronts: first, packages really don’t imply anything about filesystems one way or the other; and second, it was arguably NeXT that came up with the idea in this particular context, though even they couldn’t have claimed it as a unique innovation (the Acorn Archimedes, which pre-dates the NeXT machine by a year, also used folder-based applications).

History lessons aside, though, the anti-filesystem rhetoric is a mistake, founded on a whole pyramid of mistakes. The first and most important of these mistakes is the assumption that the end user is a total, gibbering idiot. Indeed, it is easy for any software developer to see how such an assumption might come about, given some of the queries we have to deal with day-in day-out, but fundamentally the idea of a hierarchical filing system is no more complicated than that of a filing cabinet. Would you be offended if I decided you were such an imbecile that it would be unthinkable that you could fathom the complexities of a filing cabinet? Yes, and rightly so.

I want to divide this post into two parts. In the second, we’ll examine why it is that users find the notion of the filesystem so confounding. But first, let’s consider the alternative model that is usually proposed.

The “Document-centric” Model

Those opposed to the notion of the filesystem like to talk about a “document-centric” interface. So, if I create a wordprocessor document in Pages on my iPad, or even in iCloud using Pages on my Mac, that document “lives” somehow “in” Pages.

At first glance, this seems a great idea. If you ask users (who are most certainly confused) where their documents are, they will often come up with explanations like “I saved it in Word”. And as any computer-literate person knows, woe betide you if you use Word on someone else’s computer to open a file that isn’t in whatever default location the Open dialog shows. There is a high probability that the result will be that said someone else will excoriate you for “fiddling” and “losing all their documents”.

So where’s the problem? Clearly it fits with user expectations as to how things behave, and that’s normally a good thing, right?

The problem is actually exposed rather neatly by the UI for selecting documents that has been adopted in some iOS software that uses this pattern. Earlier versions of the iWork applications, and also Omni Group’s software, used a file chooser that looks a bit like this:

If you have more files, you can swipe left and right to see them.

Unfortunately, this interface is only any use if you have a tiny number of documents, and so more recently both Apple and Omni have changed to a grid-based chooser, like this:

Looks great, right? Until you realise that now all my documents about animal husbandry are going to be mixed up with the letters I’ve sent to my bank, the copies of that report I wrote for work, etc.

What’s the logical method of fixing that, you ask? Well, Apple has already showed the way by adding groups to the iOS SpringBoard app (that’s the application chooser, for those who don’t know). Let’s take a look at that too:

Wow! What a great interface, right? Well, sure, it’s OK, though what you’ve really done here is invented a rubbish new version of a hierarchical filesystem.

Yours has exactly one level of hierarchy (so I can’t further group my animal documents by type), and probably a small limit on the number of items per group too. It is also a little sparse on the metadata front — we know the name of each document, its size and the date it was created, but a typical modern filesystem can manage quite a bit more data than that.

Oh, and to cap it all, every single application has to implement this functionality itself, and may implement it subtly differently. For instance, maybe it isn’t permissible to have two files with the same name? Perhaps there are restrictions on the lengths of filenames? Or on the sizes of files?

Additionally, because every application only has access to its own files (and exactly how they’re stored is in any case up to the application), it’s really hard for any other application to access your “My Horse” document, even if that’s what you as an end user want. You can, of course, use hacks like huge long URLs to pass data between applications, but that risks losing valuable metadata and may also create security holes in the user’s web browser in the process.

Summary: In order to fix the problems with your “document-centric” vision, you’ve been forced to reinvent the hierarchical filesystem. But your version is a bit rubbish compared to even the worst present-day filesystem.

Instead of re-inventing the filesystem in the name of getting rid of the filesystem, could we, perhaps, just use the filesystem?

So what’s wrong with the filesystem

It’s easy to see that when people talk about wanting to “get rid of the filesystem”, what they really want to do is to remove the confusion that users seem to experience when presented with simple filesystem tasks on modern computers. Unfortunately, rather than examining the cause of this problem, too many designers and developers have jumped immediately for what they see as the solution.

So what is the cause? Well, back in 1989, I got my first 32-bit micro, an Atari ST (yes I said 32-bit, and yes, I mean 32-bit; only the PC was 16-bit… the Atari, Commodore and Apple machines of the era were all 32-bit from the outset). It had, like the Apple machines but not entirely like Commodore’s line, a ROM-based operating system, and so when you looked at your disks, the chances were fairly good that they were empty. That is, the entire area was yours to use as you pleased.

If you bought an application for your machine, it would come on disks, but most likely you’d have a favourite disk or disks and you’d just copy the application and any files it needed from its distribution disk to an appropriate place on your disk(s). Yes, there were some things that were fixed (e.g. the Atari range would run, on boot, anything in a folder called “Auto” in the root directory of the disk in their A: drive, they might also load a file called “desktop.inf” containing the GEM Desktop’s preferences, and so on). But the number of those things was small, and for the most part the disk was yours.

When I first got a hard disk, a huge monster with only 20MB of storage in total, the situation was very much the same. My C: drive was mine. Yes, by that time I had a multitude of programs in my “Auto” folder, as well as some “desk accessories”, and I might have had a few more config files, some fonts and so on lying around, but overwhelmingly the layout of my files and folders was my own. I knew where to find my documents on the finer points of train-spotting because I put them there.

The “filesystem is hard” problem started, I think, on the PC. I think it started under DOS, where some business applications shipped with relatively large numbers of files that needed to be copied into a directory in order to run (in contrast, even relatively large pieces of software on the Atari platform tended to consist of a couple of files). There were good reasons for this; DOS programmers weren’t being idiotic — they just had to deal with limited address space (640KB) and if they wanted to e.g. print something, well then they’d need drivers for every available printer, because DOS didn’t know how to do that. (Contrast: the Atari platform was a GUI with a virtualised graphics device interface, and so drawing to the printer was basically the same as drawing to the screen, though you needed to use a different graphics device.)

So, when you bought Wordperfect or Microsoft Word or similar for DOS, the chances were good that you’d have a few disks’ worth of files to install. You could have copied them yourself, but that’s a bit annoying so to help you out, they’d ship with a program that would install the software for you.

With the advent of Windows, matters became worse. Windows was a large piece of software, and it had a new feature — dynamic linking — that its designers had enthusiastically adopted, breaking the API up into chunks and placing them in separate library files. Plus it was graphical, and so it needed fonts (bitmap fonts at first, TrueType later), its own printer and graphics drivers, its own networking drivers and so on and so on. Lots of files — in fact, so many that you might not want all of them installed in the precious space on your expensive hard disk. Ergo, an installer was required.

The upshot of these installers is that now you have large areas of the disk that you, the user, did not organise. Some of these areas may be fragile; if I rename “C:\WINDOWS” to “C:\MSWIN”, will it work? And what’s this “USER.DLL” file anyway? It looks big — do I need it? Can I delete “HPLJ4.SYS”? And so on.

Windows also made the filesystem less accessible by creating “Program Manager”. This was a way for Windows applications to show a single icon by which they could be started, without the end user having to know necessarily where on the hard disk the program file itself was installed. Arguably it was made necessary by the messy layout of the “C:\WINDOWS” folder, which contained a fair number of applications that shipped with Windows itself, but which was very definitely a fragile area where user tampering could cause trouble.

Additionally, the Windows “File Manager” was nothing like the interfaces provided on the Atari ST, Commodore Amiga and Apple Macintosh platforms. It had a two-column user interface reminiscent of its “MS-DOS Shell” predecessor; this interface is far from intuitive, and to many users File Manager would have been a total mystery. (Contrast: the Atari ST came with a disk that had a training program on it that taught the user to use a mouse, to create and navigate through folders and to copy, move and delete files.)

Given the ever larger number of files shipping with major software packages for Windows and the fact that File Manager and COMMAND.COM are fairly poor user interfaces for an unfamiliar user, the installer was here to stay. How many files did Word 2 install on your machine? Do you know? Do you know where? Most users didn’t, and most users didn’t care.

Some other unfortunate design choices at this point made matters much worse than they had to be. The Windows “Open” and “Save As” dialogs were similar in design to those on other systems, but because of the lack of a proper equivalent to the Macintosh Finder, the GEM Desktop or the Amiga Workbench and because of the reliance on installers, fewer and fewer users had ever really seen the filesystem. Mostly they’d typed in a few arcane commands, or even booted their machine from a disk that installed Windows automatically, and then inserted some disks that installed Microsoft Office, and that was that. As a result, when presented with these dialogs, users often didn’t know what they were looking at. Worse, they would often default to idiotic locations, like the “C:\WINDOWS” directory, or the install directory for the application in question, with the result that, since the only thing in the box that the user understood was the file name field, many users would save all their documents in the Windows folder. Or the “C:\WORD” folder. And so on.

Now, Windows 95 made some substantial improvements, adding Windows Explorer (and no, I do not mean the File Manager interface, I mean the entire desktop environment), and removing Program Manager (or, perhaps more accurately, replacing it with the Start menu). Unfortunately, a lot of users came from Windows 3, and so were already used to not knowing about the filesystem; a lot of developers carried on shipping software with large numbers of files, using installers; and Microsoft contrived to make the confusion worse by adding the “Program Files” folder and in OSR2, the “My Documents” folder, contributing further to the impression that the user’s disk should be organised more for the convenience of software developers than for their own purposes.

Rather than completely blaming Microsoft, let us at this point look at Mac OS X, which didn’t inherit filesystem problems from Microsoft, but instead has borrowed them from UNIX.

The original Mac OS was very much like the Atari ST and Commodore Amiga systems, in that the user was very aware of the organisation of data on his or her disks. As with all systems, over time, more clutter turned up on the disk, particularly the hard disk from which the system was booted, but fundamentally the Finder, like the GEM Desktop and Amiga Workbench, was designed to quickly, simply show the user what was on the disk. If you wanted to run an application, you navigated to it on your disk and double-clicked it; there was no false hierarchy like that of Program Manager or the Start Menu.

While Mac OS X inherited much from Mac OS 9 and earlier, a lot of its underpinnings came instead from NeXT, whose operating system was based on BSD UNIX. Now, a UNIX system is inherently multi-user in nature; this is quite a departure, actually, from previous consumer desktop operating systems, and it has some implications. For one thing, UNIX has a notion that there might be a systems administrator of some sort, and that it is a requirement that users can’t tamper with the system or even with each others’ files. For another, UNIX has a long tradition of hard-coded paths (e.g. you can rely on a Bourne shell existing at “/bin/sh”), and coupled with the UNIX idea of a single unified filesystem namespace, this implies again that the user cannot be in control of the disk. Well, the root disk, at any rate.

The mistake Mac OS X makes here is the same one that the various attempts at Linux on the desktop make — they expose the root of the filesystem namespace to the user, and then in the case of Mac OS X go to great lengths to hide all kinds of “special” (and fragile) folders from end users who can’t be relied upon to understand their contents or the fact that they shouldn’t tamper with them. More recently, Mac OS X removed disk icons from the desktop, leaving it empty by default — there isn’t even an icon for the user’s home folder. Small wonder new users don’t understand the filesystem if you don’t show it to them!

Finally, Mac OS X and Windows, as well as numerous third-party software packages have made matters worse by placing all kinds of extra files and folders in users’ home folders. I understand the argument for them being there, but every extra file or folder of this type is contributing to the confusion users experience when (if) they are shown their disk. Hiding it, as Mac OS X does with “~/Library” is a half-assed solution; what if I wanted a folder called “Library”? I can’t have it, that’s what. Hiding things also creates problems if users want to back up their files; e.g. should I back up “~/Library/Preferences”? Probably, but I most likely do not want to back up “~/Library/Caches”.

But the filesystem is hard

No, no it isn’t. It’s like a lady’s handbag or a gentleman’s tool box. You can imagine putting ever smaller bags within a handbag, or even smaller boxes in a tool box, and so can your users. You’d have to be really quite sub-normal to have difficulty with this notion, actually.

The “hard” part is knowing that it actually exists, or that it’s a bit like a bag full of bags in the first place, and that’s our fault as developers and designers.

So what should we do?

Well, for one thing, stop reinventing the filesystem in the name of ridding us of the filesystem. For another, Apple needs to ship a filesystem chooser in iOS, and, sandboxing or no, the user needs to be able to pick any file they like from any application that knows how to open it. That shouldn’t mean applications automatically get access to any old file — I’m quite happy for the user to pick it.

Second, treat the user with some respect. Stop putting system files and application files in users’ home areas without asking. There’s an argument for storing preference files there, and maybe even for allowing users to install plug-ins and drivers and things, but in that case you need to add exactly one folder (which should probably be called “System”, not “Library”, as the chances of a user wanting a folder called “System” are quite small) and everything should go inside it. You might even care to stick a “Read Me” file in it to explain to users what it is. You might convince me of the need for a “Temp” folder or similar as well. But that’s that, and neither of these should end up deeply nested or with lots of data in them.

Third, show the user their home folder. Put it on the desktop. And show them any disks or storage devices they attach too. Don’t hide them away, and don’t go creating mysterious files and folders on them without being told to.

Finally, stop spouting rubbish about the filesystem. It isn’t hard, it isn’t complicated, and users can understand and use it.

Why Not Use a Spreadsheet?

This BBC news article reminded me that I wanted to write a short piece about spreadsheets, and in particular about an entirely non-obvious danger that spreadsheets pose to their users.

What do I mean? Well, calculators and spreadsheets may give different answers for the same calculations! That fact surprises many people, even people who really ought to know it already.

Why does this happen, and why should I care? OK, so the first thing you need to know is that the calculator on your desk probably represents numbers the way you think about them — i.e. in decimal. So, on your calculator, when you see 2.1 displayed on the screen, the number the calculator holds in its memory really is 2.1.

Your computer, on the other hand, prefers to use binary rather than decimal to store numbers, the reason being that manipulating binary numbers is hugely faster for a computer. Now, in binary, 2.1 is 10.0001100110011…, a recurring fraction. As a result, when your spreadsheet shows you the number 2.1, it is lying. The number it has in its memory is not 2.1; it is something very close to 2.1, but since no finite binary fraction can represent 2.1, it cannot be exactly 2.1.

I don’t care, you say. Well, maybe you do, maybe you don’t. For instance, if you take your calculator and enter 0.1+0.1+0.1-0.3, you get the expected answer 0. If you do the same in a spreadsheet, it may show you 0, but it will actually have calculated something like 5.55×10⁻¹⁷. Similarly, on your calculator, 0.1×0.1-0.01 is 0, whereas on your computer, it is very probably around 1.73×10⁻¹⁸.

Worse, the chances are that the people who wrote your spreadsheet software knew that this problem existed, and so they try to hide it from you. Well-written binary floating point libraries will always attempt to find the shortest decimal that matches the binary representation they have, so you will often find that the answer looks the same on the screen.

At this point, unless you’re a pedant, you probably still believe that you don’t care — after all, the inaccuracy is very small. But let me convince you otherwise; imagine you are an examiner, marking an exam script. Further imagine that students have been told to present their answers correctly rounded to two decimal places. Set the cells in your spreadsheet to round to two decimal places (this is usually an option under the Format menu) and enter the following into a cell:

=3.013 * 5

You should see the correctly rounded answer, 15.07 (the actual answer is 15.065). Now let’s imagine we are also told to subtract 15 from it; enter

=3.013 * 5 - 15

The chances are quite good that your spreadsheet is now showing 0.06, and not 0.07. Entering the same thing on your calculator should verify that this is wrong (you’ll get 15.065, minus 15 = 0.065, which rounded to 2 d.p. is 0.07).
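You don’t need a spreadsheet to reproduce this, by the way. Most programming languages use the same IEEE 754 double precision arithmetic that spreadsheet software typically uses, so a few lines of C show the same behaviour:

#include <stdio.h>

int main (void)
{
  printf ("%.17g\n", 0.1 + 0.1 + 0.1 - 0.3);  /* about 5.55e-17, not 0 */
  printf ("%.17g\n", 0.1 * 0.1 - 0.01);       /* about 1.73e-18, not 0 */
  printf ("%.2f\n", 3.013 * 5 - 15);          /* prints 0.06, not 0.07 */
  return 0;
}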

If the exam board had provided a spreadsheet to its examiners to help with the marking, and it causes this kind of error, students are going to lose marks for writing the correct answer. That might make the difference between someone going to university and not; you might have messed up their entire life simply because you didn’t understand that spreadsheets do arithmetic in binary and not decimal.

How can we fix this problem? Well, computers can accurately represent integers, so you could just multiply everything by 1,000 and then divide at the end; i.e. enter

=(3013 * 5 - 15000) / 1000

which will correctly round to 0.07. Yes, that’s right, you get different answers in your spreadsheet from

=(3.013 * 5 - 15)

and

=(3013 * 5 - 15000) / 1000

and not only that, but the latter is more accurate in spite of having an extra calculation in it (welcome to floating point, by the way).

The best solution, of course, is to use something that does decimal arithmetic when you actually care about having an accurate decimal result.

Open Source Entitlement

Some days I wonder why I bother. I’m sure others who have open sourced their code have had (and continue to have) the same experience. In fact, I’ve read about it, so I know it affects all of us, but here’s a summary of events that have consumed a substantial amount of my time today:

  1. At around 7:35pm last night, Andreas Jung of zopyx.com sent me an e-mail to ask what had happened to pwtools 0.3 as it seemed to be MIA.

  2. I took a look at PyPI and found that, indeed, the record for version 0.3 had inexplicably vanished. No problem — I added it back, and also posted version 0.4 (with an updated word list).

  3. This afternoon, Andreas sent me the following terse e-mail:

    Why can’t you just upload a release file to PyPI? If you host is down then everybodys buildouts are broken.

Now, leaving aside the fact that I’d already fixed his problem, and that it had nothing to do with my website whatsoever, this e-mail is rather rude. Andreas is using a piece of software I’ve released for free in source code form, and is now demanding that I do something for his personal convenience. Since I was busy, I pointed out that the problem wasn’t that I hadn’t uploaded a file to PyPI, and that he couldn’t rely on PyPI in any event if he wanted his buildout to work no matter what. (PyPI has probably had more downtime than my site in recent times anyway, even if there are mirrors of it available these days.)

A couple of e-mails later, after I complained about his rude e-mails and sense-of-entitlement, Andreas informed me that

This is the typical egotistic Python package maintainer mentality. …I call this clearly asshole attitude…Once again: this is egocentric asshole mentality.

Hardly surprising that he finds it typical, I think. Andreas then proceeded to fork the package (fair enough, it’s MIT Licensed), but has left the author field on his fork set to my name, the website field set to my website, and changed the package description to

pwtools provides a robust password generator and a password security checker based on the design of libpasswdqc. pwtools does not use code from libpasswdqc, but is implemented in pure Python. This is a fork since the primary maintainer refuses to upload release files on PyPI.

This is completely untrue. In actual fact, the only reason I hadn’t uploaded the source distribution is that when I started out with PyPI, it was called the cheese shop and didn’t support that. Sometimes I forget that these days I should do python setup.py sdist upload rather than just registering the new version. Not a big deal, but equally not something I’m going to rush to fix just to satisfy Andreas Jung (who, I expect, is making money from whatever he’s using Python for).

As a result, I’m now faced with having to waste the PyPI maintainers’ time asking them to fix his forked package’s record.