Bit-rot and RAID - Alastair’s Place

There’s an interesting article on Ars Technica about next-generation filesystems, which mentions something it calls “bit rot” — allegedly the “silent corruption of data on disk or tape”.

Is this a thing? Really? Well, no, not really.

Very early on, disks and tapes were relatively unreliable and so there have basically always been checksums of some description to let you know if data you read is corrupted. Historically, we’re talking about some kind of per-block cyclic redundancy check, which is why one of the error codes you can receive at a disk hardware interface is “CRC error”.
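To make the per-block check concrete, here is a minimal sketch. It uses the CRC-32 from Python's `zlib` module, which is an assumption made for illustration and not what any particular drive uses; the helper names and the 512-byte block size are hypothetical too.

```python
import zlib

BLOCK_SIZE = 512  # assumed block size, for illustration only


def store_block(data: bytes) -> tuple:
    """Return the block together with the CRC-32 computed at write time."""
    return data, zlib.crc32(data)


def read_block(data: bytes, stored_crc: int) -> bytes:
    """Return the data if its CRC still matches, otherwise raise an error."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("CRC error")
    return data


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

Reading back an undamaged block succeeds, while reading one that has been through `flip_bit` raises the “CRC error”. A plain CRC like this only detects the damage; it cannot repair it.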

Modern disks actually use error-correcting codes such as Reed-Solomon codes or low-density parity-check (LDPC) codes. Under such schemes a single random bit error can simply be corrected, end of story. They may be able to correct multiple-bit errors too, and these codes can detect more errors than they are able to correct.
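Single-bit correction is easy to see with a toy code. The sketch below uses Hamming(7,4), which is far simpler and far weaker than the Reed-Solomon or LDPC codes a real drive uses; I've swapped it in purely because it fits in a few lines. The function names are my own.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, syndrome)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # A non-zero syndrome is the 1-based position of the flipped bit.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # flip one stored bit
recovered, position = hamming74_decode(word)
```

After the flip, `recovered` is the original `[1, 0, 1, 1]` again and `position` names the bit that was repaired. The reader of the data never sees the error.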

The upshot is that a single bit flip on a disk surface won’t cause a read error; in fact, the software in your computer won’t even notice it because the hard disk will correct it and rewrite the data on its own.

It takes multiple flipped bits to cause a problem, and in most cases this will result in the drive reporting a failure to the operating system when it tries to read the block in question. The probability of a multi-bit failure that can get past Reed-Solomon or LDPC codes is tiny.

The author then goes on to make a ludicrous claim that RAID won’t be able to deal with this kind of event, and “demonstrates” by flipping “a single bit” on one of his disks to make his point. Unfortunately, this is a completely bogus test. He has, in fact, flipped many more bits than just the one, and he’s done so by writing to the disk, which encodes his data using its error-correcting code. The result is a block that reads correctly, because he has deliberately stored the wrong data there.
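The difference between the two kinds of “flip” can be sketched with a toy drive model. Everything here is hypothetical: `ToyDisk` and its methods are made up for illustration, and a CRC-32 stands in for the drive's real error-correcting code, so this model can only detect damage, not repair it.

```python
import zlib


class ToyDisk:
    """Hypothetical drive model: each block is stored with a check value."""

    def __init__(self):
        self._surface = {}  # block number -> (raw bytes, stored CRC-32)

    def write(self, block_no, data):
        # A write through the drive interface always re-encodes the block,
        # so the check value matches whatever data was written.
        self._surface[block_no] = (bytes(data), zlib.crc32(data))

    def read(self, block_no):
        raw, stored_crc = self._surface[block_no]
        if zlib.crc32(raw) != stored_crc:
            raise IOError(f"unrecoverable read error in block {block_no}")
        return raw

    def damage_surface(self, block_no, bit_index):
        # Models physical corruption: the raw bits change but the check
        # value does not, because nothing re-encoded the block.
        raw, stored_crc = self._surface[block_no]
        damaged = bytearray(raw)
        damaged[bit_index // 8] ^= 1 << (bit_index % 8)
        self._surface[block_no] = (bytes(damaged), stored_crc)
```

Overwriting a block with altered data via `write` leaves a block that `read` happily returns, which is what the article's test did. Only `damage_surface` produces something the drive itself can notice.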

The fact is that, in practice, when an unrecoverable data corruption occurs on a disk surface, the disk returns an error when something tries to read that block. If a RAID controller gets such an error, it will attempt to rebuild the data using parity (or whatever other redundancy mechanism it’s using).
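As a rough sketch of that rebuild step, assume a single XOR parity block per stripe (as in RAID 5) and assume a failed read shows up as `None`. The function names are mine, not any controller's API.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block for one stripe."""
    return xor_blocks(data_blocks)


def read_stripe(blocks, parity):
    """Return the stripe's data, rebuilding one block if its disk failed.

    `blocks` holds one entry per data disk; an entry is None when that
    disk reported a read error for its block.
    """
    failed = [i for i, block in enumerate(blocks) if block is None]
    if not failed:
        return list(blocks)
    if len(failed) > 1:
        raise IOError("more than one block lost; single parity cannot rebuild")
    survivors = [block for block in blocks if block is not None]
    result = list(blocks)
    # XOR of the surviving data blocks and the parity gives the lost block.
    result[failed[0]] = xor_blocks(survivors + [parity])
    return result


stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(stripe)
rebuilt = read_stripe([b"AAAA", None, b"CCCC"], parity)
```

Here `rebuilt` comes back identical to `stripe`. The rebuild depends on the drive reporting the error in the first place, since that is what tells the array which block to reconstruct.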

So RAID really does protect you from changes that occur on the disk itself.

Where RAID does not protect you is on the computer side of the equation. It doesn’t prevent random bit flips in RAM, or in the logic inside your machine. Some components in some computers have their own built-in protection against these events: for instance, ECC memory uses error-correcting codes to prevent random bit errors from corrupting data, while some data buses themselves use error correction. If you are seeing random bit flips in files that otherwise read OK, it’s much more likely they were introduced in the electronics or even via software bugs and written in their corrupted form to your storage device.

An aside: programmers generally use the term “bit rot” to refer to the fact that unmaintained code will often at some point stop working because of apparently unrelated changes in other parts of a large program. Such modules are said to be suffering from “bit rot”. I’ve never heard it used in the context of data storage before.