RAID is not a backup solution, times one million

Via slashdot.org (yes really, I still pull in the headlines, although the miracle of feed readers has allowed me to confirm that yes, Ars Technica is a better read): a site called Journal Space, which hosted weblogs, lost all its data. Its only backup was a RAID setup, that is, a system that mirrors content between two disks and is designed to protect against disk failure. If you’ve heard of RAID, you hopefully already know that it is not the same as a backup: if a software error, an accident or a malicious act deletes data from one disk, the RAID setup faithfully mirrors the deletion to the other disk.

If you haven’t, imagine that you have two magical whiteboards, and whatever is on one is copied exactly to the other. If one magical whiteboard totally breaks down, excellent, you have a full copy of your meeting notes and doodles on the other. However, if someone rubs something off the first whiteboard, or falls over while holding a can of solvent and splashes it on the first whiteboard, whatever was erased is immediately erased from the other too. (A note for accuracy: not all RAID configurations produce a full mirror, and sometimes the mirror is spread over more than one spare disk. But you get the idea.)
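For the programmers in the audience, here is the whiteboard story as a toy Python sketch. It is not how any real RAID implementation works: the `Mirror` class and its method names are made up for illustration, and the "disks" are just dictionaries. It only shows the one property that matters here, namely that a mirror survives a dead disk but copies a deletion straight away.

```python
class Mirror:
    """Toy two-disk mirror (hypothetical, for illustration only)."""

    def __init__(self):
        # Each "disk" is a dict of name -> contents; None means the disk has died.
        self.disks = [{}, {}]

    def write(self, name, data):
        # Every write goes to every working disk.
        for disk in self.disks:
            if disk is not None:
                disk[name] = data

    def delete(self, name):
        # Every delete goes to every working disk too: this is the problem.
        for disk in self.disks:
            if disk is not None:
                disk.pop(name, None)

    def fail(self, index):
        # Simulate a hardware failure of one disk.
        self.disks[index] = None

    def read(self, name):
        # Read from the first working disk that still has the data.
        for disk in self.disks:
            if disk is not None and name in disk:
                return disk[name]
        raise KeyError(name)


if __name__ == "__main__":
    mirror = Mirror()
    mirror.write("notes.txt", "meeting notes and doodles")

    mirror.fail(0)                   # one whiteboard breaks down
    print(mirror.read("notes.txt"))  # the other still has everything

    mirror.delete("notes.txt")       # someone rubs the notes off
    try:
        mirror.read("notes.txt")
    except KeyError:
        print("gone from every disk")
```

A separate copy taken before the `delete` call would still hold the notes. The mirror, by design, does not.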

Instead, for home machines you want, most likely, an incremental backup, that is, a separate disk/machine with several copies of your data going back in time. Your data as it was an hour ago. Your data as it was a day ago. Your data as it was a month ago. And so on. I have snapshots of my data for every three hours over the last two months. (Sensible backup programs will notice when data is the same across two or more time periods and only store it once, so your backup disk does not need to be so very much larger than your normal disk.)
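The "only store it once" trick can be sketched in a few lines of Python. This is a minimal in-memory illustration, assuming files are identified by a SHA-256 hash of their contents; the `SnapshotStore` class and its methods are hypothetical names, not any particular backup program's interface, and real programs do the equivalent on disk.

```python
import hashlib


class SnapshotStore:
    """Toy incremental backup store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blobs = {}      # content hash -> file contents, stored once
        self.snapshots = {}  # snapshot label -> {path: content hash}

    def snapshot(self, label, files):
        """Record the state of `files` (a dict of path -> bytes) under `label`."""
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            # Unchanged contents hash to the same digest, so nothing new is stored.
            self.blobs.setdefault(digest, data)
            manifest[path] = digest
        self.snapshots[label] = manifest

    def restore(self, label):
        """Return the files exactly as they were when `label` was taken."""
        manifest = self.snapshots[label]
        return {path: self.blobs[digest] for path, digest in manifest.items()}


if __name__ == "__main__":
    store = SnapshotStore()
    store.snapshot("a day ago", {"thesis.txt": b"draft 1", "photo.jpg": b"pixels"})
    # Since then the thesis was edited and the photo was deleted by accident.
    store.snapshot("an hour ago", {"thesis.txt": b"draft 2"})

    print(len(store.blobs))                        # distinct contents stored
    print(store.restore("a day ago")["photo.jpg"])  # the photo is recoverable
```

Each snapshot is only a list of paths and hashes, so keeping many of them costs little when most files have not changed between them.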

For business systems you want both: the quick recovery from disk failure that mirroring systems such as RAID offer, and incremental backups. (I don’t maintain business grade systems, so ask someone else for best practices if you need them. Internally consistent database backups are something you want to pay particular attention to.)

I note this because in November I gave a talk on home backups for Linux at SLUG, and there is one other point of interest: do not trust third party providers to have good backups. It is getting increasingly common to have a lot of your most interesting data on someone else’s servers: your email on Google’s, your blog over at wordpress.com, contact details for all your friends on Facebook, and so on. But your provider can both make catastrophically bad decisions of its own, like Journal Space, and have its creditors suddenly sell its hard disks off in a fire sale, as happened to Digital Railroad.

Which is a big problem, because a lot of third party providers do not provide an easy way to get your data (‘easy’ would be both a documented API accessible from common programming languages and an installable application), and lots don’t provide any way at all. (There’s also a whole batch of interesting issues to do with your comments or Wall postings or whatever: you don’t necessarily have the right to reproduce them, and there would be privacy implications in allowing you to back them up and reproduce them on some other site. LiveJournal, for one, solves this problem by not allowing easy backups of comments left on your journal.)

If your email host, blog host, calendar host, documents host or social networking host failed or deleted your account, how would you fare?