Tuesday, September 16, 2008

From tape backup to live standby systems

Short version:

Instead of just backing up to hard drives, let's make the backup version actually operate!


Long version:

Just a few years back, I was using tape backup to protect against server loss and accidental deletion. Of course, hard drives are much cheaper now, and I can afford to archive several copies of our important files on ordinary HDs.

Somebody on this list once said, "nobody cares about backup -- they only care about restore". And many have noted that to know you have a good backup, you have to test restoration of the data.

At risk of channeling the spirit of Extreme Programming: if testing your Restore process is good, then let's do it constantly! Instead of just storing our files on disk, let's make our backup server provide the function of the primary server. Let's just call it Poor Man's Redundancy (PMR).

My assumption here is that files are restored on the backup server in the same layout they had on the source server, using rsync, for example.
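
To make that assumption concrete, here is a minimal sketch of how the mirroring step might be built. The host name, the paths, and the helper name build_rsync_command are placeholders of mine, not our real setup; the only rsync options used are -a (archive mode) and --delete.

```python
import shlex


def build_rsync_command(source_host, source_path, dest_path):
    """Return an rsync argv list that mirrors source_path into dest_path."""
    # -a is archive mode (recursive, keeps permissions, times, symlinks).
    # --delete removes files on the backup that are gone from the primary,
    # so the backup tree ends up matching the source tree.
    # The trailing slashes make rsync copy the directory's contents
    # rather than nesting the directory inside the destination.
    return [
        "rsync", "-a", "--delete",
        f"{source_host}:{source_path.rstrip('/')}/",
        f"{dest_path.rstrip('/')}/",
    ]


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    cmd = build_rsync_command("primary.example.com", "/opt/rt3", "/opt/rt3")
    print(shlex.join(cmd))
```

Keeping the destination path identical to the source path is the point: config files that refer to absolute paths should then work on the backup server without edits.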

We use RT3 -- BestPractical.com's fine open-source ticket tracking system. I take a backup of its MySQL database, and its source code including our customizations. The idea would be to make this ordinary application run on the backup server.

Or our Subversion archive, used for source code development, major documentation projects, etc. We'd enable Subversion on the backup server.

Or our email server; we could set up this box to be an email server, with all of the same accounts, domains, settings, etc.

Or our ordinary WebDAV-based file server; we'd just enable DAV on the backup server pointed at the latest snapshot of files from the primary server.
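
One way to "point at the latest snapshot" is a symlink that gets switched after each backup run. This is only a sketch under my own assumptions: that each backup lands in its own directory, and that the DAV document root is a symlink (called "latest" here). The function name and the paths are hypothetical.

```python
import os


def point_latest_at(snapshot_dir, latest_link):
    """Repoint the symlink latest_link at snapshot_dir."""
    tmp_link = f"{latest_link}.tmp"
    # Clear out a leftover temporary link from an interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(snapshot_dir, tmp_link)
    # Renaming over the old link is atomic on POSIX, so a client never
    # sees a moment where "latest" is missing.
    os.replace(tmp_link, latest_link)


if __name__ == "__main__":
    # Hypothetical layout: /backup/dav/2008-09-16 holds today's copy.
    point_latest_at("/backup/dav/2008-09-16", "/backup/dav/latest")
```

The web server's DAV configuration would then name the "latest" path once and never need to change between backups.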

Let's evaluate this PMR idea:

-- What happens if data changes on the backup server? E.g., what if somebody logs in to RT and updates a ticket on the backup server? A simple approach would just be to throw away those changes on the backup server.

-- How much more work is this going to take? If my backup process is working, then the necessary data replication should already be working. It'd be additional work to get those settings into the right place; e.g., the config file for Apache that enables RT3 actually needs to be set up in Apache -- not just stored somewhere.

-- What do I really get? If I actually use the services on the backup server, then I can test to see if things are restorable. And in effect, we run the Restore process immediately after completing each backup.

-- What kinds of things can I combine on the backup server? In my case, I could easily restore and operate Subversion, WebDAV, ordinary web sites, RT, email, and DNS. It would be harder to replicate our VoIP servers, though, since they all listen on udp/5060 and only one of them could bind that port on a single box.

-- Is relocating an installed application really practical? RT, for example, requires tons of Perl modules, and all of them would have to be installed on the backup server too.

-- How does this idea compare to (a) using a good external RAID with a SAN, (b) putting each service on a separate filesystem, and (c) just re-mounting that filesystem elsewhere? The approach I described above (PMR) would work on any hardware, and it would easily let you run the backup system while the active system is still running.

-- What about movable virtual machines? E.g., set up each server as a VM; then take a snapshot of the VM, and restore it to a backup server in a replica network. This gives you the ability to test the Restore process while the primary system is still running. The catch is that migrating conventional servers to VMs is not necessarily easy.
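
A few of the points above (throwing away changes made on the backup server, spotting port clashes such as udp/5060, and testing right after each backup) can be sketched in a few lines. This is a rough illustration, not a finished tool: the service map, the names voip-a and voip-b, and all three function names are made up. The reset step only handles a file tree (a database like RT's would need a reload from its dump instead), and an open port only shows that something is listening, not that the restored data is right.

```python
import shutil
import socket
from collections import defaultdict
from pathlib import Path


def find_port_conflicts(services):
    """services maps name -> set of (protocol, port).

    Returns {endpoint: [names]} for endpoints wanted by more than one service.
    """
    claimed = defaultdict(list)
    for name, endpoints in services.items():
        for endpoint in endpoints:
            claimed[endpoint].append(name)
    return {ep: sorted(names) for ep, names in claimed.items() if len(names) > 1}


def reset_to_snapshot(snapshot_dir, live_dir):
    """Replace live_dir with a fresh copy of snapshot_dir, discarding local edits."""
    live_dir = Path(live_dir)
    if live_dir.exists():
        shutil.rmtree(live_dir)
    shutil.copytree(snapshot_dir, live_dir)


def tcp_port_answers(host, port, timeout=3.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical plan for what to run on the backup box.
    planned = {
        "apache": {("tcp", 80)},
        "dns": {("udp", 53), ("tcp", 53)},
        "voip-a": {("udp", 5060)},
        "voip-b": {("udp", 5060)},
    }
    print(find_port_conflicts(planned))
```

With the plan shown, the two VoIP entries are reported as clashing on udp/5060, while Apache and DNS can share the box. After each backup run, one would call reset_to_snapshot for each file-based service and then tcp_port_answers for each TCP port that is supposed to be up.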





If anybody would be interested in reviewing these and other approaches to modern backup/restore for a Usenix/LISA Paper or some such, let me know.
