The first step is to have a local copy of the data. Think about backing up a significant amount of data directly to the cloud: this isn’t going to be a quick process. Recovery directly from the cloud is just as unpleasant – what is the transfer time for 60GB of data? (Backup admins call this the “mean time to recover.”) One starts, then, with a backup system that has nothing to do with offsite storage, one that simply generates a copy of the data locally.
I approach this step with two different tools. I use the rsync program to generate copies of data on the Linux server (initially also stored on the server), with changes/updates to the data preserved in a tree of directories by date. This lets me follow versions of files, so that I can fairly easily find a copy of a file from, say, a month ago even if it was updated several times since.
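Retrieving an old version from this scheme is just a filesystem search. A minimal sketch, assuming GNU find and dated history directories like the ones described below (the file name here is only an example):

```shell
# List every archived copy of a file across the dated history trees,
# newest modification time first. $1 = history root, $2 = file name.
find_versions() {
    find "$1" -type f -name "$2" -printf '%T@ %p\n' | sort -rn | cut -d' ' -f2-
}

# Example (my paths; 'budget.xls' is a hypothetical file name):
# find_versions /data/Backup/Mirror-history 'budget.xls'
```

Because each dated directory only contains files that changed that night, the list is short even after months of runs.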
For about 15 years, as a Mac user, I’ve used a backup application called Retrospect. Retrospect underwent a serious renovation two years ago to bring it up to date with advances in Mac OS X; it also supports Windows as both backup client and server, and Linux as a client. This isn’t necessarily an endorsement – the current version isn’t problem-free – but it is what I use for my Macs, and it has some advantages.
Retrospect backs up data to tape, optical media, or any other sort of storage, and includes metadata about the files themselves in the backup (the “output” – the destination of the backed-up data – is referred to as a “media set”). Optionally, Retrospect can compress and encrypt the media sets; it manages incremental backups, and keeps media set sizes manageable by processing them to remove more than x versions of a file. In my case, Retrospect stores the media sets on the Linux server, so these need to be efficiently mirrored to the cloud.
In terms of the specifics of my setup, there are two sets of data to be mirrored, each with different requirements:
Rsync’d data is stored on the Linux server in /data/Backup/Mirror, along with another tree, /data/Backup/Mirror-history, which holds version subdirectories by date (for example, /data/Backup/Mirror-history/08Jan11). The rsync command that generates this is run by cron every night, and looks like this:

rsync -a --backup --backup-dir=/data/Backup/Mirror-history/$DATE --delete /data/common/files /data/Backup/Mirror/

When the command is run, $DATE is replaced by the current date. /data/common/files is the directory with our home folders and application data. This command mirrors /data/common/files into /data/Backup/Mirror; files that were updated between successive runs, or were deleted from /data/common/files, are copied to /data/Backup/Mirror-history/$DATE.
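The nightly cron job can be a small wrapper script that computes $DATE before invoking rsync. A sketch under my setup’s paths (the date format matches the 08Jan11 example; the guard against a missing source is my own addition):

```shell
#!/bin/sh
# Nightly mirror wrapper, run from cron.
SRC=/data/common/files
MIRROR=/data/Backup/Mirror
HISTORY=/data/Backup/Mirror-history
DATE=$(date +%d%b%y)    # e.g. 08Jan11: day, abbreviated month, 2-digit year

# Skip the run if the source is missing (e.g. an unmounted volume),
# so --delete can't empty the mirror by mistake.
if [ -d "$SRC" ]; then
    rsync -a --backup --backup-dir="$HISTORY/$DATE" --delete \
          "$SRC" "$MIRROR/"
fi
```

A crontab entry such as `30 2 * * * /usr/local/bin/mirror.sh` would run this at 02:30 every night.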
The Linux server also stores the media sets from Retrospect (the Retrospect server is a Mac, writing the media sets to the Linux server) at /data/Backup/Retrospect. Retrospect writes backed-up data in 600MB “chunks” that are compressed, encrypted, and include catalog metadata (status information about the files, and a snapshot of the system state at the time of the backup). A weekly grooming process, initiated by Retrospect, deletes and/or rewrites some of these chunks, and a separate catalog records which files were backed up when.
The rsync data isn’t encrypted, and there are a large number of files (about 12,000 at this writing); it’s simply a mirror of files in a filesystem, and it’s very transparent. The version information is equally transparent, stored in the Mirror-history directories, so the mirroring process is straightforward. The Retrospect data is encrypted, and there are only about 140 media set chunks – but these chunks are generally 600MB each, and because of the metadata overhead the entire media set gets regenerated a couple of times per year.
So the approach taken to mirror the rsync data is to use rsync itself, given the small file sizes and the fact that rsync’s operation – being simpler than Retrospect’s – can be easily verified. Conceptually, the Retrospect media set chunks could be treated the same way, but because they are already encrypted, and because of their size, I use a tool that efficiently determines whether chunks in the cloud have been updated, runs n (currently 4) concurrent transfers, and performs the transfers only late at night/early in the morning to avoid contention on my Internet connection.
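The core of that chunk-transfer idea can be sketched in shell, under stated assumptions: compare an MD5 of each local chunk against a manifest of what was last uploaded, and push only the changed chunks, four at a time. Here `upload_chunk` is a placeholder for whatever cloud transfer command is actually in use (the real implementation is the subject of Part 4):

```shell
# My paths; the manifest location is an assumption for this sketch.
CHUNKS=/data/Backup/Retrospect
MANIFEST=/var/tmp/retrospect-uploaded.md5

# Print the path of every chunk whose checksum is absent from the manifest.
changed_chunks() {
    touch "$MANIFEST"
    for f in "$CHUNKS"/*; do
        [ -f "$f" ] || continue
        sum=$(md5sum "$f" | cut -d' ' -f1)
        grep -qF "$sum  $f" "$MANIFEST" || echo "$f"
    done
}

# 4 concurrent transfers; 'upload_chunk' is hypothetical:
# changed_chunks | xargs -P 4 -n 1 upload_chunk
```

Since each chunk is 600MB, checksumming locally is far cheaper than re-uploading unchanged data.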
Part 3 of this series talks about how I implemented rsync to Amazon’s cloud, and Part 4 discusses the programming to transfer the Retrospect chunks.