Fun in the Cloud – Backing Up – Part 4

Simple Sync to the Cloud

Keeping backup sets in sync

The last post described my implementation of rsync into Amazon Web Services (AWS), with storage on the Simple Storage Service (S3) and an rsync target server supported by Elastic Compute Cloud (EC2).

So, that’s it, right?  This can support any sort of data; along with a mirroring function it should be sufficient.  And that would be correct, but we’re not done yet.

A couple of elements of the rsync-based solution aren’t required for keeping backup sets synchronized.  I wanted a second system, simpler and more efficient, for data that is more compatible with the structure of S3.

This post covers keeping copies of backup sets synchronized with the cloud.  The backup sets in this case are from Retrospect, but this approach covers any similar “output from a backup application”, such as tar files or Bacula volumes.  Here’s why this is a different chore than mirroring using rsync:

  • rsync copies data that might be in different formats, but treats all of it as “data” that needs to be encrypted for transport and storage.  Backup sets are already encrypted, so this level of protection isn’t needed (i.e., it’s redundant).
  • Regular data (the majority of files on a system) can be compressed to varying degrees of success, and rsync is effective at this.  Since backup sets are already encrypted, they do not compress at all.  Running these files through compression and encryption is a waste of time and resources.
  • Normally, in a filesystem there is significant variation in the size of files, with some large and many small.  Backup sets are aggregations of filesystem files, but are themselves made of “chunks” that are generally as large as possible in order to improve efficiency.  Additionally, backup sets generally include catalog data (metadata describing the attributes of the included files, and often snapshot data describing the state of a filesystem at the time of backup), which also increases the size of the chunks of a backup set.
  • There is value in being able to look at something that appears to be a filesystem when examining rsync’d data.  However, backup applications generally produce opaque backup sets: all you see are some number of regularly-named files, most of which will be the same size.  So, having a filesystem-like target for these things isn’t important.

It turns out that the S3 method of storing data as objects in buckets is natively well-suited for backup sets.  Retrospect produces 600MB backup chunks, for example, and that’s a reasonable size object to manage in this way.

As mentioned previously, there isn’t a simple way to “mount” an S3 filesystem to our local system, and there are good reasons that we might want to avoid it if we could.  So, I went looking for other ways to simply transfer chunks to and from S3 objects natively, without exposing a filesystem-like interface.

This is actually not all that difficult to do in something like Python or Java.  However, I was thinking in terms of finding a utility that would work rather than actual programming.  (I’m going to refer to coding this level of functionality in Python as “programming” rather than “scripting”, since to my mind it’s a different approach to the problem.)

As I had been researching interfaces in Java, however, I came across the jets3t (pronounced “jet set”) project, from James Murty.  This was developed as an API for Java to function with S3, but as a bonus comes with applications that can be called from the command line to perform synchronization to an S3 bucket.  I use the synchronize application.

As a scripting project, this was fairly simple, though perhaps not trivial.  I’d be happy to share the script on request.  Here are some of the things that I either learned or had to take into account in the process of executing this project:

Don’t shoot yourself in the foot.

Most importantly, the synchronize application from jets3t has the ability, like rsync, to mirror the destination to match the source by deleting objects at the destination that don’t exist on the source.   Mr. Murty warns about this and provides options to the synchronize script to do a “dry run” of the process, but that wasn’t enough for me to avoid deleting tens of GB of data on S3 – more than once.  That’s my fault, not his; your benefit from this story is reinforcement of the old adage to “measure twice, cut once”.
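One way to make “measure twice” mechanical is to have the wrapper script refuse to continue when a dry run proposes more deletions than expected.  The following is a minimal sketch in Python, not part of jets3t: the function name, the exception, and both thresholds are hypothetical, and it assumes the wrapper has already extracted the list of objects the dry run would delete.

```python
class TooManyDeletions(Exception):
    """Raised when a planned sync would delete more than the allowed amount."""


def check_deletions(planned_deletes, remote_count, max_fraction=0.05, max_count=10):
    """Abort if a dry run proposes deleting too many remote objects.

    planned_deletes: object keys the dry run says would be deleted
    remote_count:    number of objects currently in the bucket
    Both limits are illustrative defaults, not recommendations.
    """
    n = len(planned_deletes)
    if n == 0:
        return
    # Treat an unknown or empty remote count as the worst case.
    fraction = n / remote_count if remote_count else 1.0
    if n > max_count or fraction > max_fraction:
        raise TooManyDeletions(
            f"dry run wants to delete {n} of {remote_count} objects "
            f"({fraction:.0%}); refusing without manual confirmation"
        )
```

A check like this costs nothing on a normal night, when no objects or only a handful are removed, and it turns a mistyped source path into an error message instead of a bill.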

Don’t do it all at once.

I have over 100GB of backup sets that I synchronize to S3.  So, one of the first things I thought about was that I didn’t want to have our outbound connectivity saturated with this traffic for however many hours the initial upload was going to require.  My script dispatches subprocesses to call synchronize one chunk (file) at a time, for a definite duration (330 minutes, in my case) starting at 2330, six nights per week – the seventh being used for maintenance by the backup system itself as described below.  Just for fun, I have a random wait function, so the process doesn’t start at the exact same moment each day.
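The time window and the random wait can be sketched roughly as follows.  This is an illustration under assumptions rather than the actual script: the function names and the ten-minute upper bound on the wait are made up, and the deadline is measured from the scheduled start so that a longer wait shortens the run instead of extending it.

```python
import random
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=330)   # run length mentioned above
MAX_JITTER_SECONDS = 600          # assumed upper bound for the random wait


def plan_window(start, rng=random):
    """Return (actual_start, deadline) for one night's run.

    The deadline is fixed relative to the scheduled start, so a longer
    random wait shortens the run rather than pushing it into the morning.
    """
    jitter = timedelta(seconds=rng.uniform(0, MAX_JITTER_SECONDS))
    return start + jitter, start + WINDOW


def take_until_deadline(chunks, deadline, now=datetime.now):
    """Yield chunks one at a time, stopping once the deadline has passed.

    A transfer already in progress is allowed to finish; only the start
    of new transfers is blocked.
    """
    for chunk in chunks:
        if now() >= deadline:
            break
        yield chunk
```

With a scheduled start of 2330, `plan_window` gives a deadline of 0500 the next morning.  Whatever has not been handed out by then simply waits for the following night.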

Don’t run one stream at a time.

I learned through observation that while the process is efficient, multiple processes are essential to use the bandwidth efficiently.  Another way of putting this is in contrast to the previous point: if you’re going to use all the bandwidth, then use all the bandwidth.  I found in my experimentation that four to six processes running concurrently would use all of my available outbound capacity.

This raises a question: why bother writing code to manage multiple streams inside a limited window when a single stream, left to run until the job is complete, would never saturate the link or disturb other traffic?  The answer is twofold: first, I wanted the backups transferred as quickly as possible without affecting other traffic (since the goal is to have the data there, not here), and second, I didn’t know how this would scale until I tried it.
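Running a fixed number of transfers side by side needs very little code in Python.  The sketch below is one possible shape, not the script described here: `transfer_all` and `build_command` are hypothetical names, and `build_command` stands in for however the sync tool is invoked for a single file, which is deliberately not shown.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def transfer_all(chunks, build_command, workers=4):
    """Run one transfer subprocess per chunk, at most `workers` at a time.

    build_command(chunk) must return the argv list for a single-file
    transfer; it is a placeholder for the real sync command.
    Returns a list of (chunk, returncode) in input order.
    """
    def run_one(chunk):
        result = subprocess.run(build_command(chunk), capture_output=True, text=True)
        return chunk, result.returncode

    # Threads are enough here: each worker just waits on its subprocess.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, chunks))
```

The `workers` value is the knob to tune.  Start low and raise it until the outbound link is full; on my connection that point was somewhere between four and six.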

Logging is important

Most of my script is devoted to producing a readable report of what happened.  This process runs six days per week, and I need to be able to tell at a glance that things are okay.  I hope this is the sort of process that I can just forget about – but I can only do so if I have enough visibility into it to spot issues quickly, and otherwise simply see that the process completed correctly.
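As an illustration of what “readable at a glance” might mean, here is a small Python sketch that turns per-file results into a few summary lines.  The tuple layout, the status names and the report wording are all assumptions made for the example; the real report contains more detail.

```python
from collections import Counter


def format_report(results, started, finished):
    """Build a short plain-text summary from (name, status, bytes) tuples.

    status is one of "ok", "failed", or "skipped" (names are assumptions).
    Failures are listed by name so problems are visible at a glance.
    """
    counts = Counter(status for _, status, _ in results)
    sent = sum(size for _, status, size in results if status == "ok")
    lines = [
        f"Sync run {started} -> {finished}",
        f"  ok: {counts['ok']}  failed: {counts['failed']}  skipped: {counts['skipped']}",
        f"  bytes sent: {sent}",
    ]
    failed = [name for name, status, _ in results if status == "failed"]
    if failed:
        lines.append("  FAILED: " + ", ".join(failed))
    else:
        lines.append("  all transfers completed")
    return "\n".join(lines)
```

Putting the counts first and the failures last means a healthy night always looks the same, so anything unusual stands out.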

Maintaining Control is sometimes Work

synchronize is a terrific tool.  Thank you, James Murty!  However, for the reasons I’ve mentioned, I wanted to use it on a file-by-file basis, and that required extra attention.  So, for example, I generate a list of changes to be made by performing a “dry run” of the synchronization, then taking the results and dispatching synchronize to move each piece.  If you have different constraints, you may decide to do things a bit differently.
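The “list of changes, then one dispatch per change” idea can be shown without relying on any particular tool’s dry-run output.  The Python sketch below swaps in a different technique: it compares a local directory listing against a remote listing that is assumed to be available as a name-to-size mapping.  It is a stand-in for the dry run, not a description of how synchronize decides what to do, and the function names are hypothetical.

```python
import os


def local_listing(directory):
    """Map each regular file in `directory` to its size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def plan_changes(local, remote):
    """Compare two {name: size} listings and return (uploads, deletes).

    Size is a crude change signal used only to keep the sketch short;
    a real comparison would use a content hash.
    """
    uploads = sorted(n for n, size in local.items() if remote.get(n) != size)
    deletes = sorted(n for n in remote if n not in local)
    return uploads, deletes
```

Each name in `uploads` then becomes one single-file transfer, and `deletes` is the list that deserves a second look before anything is removed.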

More on Retrospect

This process isn’t static.  As I mentioned, once a week Retrospect performs maintenance on the backup sets, called “grooming”.  Since Retrospect has a catalog of every file backed up, it can maintain the backup sets according to policies like “only keep x versions of a file”.  This is useful for ensuring that the backup sets do not simply grow without limit.

Once Retrospect has performed grooming, the directory in which the backup sets are stored will look different: there will be some deleted files (chunks for which the original contents have been aged out or otherwise reorganized), as well as chunks that are no longer 600MB in size, and added chunks.  Of course, all of this is under the control of Retrospect (and is opaque).

Once the initial collection of backup sets has been synchronized to S3 there isn’t necessarily a lot of activity most nights of the week.  However, after the groom has run there will be a flurry where many of those chunks – files in the synchronized directory – will be rewritten to S3 or deleted from S3.  Being averse to “undefined results”, I choose not to have my synchronize process run concurrently with Retrospect’s grooming.

Finally, I have always been keenly aware of the trade-off between backup size and time to restore when choosing between full and incremental backups.  In brief, the time to restore a particular file or system to a particular state goes up as one has more and more incremental backups layered on the base (full) backup.  In addition, the size of the backup sets goes up as well.

That’s why developers have implemented grooming (and its cousin, de-duplication).  Previously, I have managed backup regimens of a full backup every x days, followed by incremental backups.  If x is 30 days, for example, one could be sure that to recover a system or file only the full and up to 30 daily backups would need to be searched and/or processed, and that the total size of the backup set would only be the size of the original backed-up data plus the changes over 30 days.  Then, you’d want to start all over with another full backup – keeping the previous full+incremental set for some period of time.

With grooming, however, you define a policy (“x versions of a particular file”) and the system maintains that policy.  Variations of this can prevent the backup sets from growing faster than the data itself grows.  This is a Good Thing.
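To make the “x versions of a particular file” policy concrete, here is a toy Python version of it.  Retrospect’s real grooming is opaque and works on its own catalog, so this only illustrates the policy itself; the `groom` function and the (path, timestamp) representation are invented for the example.

```python
from collections import defaultdict


def groom(versions, keep=3):
    """Return the (path, timestamp) entries to keep under a keep-N policy.

    versions: iterable of (path, timestamp) pairs, one per backed-up version
    keep:     number of most recent versions to retain for each path
    """
    by_path = defaultdict(list)
    for path, stamp in versions:
        by_path[path].append(stamp)
    kept = []
    for path in sorted(by_path):
        # Newest first, then cut the list off at the policy limit.
        for stamp in sorted(by_path[path], reverse=True)[:keep]:
            kept.append((path, stamp))
    return kept
```

A file that changes every day contributes at most `keep` versions no matter how long the backups have been running, which is why the set stops growing faster than the data does.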

One proviso, though: we’ve gotten out of the mental habit of thinking of backup media as “expensive”.  Sometimes, given issues with a backup application such as Retrospect, you may have to “discard” the backup set because of an inability to recover from an error.  Unfortunately, that does happen, and when it does there is quite a bit of time involved in re-exporting the chunks of the backup set to the cloud, as well as the cost in data transfer charges.

So, I don’t trust it completely.  Partly that’s because it’s opaque – in order to see the files, you need the catalog (or you need to recreate the catalog from the backup sets) as well as the application itself.  It’s difficult to test, too – bringing back everything is going to tax capacity in all sorts of ways, as well as being time-consuming to perform and verify.  So, as a postscript, my practice is to:

  • Take one snapshot of the data per month and create another copy of the system at that point.  That doesn’t get stored in the cloud; it’s historical and simply a belt along with the suspenders we already have.  It goes to my least-favorite-but-cheap USB drive, which holds as many of these as possible.
  • At some undetermined time, likely in the summer (after a year of this process) I will reset the S3 storage of the backup sets, and start over.  I’ll keep the old (current) set around until I’m comfortable with the new set (and, of course, after the new set has fully transferred to S3).

Having a process where there are multiple levels of redundancy is a good idea.  It’s even better when there is physical redundancy – and this seems like the best solution to that problem so far.
