Backing Up via rsync to the Cloud
Amazon’s Simple Storage Service (S3)
Ah, an interesting creature, this S3.
If you read the developer’s information for S3, it strikes you as an interesting way to upload and download “objects” of indeterminate size into “buckets”… but it’s not a filesystem like we’re used to. And you can’t (natively) mount (or attach) it to a host.
Here are some of the most significant differences:
- An “object” is roughly analogous to a file. However, objects can only be written and read, not updated in place; to update an object, you rewrite it. Until recently, objects were limited to 5GB in size; they can now be much larger, but must be uploaded in parts of no more than 5GB.
- A bucket could be like a filesystem. Buckets, if named correctly, can be accessed under a non-Amazon domain name if a DNS CNAME record is set for the bucket.
- There are no directories within a bucket. However, there is support for naming objects with a POSIX path that can be referenced as if it were in a directory structure (like an object named
- AWS objects can be accessed by normal browsers. The bucket name can be prepended to s3.amazonaws.com and the file reference (the object name) can be appended, to get an object like
- Amazon offers support for Access Control Lists, expiring data access, logging, and creating objects in particular regions of the planet to improve redundancy and decrease access time.
- Finally, although one can use browsers with GET and POST commands to access AWS, Amazon offers full control over the facility with REST and SOAP protocols. Third-party developers have created solutions based on these protocols.
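To make the browser-access point concrete, here is a minimal sketch of how such a URL is put together: bucket name in front of s3.amazonaws.com, object name after it. The function name and the bucket and object names are made up for illustration, and the sketch does no URL-encoding of the object name.

```shell
#!/bin/sh
# Build a browser-style URL for an S3 object:
#   http://<bucket>.s3.amazonaws.com/<object name>
# s3_object_url is a hypothetical helper, not part of any AWS tool.
s3_object_url() {
    bucket=$1   # bucket name, e.g. examplebucket (hypothetical)
    key=$2      # object name, which may look like a POSIX path
    printf 'http://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}
```

Calling `s3_object_url examplebucket photos/cat.jpg` prints the address a browser would fetch, provided the object's access controls allow it.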
The point to all of this is, of course, “how do you rsync to it?” And, as I indicated before, the answer is you can’t… directly.
I researched several ways to implement this. The simplest option seemed to be a package called s3fs. s3fs, as you might expect from the name, allows you to “mount” an S3 bucket as a filesystem. It’s a pretty active project, and for my initial implementations of rsync’ing into S3 it seemed okay… but.
I had some problems with it. As I said, it’s a pretty active project, but it seemed to me that I was too often seeing errors, commands failing that should have worked, and the feeling that it wasn’t necessarily the right approach to the problem.
There’s another solution that seems somewhat more complicated in design, but on thinking it through actually looked like a better strategy. The s3backer project takes a different approach to S3: rather than treating objects as files, it implements a store where each object is (accessed like) a disk block. Using FUSE (as does s3fs), s3backer exposes what appears to be a disk device that you format and mount with the filesystem of your choosing. In addition, the project developers have added code to support encryption, again encrypting “blocks” without regard to the higher-level filesystem on top of the device.
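The block idea is easier to see with a little arithmetic. This is only a sketch of the concept, not s3backer’s actual code: a byte offset on the virtual device falls into exactly one fixed-size block, and each block corresponds to one object. The function names are hypothetical, and the sizes in the usage note assume binary units (64k = 65536 bytes).

```shell
#!/bin/sh
# Conceptual sketch only: map a position on a block-based virtual
# device to the block (and therefore the object) that holds it.

# Index of the block containing a given byte offset.
block_index() {
    offset=$1       # byte offset into the virtual device
    block_size=$2   # block size in bytes
    echo $(( offset / block_size ))
}

# Number of blocks needed to cover a device of a given size,
# rounding up when the size is not a multiple of the block size.
block_count() {
    device_size=$1  # device size in bytes
    block_size=$2   # block size in bytes
    echo $(( (device_size + block_size - 1) / block_size ))
}
```

For example, `block_index 1048576 65536` reports that the byte at the 1MB mark lives in block 16, and a 200GB device with 64k blocks works out to 3276800 blocks in total. Only blocks that have actually been written need to exist as objects.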
Okay, so, I could install s3backer, and “mount” my s3 bucket on my server, and away I go. But then…
I had focused on some of the costs involved in backing up to AWS S3, but began to be bothered by some that I hadn’t really thought of. For instance, there are costs for data in and data out of S3. It occurred to me that there were a couple of reasons why I might not want to use s3backer on my own local host.
First, disk i/o is cheap on a local system, and I presume the ext3 project (the group responsible for the Linux filesystem I use) has put some effort into optimizing for that case. With s3backer, i/o at that level moves onto the network, which is unusual. rsync, on the other hand, is optimized for use across a network. I’d really rather be writing my data across the Internet using rsync than using block updates.
Second, s3backer struck me as a relatively complex package (given its dependencies) that didn’t fit well with how I manage my own Linux server. I prefer to manage that server as much as possible as a “production” machine, and even a well-behaved package like s3backer has more interdependencies (support requirements for other packages) than I wanted to deal with, especially if I ever had to reinstall everything on a replacement host in order to recover backups from S3. I really just wanted to be able to use rsync to get my stuff back if it came to that.
So, this gave me an opportunity to delve into another facet of AWS, Amazon’s EC2 (Elastic Compute Cloud). This offering from Amazon gives you access to a virtual host system running your choice of OSs (including Linux, Windows, and even OpenSolaris); they are pre-configured, and can be started and stopped as needed. However, like S3, EC2 provides services somewhat differently than what one traditionally encounters.
Here’s the plan: set up an EC2 instance to run s3backer, and serve as the destination for rsync. This way, EC2 to S3 via s3backer traffic is “within AWS”, meaning that it’s very likely to be fast, and Amazon only charges for data in and out of AWS itself (assuming you are moving data internally in the same region).
I used an EC2 “small” size instance, and decided on Ubuntu (not my normal choice) because it was easiest to set up s3backer in this environment. Following the examples in the AWS EC2 Getting Started Guide, I set up my keys and instance, along with restricting any IP access to several of my known/stable IP addresses. Alas, at this time Amazon doesn’t support IPv6.
As with the rest of this series, I’m leaving implementation of the details to the reader. However, this project didn’t turn out to be very difficult. Here is my checklist for my instance setup:
- Start: using Ubuntu AMI (10.04 LTS)
- Initial installation, customize OS, perform apt-get update/upgrade
$ sudo apt-get update
$ sudo apt-get upgrade
- Install prerequisites for s3backer.
$ sudo apt-get install libcurl4-openssl-dev libfuse-dev libexpat1-dev build-essential emacs22 ddclient
- Get and install s3backer
$ wget 'http://s3backer.googlecode.com/files/s3backer-1.3.1.tar.gz'
$ tar xzf s3backer-1.3.1.tar.gz
$ cd s3backer-1.3.1
$ ./configure
$ more INSTALL
$ make
$ sudo make install
- Add AWS credentials to the s3backer credentials file, .s3backer_passwd
- Mount s3backer filesystem
$ # Create general mountpoint for FUSE FS
$ sudo mkdir /s3_target
$ sudo chown ubuntu /s3_target
$ # Enable FUSE kernel module
$ sudo modprobe fuse
$ # Start s3backer (bucket kmp.examplebucket already created)
$ sudo s3backer --blockSize=64k --size=200g --listBlocks kmp.examplebucket /s3_target
$ # Create filesystem, mount
$ sudo mkfs -t ext4 -V /s3_target/file
$ sudo mkdir /s3fs
$ sudo mount -o loop /s3_target/file /s3fs
- Complete housekeeping: set up startup script on instance to start s3backer.
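For the last two housekeeping items, something along these lines could serve as a starting point. It is a sketch under assumptions, not the script I actually run: the function names, the DRY_RUN switch, and the default directory variables are all hypothetical, and the s3backer options simply repeat the ones from the checklist. The credentials file is assumed to hold a single accessID:accessKey line.

```shell
#!/bin/sh
# Sketch of a startup helper for the instance (hypothetical names).
# With DRY_RUN=1 the commands are printed instead of executed.

BUCKET=${BUCKET:-kmp.examplebucket}      # bucket from the checklist
BACKING_DIR=${BACKING_DIR:-/s3_target}   # where s3backer exposes "file"
MOUNT_DIR=${MOUNT_DIR:-/s3fs}            # where the filesystem is mounted

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Write an "accessID:accessKey" line to the given credentials file.
# The umask makes the new file readable by its owner only.
write_credentials() {
    file=$1
    access_id=$2
    access_key=$3
    ( umask 077; printf '%s:%s\n' "$access_id" "$access_key" > "$file" )
}

# Load FUSE, start s3backer, then loop-mount the filesystem it exposes.
start_s3backer() {
    run modprobe fuse
    run s3backer --blockSize=64k --size=200g --listBlocks "$BUCKET" "$BACKING_DIR"
    run mount -o loop "$BACKING_DIR/file" "$MOUNT_DIR"
}
```

Running `DRY_RUN=1 start_s3backer` prints the three commands without touching anything, which is a cheap way to check the paths before wiring the script into the boot sequence. A script started at boot runs as root, so no sudo is needed, but the credentials file then has to live in root’s home directory.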
This is basically the process I needed to go through to set up my instance. Here are the second-phase tasks:
- Installing the EC2 API tools (along with the AMI tools, although I don’t think that was necessary for this project), for functionality to manage the instances.
- Creating a script using ec2-describe-instances and ec2-start-instances to start the instance when needed. Yes, it’s only $0.08/hour, and storage is only $0.15/GB/month, but it all does add up!
- Creating the script to rsync data to the S3 filesystem via the instance.
- Creating the script to shut down the instance when the rsync was complete.
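The three scripts above could be combined into one cycle roughly like this. Treat it as a sketch: the instance ID, host name, and paths in the usage note are placeholders, the function names and the DRY_RUN switch are my own invention, and the running-state check simply looks for the word “running” in the ec2-describe-instances output rather than parsing specific fields.

```shell
#!/bin/sh
# Sketch: start the instance if needed, rsync to it, then stop it.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Succeeds if the EC2 API tools report the instance as running.
# Assumption: "running" appears in the output only as the state.
is_running() {
    ec2-describe-instances "$1" | grep -qw running
}

backup_cycle() {
    instance_id=$1   # EC2 instance ID
    host=$2          # name the instance answers to
    src=$3           # local directory to back up
    dest=$4          # directory on the mounted S3 filesystem

    # In dry-run mode the state check is skipped and the start
    # command is always shown.
    if [ "${DRY_RUN:-0}" = 1 ] || ! is_running "$instance_id"; then
        run ec2-start-instances "$instance_id"
        # A real script would wait here until ssh on the host answers.
    fi

    run rsync -az -e ssh "$src" "ubuntu@$host:$dest"
    status=$?

    # Stop the instance even if rsync failed, so it does not keep billing.
    run ec2-stop-instances "$instance_id"
    return $status
}
```

A dry run such as `DRY_RUN=1 backup_cycle i-12345678 backup.example.org /home/ /s3fs/home/` shows the start, rsync, and stop commands in order. Note that it stops the instance rather than terminating it.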
There’s also the question of routine maintenance of the instance, as well as monitoring it. There’s more to be done here, but at this initial point I have working rsync into S3 with only $0.08/hour of added cost overhead while the instance is running. I think this makes sense.
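To see how “it all does add up”, here is a rough estimate using the two prices quoted in the task list ($0.08 per instance-hour and $0.15 per GB per month). The function name is hypothetical, and the sketch ignores data transfer and per-request charges entirely.

```shell
#!/bin/sh
# Rough monthly estimate: storage plus instance time only.
# Prices are the ones quoted in the text; transfer and request
# charges are deliberately left out.
monthly_cost() {
    stored_gb=$1        # GB actually stored in S3
    instance_hours=$2   # hours the EC2 instance runs per month
    awk -v gb="$stored_gb" -v hrs="$instance_hours" \
        'BEGIN { printf "%.2f\n", gb * 0.15 + hrs * 0.08 }'
}
```

With the full 200GB stored and the instance up for about an hour a day (an assumed figure), `monthly_cost 200 30` prints 32.40: $30.00 for storage and $2.40 for the instance.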
The checklist is actually important, since I don’t really care about the instance itself. I did have to figure out the difference between “stopping” and “terminating” an instance (hint: the latter command will allow you to get very good with your checklist!).
At the end of this process, I have a target to send backups to “the cloud” using rsync. I can start up and shut down this host using either calls to the EC2 API in a shell script, or using Amazon’s AWS Console. The transmission of data across the Internet is encrypted, uses a protocol that we all have lots of experience with, and is efficient; and the use of s3backer is transparent to my local systems.
Next: the piece that keeps the rest of my data (Retrospect backup sets) in sync, what I’ve learned, and what I think I would do to improve the process in the future.