Most of my professional career has revolved around data. I’ve been the designated recipient of sad tales from colleagues (and friends and family) about hours of work lost, either to hardware failure or to software failure. And just like anyone who brings work home, I’ve been determined, perhaps far too obsessively, to try to protect myself.
Of course, the ultimate extension of such thinking is “Disaster Recovery” – the eventuality that even the physical location storing data might be destroyed. So, for the past ten years or so, I’ve angled to store a copy of my “stuff” somewhere else.
Now, we call that somewhere else “The Cloud”. And the best-known of these service providers is Amazon – specifically Amazon Web Services (AWS). AWS is pay-as-you-go, offering storage services, compute services in many flavors, and even Internet infrastructure services like the Domain Name System (DNS) and, now, email.
Provisioning these services, especially those most likely to be used directly by software engineers, is interestingly different from running on hardware you provide yourself, at home or at work. All of these services are architected to scale – enormously.
And, on a unit basis (say, a gigabyte of storage, or hour of virtual server time) AWS services are really quite inexpensive. Of course, one may be using quite a few units, and so costs do add up. But, for example, the services that I use are $0.15/GB/month for disk storage, and $0.08/hour for a small virtual server instance… so experimentation is reasonably priced.
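To get a feel for how those unit rates add up, here is a rough back-of-the-envelope sketch. The rates are the ones quoted above, and the workload figures (50 GB stored, an instance running an hour a night) are purely illustrative, not a statement of current AWS pricing:

```python
# Rough monthly cost estimate from the per-unit rates quoted above.
# Rates and workload figures are illustrative assumptions only.

STORAGE_PER_GB_MONTH = 0.15   # storage, $/GB/month
INSTANCE_PER_HOUR = 0.08      # small virtual server instance, $/hour

def monthly_cost(gb_stored, instance_hours):
    """Estimate one month's bill for storage plus server time."""
    return gb_stored * STORAGE_PER_GB_MONTH + instance_hours * INSTANCE_PER_HOUR

# e.g. 50 GB of backups, with a server running 1 hour per night:
print(f"${monthly_cost(50, 30):.2f}")  # 50*0.15 + 30*0.08 = 7.50 + 2.40
```

About ten dollars a month for that workload, which is what makes this kind of experimentation reasonably priced.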
I decided that I wanted to keep copies of my data on Amazon’s cloud. I had several goals when this idea arrived in my head:
- I wanted to learn about the interfaces that AWS uses. Amazon actually developed the AWS APIs for their own use, and I thought it would be interesting to understand a bit more about how one provisions at this scale.
- I wanted an efficient way to protect my data from the unlikely possibility that, well, the house might blow up. I had a few different criteria for “efficient”, mainly that there are a couple of scenarios that I wanted to consider – which I’ll explain below.
- I was curious about whether the bandwidth available to me as a “home” consumer through my ISP was enough to put an offsite storage plan into practical use.
- I wanted to engineer a process that was automated to the extent that I could forget about it for the most part. Implicit in this is that I want to have a level of self-monitoring, so that I can easily tell if something has gone awry.
- I wanted storage for my data that was very highly reliable. Actually, I should be forthright and say that what I wanted was to avoid what I consider not highly reliable solutions: external disk drives.
- Finally, to protect against user error (and, hey, that user is me, and I certainly make mistakes), the backup system should keep multiple copies of data. That is, I want to make sure that if I discover today that a file was deleted or corrupted a week ago, I have a copy from a week ago – or longer – along with newer copies.
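One simple way to get that last kind of protection (a hypothetical sketch, not necessarily the scheme I ended up with) is to date-stamp each backup copy and prune only copies older than some retention window:

```python
# Keep date-stamped backup copies; prune only those older than RETAIN_DAYS.
# The directory layout and helper names here are hypothetical.
import os
import shutil
from datetime import date, timedelta

RETAIN_DAYS = 30

def backup_name(base):
    """Name today's copy with an ISO date stamp, e.g. photos-2011-03-14."""
    return f"{base}-{date.today().isoformat()}"

def prune_old(backup_dir, base, today=None):
    """Remove copies whose date stamp is more than RETAIN_DAYS old."""
    today = today or date.today()
    cutoff = today - timedelta(days=RETAIN_DAYS)
    for name in os.listdir(backup_dir):
        if name.startswith(base + "-"):
            stamp = date.fromisoformat(name[len(base) + 1:])
            if stamp < cutoff:
                shutil.rmtree(os.path.join(backup_dir, name))
```

Because pruning only ever removes copies past the cutoff, a mistake noticed today can still be repaired from any copy made within the window.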
How I got here, and what I experienced along the way, is the subject of the posts that follow. I’ll start with the question of where the initial backups go locally, how, and why, and then talk about my experiences with Amazon’s Simple Storage Service (S3) and Elastic Compute Cloud (EC2). More next week…