For many people, the cloud is a magical place where servers just appear and the cloud provider looks after everything. Even those who do have a concept of the servers often just assume that the provider will back them up too. Lots of people never bothered to think about protection in a VMware environment, so why start now?
Unfortunately, while your cloud provider probably supplies the tools, you still need to do the configuration and management. In this blog, I won’t go into specifics about tools; instead, I’ll start by discussing the concepts around backups. I’ll discuss these concepts in relation to AWS, but the IaaS elements are roughly the same in Azure.
Oh crap! I didn’t mean to blow away that server!
This is the moment when you really need to know whether your backups are configured correctly. By that, I don’t mean “are they working”, but whether they are backing up the right servers, at the right times, with the right frequency. There are two big buzz terms when designing backup policies: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
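As a rough illustration of the “right servers, right frequency” check, here is a minimal sketch in Python. It assumes you can produce two simple mappings yourself: the longest acceptable gap between backups for each server, and the interval actually configured. The function name `audit_backups` and the server names are hypothetical and not part of any AWS tool.

```python
def audit_backups(required_hours, configured_hours):
    """Compare what each server needs against what is actually configured.

    required_hours:   server name -> longest acceptable gap between backups (hours)
    configured_hours: server name -> configured gap between backups (hours)
    Returns (servers with no backup at all, servers backed up too rarely).
    """
    missing = sorted(s for s in required_hours if s not in configured_hours)
    too_infrequent = sorted(
        s for s, limit in required_hours.items()
        if s in configured_hours and configured_hours[s] > limit
    )
    return missing, too_infrequent


# Hypothetical inventory: the database needs hourly backups but only gets nightly ones,
# and the file server was never added to a backup policy.
required = {"app-1": 24, "db-1": 1, "files-1": 24}
configured = {"app-1": 24, "db-1": 24}
missing, too_infrequent = audit_backups(required, configured)
print("No backup configured:", missing)
print("Backed up too rarely:", too_infrequent)
```

The point is not the code itself but the habit: compare what you think is protected against what is actually configured before you need a restore.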
RTO is the “how quickly can I get my data back” question, and it governs how you back up your data. AWS natively has snapshots for EBS volumes, plus backup options for RDS, DynamoDB, Aurora, and Redshift. There is also Amazon Data Lifecycle Manager (DLM) for automating volume snapshots. There are all sorts of options in this space: using the native tools; rolling your own with Lambda, Step Functions, SSM, etc.; or buying backup products from vendors that use their own methods or put better frontends around the AWS APIs. While tools like DLM and Lambda functions are cheap and may back up well, restoring from them may not be as easy, and hence as fast, as you need.
RPO is the “how much data can I afford to lose” question. For things like app servers, or even file servers, your standard nightly backup is often sufficient. For infrastructure like database servers, some logging servers, or anything holding time-sensitive data, more frequent backups are often required. Often it’s a mix of the two, e.g. a full nightly backup of the database plus hourly archive log backups, or even writing the archive logs to a secondary location such as S3 as each one completes.
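To make the RPO question concrete, here is a small Python sketch that works out the worst-case data loss from a list of backup completion times and compares it with a target RPO. The function names and the timestamps are hypothetical; in practice you would pull the times from whatever tool performs your backups.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Largest gap between consecutive backups, including the gap from the last backup to now."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times, now, rpo):
    """True if a failure at any moment up to `now` would have lost no more than `rpo` of data."""
    return worst_case_data_loss(backup_times, now) <= rpo


# Hypothetical example: hourly archive log backups against a one-hour RPO.
now = datetime(2024, 1, 1, 3, 30)
hourly = [datetime(2024, 1, 1, h, 0) for h in (1, 2, 3)]
print(meets_rpo(hourly, now, timedelta(hours=1)))

# The same check with only a nightly backup, late in the following evening.
nightly = [datetime(2024, 1, 1, 1, 0)]
print(meets_rpo(nightly, datetime(2024, 1, 1, 23, 0), timedelta(hours=1)))
```

A nightly-only schedule fails a one-hour RPO for most of the day, which is why the database example above mixes full backups with more frequent log backups.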
RTO and RPO are vital to your DR, or operational recovery. They answer the questions of how fast you will get your infrastructure back and what point in time your data will be restored to. The next set of questions is about historical data recovery. This is a mixture of questions around regulations, business processes, and cost. If your industry has certain legal requirements, those need to be allowed for. The longer you keep data, the more it’s going to cost. How you keep your data also has an impact on cost. Is it in snapshots, is it on EBS, is it in some virtualised dedupe device, or is it in S3/Glacier?
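To get a rough feel for how retention drives cost, here is a back-of-the-envelope sketch in Python. It assumes incremental backups (one full copy plus the daily changed data for each extra day kept) and a flat price per GB-month. The function names are hypothetical, and the price in the example is a made-up placeholder, not real AWS pricing.

```python
def retained_gb(full_gb, daily_change_gb, retention_days):
    """Approximate storage held: one full copy plus one day's changes per extra day kept."""
    return full_gb + daily_change_gb * max(retention_days - 1, 0)


def monthly_cost(stored_gb, price_per_gb_month):
    """Flat-rate storage cost; real pricing has tiers and extra charges this ignores."""
    return stored_gb * price_per_gb_month


# Hypothetical workload: 500 GB volume, 10 GB of changes per day.
PLACEHOLDER_PRICE = 0.05  # assumed price per GB-month, for illustration only
for days in (7, 30, 365):
    gb = retained_gb(500, 10, days)
    print(days, "days:", gb, "GB,", round(monthly_cost(gb, PLACEHOLDER_PRICE), 2), "per month")
```

Swapping in your own change rate and your provider's actual prices for each storage tier makes it easier to argue for, or against, a given retention period.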
In summary, backing up data is still the customer’s responsibility, even for data in S3. If you are going to back up your data, it is worth taking the time to plan it properly and to review that plan regularly. Think of it as business insurance, along the lines of: if I lost this data, could the business survive?