Outages of Amazon Web Services’ Simple Storage Service (S3) happen from time to time, which is exactly why you need a business recovery strategy.
Your business relies on your data being available when and where you want it. So what would your strategy be if AWS, or any other cloud service for that matter, had an outage? Amazon prides itself on giving you access to your data whenever you need it, including during outages. For instance, Amazon recently claimed that it has never lost an entire datacenter. That seems like a rare situation, to say the least, but whether or not Amazon is being entirely forthcoming, it still has failsafes in place in case an entire large datacenter does go down.
One such incident happened recently, on February 28th, 2017. Many businesses were reminded that “the cloud is just someone else’s computer,” as an outage in Amazon Web Services’ Simple Storage Service (S3) impacted sites across the internet. This is why you need a business recovery strategy.
For businesses whose sites were impacted, it meant loss of sales during the time the sites were unavailable, or perhaps loss of productivity until functionality was restored. But in the minds of many CIOs, CTOs, and IT professionals alike, it brought questions: “What if it were my site? Am I doing enough to ensure that my business continues to operate in the event of an outage?”
Moving to the Cloud Alone Isn’t a Panacea
Cloud service providers advertise SLAs (service level agreements) of “99-point-very-many-nines” uptime, but even a brief outage of a few minutes can cause an SLA miss. These advertised SLAs, combined with cost savings on hardware, real estate, and some IT staff, can make a move to the cloud very attractive from a financial perspective. As the S3 incident showed, simply moving systems or data offsite doesn’t guarantee 100% incident-free time. Proper business recovery planning requires an understanding of where your data resides, an understanding of your redundancy needs, and a plan to mitigate potential impacts if a service provider has an incident.
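To see why a few minutes can matter, it helps to turn an availability percentage into a downtime budget. The sketch below is only arithmetic: the function name and the percentages are illustrative, not figures from any provider's actual SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


# Illustrative availability levels only; check your provider's real SLA terms.
for pct in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows about {budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the yearly budget by a factor of ten, so at the higher levels a single incident lasting a few minutes can use up the whole allowance.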
Return on Investment (ROI)
How much protection is enough? How much is too much? Measuring return on investment is always an exercise in metrics and statistics. Understanding the value of a system being available or unavailable can help determine how much is worthwhile to spend on a recovery strategy.
A blog site may not be worth a recovery strategy, but a blog site that generates revenue through advertisements would lose money in an outage. The cost of lost business should be weighed against the cost of investing in higher-tiered business recovery strategies.
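That comparison can be sketched as a small calculation. Everything here is a hypothetical simplification: the function names, the idea of a single "downtime reduction" fraction, and all of the dollar figures are assumptions made for illustration.

```python
def expected_annual_loss(outage_hours_per_year: float, revenue_per_hour: float) -> float:
    """Revenue expected to be lost to outages in a year."""
    return outage_hours_per_year * revenue_per_hour


def net_annual_benefit(outage_hours_per_year: float, revenue_per_hour: float,
                       downtime_reduction: float, annual_strategy_cost: float) -> float:
    """Loss avoided by a recovery strategy minus what the strategy costs.

    downtime_reduction is the fraction of outage time the strategy is
    assumed to eliminate (0.0 = none, 1.0 = all of it).
    """
    if not 0.0 <= downtime_reduction <= 1.0:
        raise ValueError("downtime_reduction must be between 0 and 1")
    avoided = expected_annual_loss(outage_hours_per_year, revenue_per_hour) * downtime_reduction
    return avoided - annual_strategy_cost


# Hypothetical numbers: the same strategy applied to two very different sites.
ad_blog = net_annual_benefit(8, 50, 0.75, 1200)
online_store = net_annual_benefit(8, 5000, 0.75, 1200)
print(f"Ad-supported blog: {ad_blog:+.2f} per year")
print(f"Online store:      {online_store:+.2f} per year")
```

A negative result suggests the strategy costs more than the downtime it prevents, while a positive one suggests it pays for itself. A real analysis would also account for harder-to-measure costs such as lost productivity and reputation.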
Analysis of business recovery strategies should also include figuring out the recovery time objective and recovery point objective (RTO and RPO, respectively). RTO analysis determines how long the business could tolerate an outage before service is restored. RPO analysis determines how much data, measured as a span of time before the incident, the business could tolerate losing.
That blog site from the previous example may have a low threshold for recovery time because of the lost ad revenue. The data on the site, however, may only change once a week, so the site has a higher threshold for recovery point if the fastest recovery procedure is a restore from backup. Consider both factors in the analysis to determine how much protection is enough.
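One way to make the two objectives concrete is to compare them with what a backup-and-restore plan can actually deliver. This is a minimal sketch under simplifying assumptions: the `RecoveryPlan` type is hypothetical, worst-case data loss is taken to be one full backup interval, and worst-case recovery time is taken to be the restore duration alone.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    backup_interval: timedelta   # time between backups (worst-case data loss)
    restore_duration: timedelta  # time needed to restore from the latest backup


def meets_objectives(plan: RecoveryPlan, rto: timedelta, rpo: timedelta) -> dict:
    """Check a plan against a recovery time and recovery point objective."""
    return {
        "rto_met": plan.restore_duration <= rto,
        "rpo_met": plan.backup_interval <= rpo,
    }


# Hypothetical blog: content changes weekly, but ad revenue stops during downtime.
blog_plan = RecoveryPlan(backup_interval=timedelta(days=1),
                         restore_duration=timedelta(hours=4))
result = meets_objectives(blog_plan, rto=timedelta(hours=1), rpo=timedelta(days=7))
print(result)
```

With these made-up numbers, daily backups comfortably satisfy a one-week recovery point, but a four-hour restore misses a one-hour recovery time. That is the same shape as the blog example: the recovery time, not the recovery point, is where the investment would need to go.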
Multi-Region Redundancy
The AWS S3 incident was limited to the region named US-EAST-1, and regions are segregated from one another. Options exist to use a region-specific endpoint (e.g. https://s3-eu-west-1.amazonaws.com), but if the default endpoint (https://s3.amazonaws.com) is used, requests are routed through the US-EAST-1 region for redirection to the correct endpoint. So although the incident was limited to US-EAST-1, a site that relied on that redirect may have been affected even if its data was stored in another region.
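The difference between the two kinds of endpoint can be sketched with a small helper. The function below is hypothetical and simply mirrors the two URL forms quoted above; endpoint naming varies by region and has changed over time, so confirm the correct form against the AWS documentation before relying on it.

```python
from typing import Optional

# The default endpoint quoted above, which relies on redirection.
DEFAULT_ENDPOINT = "https://s3.amazonaws.com"


def s3_endpoint(region: Optional[str] = None) -> str:
    """Build an S3 endpoint URL in the style of the examples above.

    With no region, fall back to the default endpoint. With a region,
    return a region-specific endpoint that does not depend on the redirect.
    """
    if region is None:
        return DEFAULT_ENDPOINT
    return f"https://s3-{region.lower()}.amazonaws.com"


print(s3_endpoint())             # default endpoint
print(s3_endpoint("eu-west-1"))  # region-specific endpoint
```

Configuring the application with the region-specific URL is one way to avoid depending on the default endpoint's redirect.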
Because regions are segregated from one another, you could have taken steps to restore service if your site had cross-region replication set up on its S3 buckets, all objects had been replicated, and you were able to redirect the application to a different S3 bucket.
Since the actual root cause is not yet known, it’s not possible to determine whether this would have restored service faster, but there are likely many options for faster recovery if you have an IT team that knows your systems and their layout, even those in the cloud.
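Redirecting an application to a replica bucket can be sketched as an ordered list of fallbacks. This is not AWS SDK code: `fetch_with_failover` is a hypothetical helper, and the two fetch functions are stand-ins for whatever client calls your application really makes against its primary and replica buckets.

```python
from typing import Callable, Sequence


def fetch_with_failover(key: str, fetchers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each fetcher in order and return the first successful result."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # broad for the sketch; narrow this in real code
            last_error = exc
    raise RuntimeError(f"all buckets failed for {key!r}") from last_error


# Stand-ins for real storage clients (assumptions for illustration).
def primary(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


replica_store = {"index.html": b"<html>replica copy</html>"}


def replica(key: str) -> bytes:
    return replica_store[key]


body = fetch_with_failover("index.html", [primary, replica])
print(body)
```

The fallback only helps if the replica actually holds the object, which is why the replication and testing caveats matter as much as the redirect itself.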
Proactive Testing
The previous statements regarding cross-region replication contained a lot of “ifs”.
- “If” you had cross-region replication set up
- “If” all objects had been replicated
- “If” the application can be redirected
The only way to know if your team’s recovery strategies are going to work is by testing the recovery procedures often enough to have confidence in them. You don’t want to be in a recovery situation trying to figure it out for the very first time.
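Part of such a test can be automated by turning each “if” into a check. The sketch below assumes you can already obtain the list of object keys in each bucket and the result of a redirect test by other means; the function names and sample keys are hypothetical.

```python
from typing import Iterable, List


def unreplicated_keys(primary_keys: Iterable[str], replica_keys: Iterable[str]) -> List[str]:
    """Return keys present in the primary bucket but missing from the replica."""
    return sorted(set(primary_keys) - set(replica_keys))


def drill_failures(replication_enabled: bool, primary_keys: Iterable[str],
                   replica_keys: Iterable[str], redirect_ok: bool) -> List[str]:
    """Evaluate the three 'ifs' and describe every one that does not hold."""
    failures = []
    if not replication_enabled:
        failures.append("cross-region replication is not set up")
    missing = unreplicated_keys(primary_keys, replica_keys)
    if missing:
        failures.append(f"{len(missing)} object(s) not replicated: {', '.join(missing)}")
    if not redirect_ok:
        failures.append("application could not be redirected to the replica bucket")
    return failures


# Hypothetical drill results.
problems = drill_failures(
    replication_enabled=True,
    primary_keys=["index.html", "logo.png", "posts/1.html"],
    replica_keys=["index.html", "posts/1.html"],
    redirect_ok=True,
)
for problem in problems:
    print("FAILED:", problem)
```

An empty list means all three conditions held at the time of the drill. It says nothing about next month, which is why the drill needs to be repeated regularly.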
Redundancy as Part of a Business Recovery Strategy
You’ve heard the saying “don’t put all your eggs in one basket.” Depending on the criticality of your data, you may not want to store all of it with a single cloud provider either. Using multiple providers is likely both costly and complex, so return to that ROI calculation to determine how much you would benefit from having your data stored with more than one.
AWS, Microsoft Azure, Rackspace, and others offer enterprise-class storage options. Another option is a hybrid cloud/on-premises solution. Yes, you may have gone to the cloud to get rid of the on-premises systems, but depending on your ROI metrics, a hybrid recovery solution may make more financial sense than spreading your data across multiple cloud providers.
Assess Your Situation
Downtime for your systems can result in loss of business, revenue, or productivity, all of which equate to real dollars and cents. No matter where your systems reside, the S3 outage is a reminder to assess your business recovery strategy and procedures, to ensure that the damage is minimized in the event of an incident. Don’t wait until it’s too late.