This serves as a reminder that cloud services, while often thought of as a utility, actually aren't, and that the necessary redundancies, while seemingly built in, haven't been fully tested. That means anyone running critical services in the cloud should have a fall-back plan.
Nature of the Cloudburst
Note that there has been no security breach in this instance. This is a failure of service, very similar to what would happen if you had a power outage.
So while my premise is that the cloud is not as reliable as a utility, that isn't true everywhere: in much of the world, power is actually less reliable than Amazon's service is.
It does suggest a similar approach to the problem, though, and one consistent with any service where the reliability can't be adequately assured for the class of service the company requires.
In the case of power, if you need a higher level of reliability than the electrical utility can provide, you put in place backup generating capacity adequate to the task. That way you can assure the reliability you need even if the utility can't meet the requirement.
In fact, in some parts of the world it isn't that uncommon for companies to forgo the utility altogether and live off their own power generation capability. This is very similar to companies choosing a private over a public cloud solution for their business. And because most services can't yet provide the reliability needed for mission-critical applications, large enterprise providers like EMC, IBM, and HP continue to do great business with their private cloud offerings.
Solution: Redundancy
If you want to save money by using the public cloud for services that require a higher service level than you believe the provider can deliver, you have to provide the redundancy yourself. This isn't that different from having a hot backup site in case of a natural disaster.
And while it clearly will eat into the savings of using a cloud service, it also keeps you more intimate with the solution and likely provides a much better path if you need to switch cloud providers. In short, you can fail over to the backup system, switch providers, and put the new service into test, leaving the backup system as primary until you're ready to cut over. It could actually give you added flexibility in your choice of solution providers.
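To make the failover idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how any particular provider or product works: `call_with_failover`, `public_cloud`, and `backup_site` are hypothetical names, and the simulated outage is hard-coded. The only point is the ordering: try the public cloud first, and fall back to a system under your own control when it fails.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when neither the primary nor any backup could serve the request."""


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllBackendsFailed(errors)


def public_cloud() -> str:
    # Hypothetical primary: simulates the provider being down.
    raise ConnectionError("primary cloud service unavailable")


def backup_site() -> str:
    # Hypothetical backup system under your own control.
    return "served from backup"


result = call_with_failover([public_cloud, backup_site])
print(result)  # prints "served from backup"
```

Because the caller only sees an ordered list of backends, swapping in a different cloud provider later means changing one entry in that list while the backup keeps serving.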
Cloud Computing Airbags
Think of the cloud failover solution like airbags in a car: they are expensive components, required in every automobile, that are rarely used and often painful when they are, but a lot less painful than being tossed through the window or having your chest or head crushed.
The failover solution doesn't have to take the entire peak load of the business, just enough of that load to keep the company operating until the cloud service recovers. And over time these services will themselves improve in redundancy and recovery speed.
What you want to avoid is what is happening today, with companies being partially or completely shut down. Customers will stay with you if they face wait times (at least they will if they understand the problem is short-lived), but you'll lose them if they can't connect at all.
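The "wait times instead of no connection" point can be sketched the same way. The class below is a hypothetical illustration, not a real product: `DegradedBackup`, its `capacity_per_tick` parameter, and the request names are all assumptions. It models a backup sized below peak load that never refuses a request and simply serves the queue at whatever rate it can manage.

```python
from collections import deque
from typing import Deque, List


class DegradedBackup:
    """Backup sized below peak load: serves a fixed number of requests
    per tick and makes the rest wait instead of rejecting them."""

    def __init__(self, capacity_per_tick: int) -> None:
        self.capacity_per_tick = capacity_per_tick
        self.waiting: Deque[str] = deque()

    def submit(self, request_id: str) -> None:
        # Never refuse a connection; the customer waits in line instead.
        self.waiting.append(request_id)

    def tick(self) -> List[str]:
        """Serve up to capacity_per_tick queued requests, oldest first."""
        served: List[str] = []
        while self.waiting and len(served) < self.capacity_per_tick:
            served.append(self.waiting.popleft())
        return served


backup = DegradedBackup(capacity_per_tick=2)
for request_id in ["r1", "r2", "r3", "r4", "r5"]:
    backup.submit(request_id)

print(backup.tick())  # prints ['r1', 'r2']
print(backup.tick())  # prints ['r3', 'r4']
print(backup.tick())  # prints ['r5']
```

Every customer in this sketch eventually gets served; the later ones just wait longer, which is the trade the backup is meant to buy you.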
Like any other critical service, having a fall-back plan that is tested and ready to execute can make you and your group look brilliant when others are failing.
And that alone is worth the added cost.
Wrapping Up: A Warning
Use this Amazon failure as a warning that cloud services are not bulletproof and that they can fail at any time, for the same reasons any complex system can fail. When they do, they will take many, most, or all of their customers offline with them.
Design in redundancies with adequate failover and you'll look like a hero when something like this happens. And while this comes with a cost, it is a vastly lower cost than having your CEO wonder whether the IT department needs new leadership.
In the end, the Amazon failure is a reminder to us all that systems, even cloud services, need redundancies within our control, and that failing to put those redundancies in place can be career limiting.