AWS Had A Failure And The World Survived

** Disclaimer: I work for Microsoft **

You may think this post is to smear AWS, but it’s absolutely not.

Yesterday you may have thought the world was going to collapse, as AWS had an S3 storage outage. The good thing is that everybody survived the outage and can move on with their work. Looking at my social feeds, though, one would think this outage made cloud the scariest thing since the Gremlins movie. This outage should not make you scared of cloud, because you know what? AWS S3 has absolutely had a great history of uptime. What you should take from these types of outages is that if you don’t build applications with resiliency in mind, your application may fail. It doesn’t matter whether your applications are on-premises or cloud based. I know AWS will take this failure as a big learning opportunity, just as I know Microsoft has always taken failures as learning opportunities. But we should all take this as an opportunity to educate ourselves on what it means to build resilient applications.

One of the things that pissed me off the most was seeing all the ambulance chasing and FUD slinging by vendors and people. I worked in the storage and infrastructure industry for many years, and I can tell you: if you think your own infrastructure provides better uptime, you are probably lying to yourself. We are dealing with hardware, software, and people… That combination ensures there will be a failure at some point in the lifecycle. Yes, people all make mistakes.

The lesson to learn here is that if uptime is important to your application, you need to design for failure. You have to put HA\DR in your design principles and decide whether the cost of that protection is worth the downtime it avoids. You also need to start considering that automation is not just about speed, but also about reducing risk.
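To make "design for failure" a little more concrete, here is a minimal sketch of one common pattern: try a primary copy of your data, and fall back to a secondary copy if the primary is unavailable. Everything in it is hypothetical and for illustration only; `read_with_failover`, `primary`, and `secondary` are made-up names, and the two fetch functions are stand-ins for whatever storage clients you actually use, not a real AWS or Azure API.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every configured replica failed to serve the request."""


def read_with_failover(key: str, replicas: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each replica in order and return the first successful read."""
    errors = []
    for fetch in replicas:
        try:
            return fetch(key)
        except Exception as exc:  # any failure moves us on to the next replica
            errors.append(exc)
    raise AllReplicasFailed(f"all {len(errors)} replicas failed for {key!r}")


# Hypothetical stand-ins for real storage clients in two regions.
def primary(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def secondary(key: str) -> bytes:
    return b"data for " + key.encode()


result = read_with_failover("report.csv", [primary, secondary])
```

The second copy is where the cost comes in: you pay to keep it and to keep it in sync, and that is exactly the trade-off you have to weigh against the downtime it saves you.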

If you take this failure to mean that cloud sucks, then you’re doing yourself a disservice.