We are a few weeks after the big Amazon outage. As is appropriate, but as not many companies do, Amazon provided a detailed explanation of the chain of events that caused the service disruption, which lasted more than 24 hours. It is interesting reading and will become a case study for many operational teams across the globe. You can read the gory details on their site: Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region.
In short, plain English, the trigger of the event was a configuration change. A very common cause for things to break is that someone is trying to improve them, hence the old saying "if it ain't broken, don't fix it". Having said that, sometimes even when everything is going well you still need to make changes. In this case, Amazon was working on a data center capacity increase. As part of the change, somehow, someone made a mistake in the configuration of the network. It may have been a simple human error, or it may have been something else, but as a result of this mistake, network traffic was diverted to secondary servers designed for limited burst requirements, with limited capacity and high latency. That simple error caused the primary servers to hit internal timeouts. As a result, automatic recovery procedures kicked in on a massive scale: primary servers started searching for new backup capacity and re-mirroring their content. In "normal" error cases this is the right action to take, but here, where most of the primary servers started doing this concurrently, it was a catastrophe. The entire data center stopped responding and entered a kind of death spiral. It took Amazon's engineers considerable time to understand the full extent of the issue and figure out a way to return to normal operation. The outcome, as we all know, was a major shutdown of many services, including Foursquare, Reddit and even Salesforce's Heroku. One source claims that up to 150 sites were impacted to various degrees. Other services, like Netflix, that have invested in a multi-region architecture did not see the same impact and remained largely operational through the entire episode.
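The dynamics described above can be illustrated with a toy model. This is a hypothetical sketch, not Amazon's actual EBS recovery logic: each volume that loses contact with its mirror grabs one slot of spare capacity to re-mirror, so a handful of isolated failures is absorbed easily, while a network event that strands most volumes at once exhausts the spare pool and leaves the rest stuck searching.

```python
def simulate_storm(n_volumes: int, spare_slots: int, failed_fraction: float) -> int:
    """Return how many volumes remain stuck after one re-mirroring round.

    Hypothetical model: every volume that lost its mirror tries to claim
    one slot from a shared pool of spare capacity.
    """
    stuck = int(n_volumes * failed_fraction)  # volumes that lost their mirror
    recovered = min(stuck, spare_slots)       # each re-mirror consumes a slot
    return stuck - recovered                  # volumes still searching

# Normal case: isolated failures are easily absorbed by spare capacity.
print(simulate_storm(n_volumes=10_000, spare_slots=500, failed_fraction=0.01))  # → 0

# Storm case: a bad network change makes most volumes re-mirror at once,
# the spare pool is exhausted, and thousands of volumes stay stuck.
print(simulate_storm(n_volumes=10_000, spare_slots=500, failed_fraction=0.9))   # → 8500
```

The point of the toy model is the non-linearity: the same recovery behavior that is healthy at 1% failure becomes self-defeating when triggered everywhere at once.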
One of the first things to notice is the complex interdependency of web services. It is a kind of butterfly effect: a failure in one service ripples outward and causes a major shutdown of many other services. These are implications that consumers are not aware of, and I doubt that even the service providers (and their shareholders and insurance companies) understand them to the full extent. A service provider often consumes services from other service providers, which in turn consume services from yet other providers, perhaps even creating circular loops of dependencies. This is done for the right reasons, letting providers concentrate on their core competencies and gain financial efficiencies along the way. But it also creates a really complex network of dependencies between the vendors. Amazon is an extreme case, as it is perhaps the largest provider of Infrastructure-as-a-Service and thus impacts many more services. By the way, this incident affected just one of the data centers Amazon runs (granted, also one of its biggest); one can only imagine what would happen in case of a more major failure that brings down the entire Amazon network. It looks impossible, but things that can go wrong will, at some point, go wrong. It could happen as a result of a catastrophe, or an act of cyber terror (e.g., by seeding malicious code across the data centers). Just stop and think about the implications of something like that on our everyday life.
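The ripple effect is really just graph reachability. A minimal sketch, using a made-up dependency map loosely modeled on the incident (the service names and edges are illustrative assumptions, not actual architectures): a failure impacts every service that transitively depends on the failed node, including services that never talk to it directly.

```python
def impacted_by(failure: str, depends_on: dict) -> set:
    """Return all services transitively affected when `failure` goes down."""
    # Invert the edges: for each provider, who directly consumes it?
    consumers = {}
    for svc, providers in depends_on.items():
        for p in providers:
            consumers.setdefault(p, set()).add(svc)
    # Walk outward from the failed node; the visited set also guards
    # against the circular dependency loops mentioned above.
    impacted, frontier = set(), [failure]
    while frontier:
        node = frontier.pop()
        for svc in consumers.get(node, ()):
            if svc not in impacted:
                impacted.add(svc)
                frontier.append(svc)
    return impacted

# Hypothetical dependency map: which providers does each service rely on?
deps = {
    "heroku":   {"aws-us-east"},
    "reddit":   {"aws-us-east"},
    "some-app": {"heroku"},       # touches AWS only indirectly, via Heroku
    "netflix":  {"aws-us-west"},  # multi-region: a different data center
}
print(sorted(impacted_by("aws-us-east", deps)))
# → ['heroku', 'reddit', 'some-app']
```

Note that "some-app" goes down without ever consuming AWS directly, while the multi-region service is untouched, which is exactly the pattern the incident exposed.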
And now for the "lessons learned". There are already numerous blogs and articles in which service providers describe their war stories from this incident, what they have learned from it and what they are going to change. Services that were disrupted, such as Heroku, wrote about the incident from their perspective and highlighted things they have already marked for change to make their services more stable in the future (see Heroku's blog about the incident). Even services that saw little or no impact, such as Netflix, wrote about their experience and lessons learned (see Netflix's blog about the incident).
Overall, it was a painful event, but one with many good lessons. The service providers as an industry demonstrated maturity, were open about the issues and seem to be actively thinking about ways to improve the reliability of their services. This openness was a by-product of the consumer community, which showed amazing patience and empowered the provider community to be open about the incident and properly reflect on what happened and how to make sure it will not happen again. At the end of the day, I think there is room for some optimism: in the long run, this incident may actually prove to be a positive thing that makes the cloud even better and stronger.