I really liked reading the Netflix techblog post on a subject that is most dear to my heart – testing your data center's auto-disaster-recovery capabilities.
Most of today’s services are designed to withstand the unexpected and deliver “always-on” quality of service, or in service-provider terminology, a Service Level Agreement (SLA) that often guarantees 99.9% service availability – in simple terms, roughly 43 minutes of downtime per month. If you are a service provider, this translates to building a data center that can automatically recover from any unexpected scenario across any dimension, ranging from a hard disk crash in a single machine to a full data-center power outage. Every layer needs high-availability and disaster-recovery mechanisms so operation really never stops.
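The downtime arithmetic behind an SLA is simple enough to sketch. Here is a small Python helper (the function name and the 30-day month assumption are mine, not from any standard):

```python
def allowed_downtime_minutes(availability: float,
                             period_minutes: float = 30 * 24 * 60) -> float:
    """Downtime budget for a given availability over a period.

    Defaults to a 30-day month (43,200 minutes). Rounded to 0.1 min.
    """
    return round((1.0 - availability) * period_minutes, 1)

# 99.9% ("three nines") over a 30-day month:
print(allowed_downtime_minutes(0.999))    # 43.2 minutes
# 99.99% ("four nines") for comparison:
print(allowed_downtime_minutes(0.9999))   # 4.3 minutes
```

Each extra “nine” cuts the monthly budget by a factor of ten, which is why the recovery machinery described next has to be automatic rather than manual.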
This is complex and very expensive, and when you are finally done, you have to do one more thing – test it. Not once, but periodically, and after every software and hardware update. And continue to test it. Even after you do all of that, you still don’t know for sure, because by definition your test is somewhat theoretical. To test disaster recovery, most organizations plan an annual or quarterly test. The test is done during a low-activity window, such as a weekend at 3 AM. To prepare for it, the entire data-center team is present, including engineers, and the team rehearses the particular scenario being tested, so that if something goes wrong during the test, the impact is minimal. The challenge is that by doing all of that, the test becomes largely theoretical. In real life, when something goes wrong, you do not know up front what will fail, you do not have time to practice for the specific scenario, the operations team does not have the benefit of the engineering group standing by to support the recovery, and finally, bad things do not happen only on weekends, only at 3 AM, and only when the system is at an all-time low in activity.
In real life it is all unexpected – it happens at peak activity, just when the chief architect is on vacation rafting somewhere and no one can reach him for the next critical 12 hours, and the operations team has just compounded the situation with a human error, misconfiguring a router. And so on. That is the fundamental reason why, after all the careful planning and annual testing, we still have service disruptions far too often.
If this is the challenge, what is the solution? The Netflix team came up with a brilliant idea that got them a lot closer to understanding how the system reacts to a real, unexpected disaster. It all started with a question – what if a crazy monkey were running loose in the data center? It would crash random machines, disconnect cables, unplug entire racks, misconfigure routers (it would have to be a very smart monkey, I give you that). How would the auto-recovery systems hold up? What would the operations team see, and how would they react?
There is only one way to know for sure. Let’s get a monkey and let it do its thing. The Netflix team did just that! They developed a software bot, called it Chaos Monkey, and periodically let it run wild in the data center. From their blog: “Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact“. They were so pleased with the success of the Chaos Monkey and what they learned from it that they created a whole family of Monkeys, each a bot programmed to test a different aspect of data-center resilience. For example, “Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone” and “Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load)“. More details are at the Netflix techblog under The Netflix Simian Army. Well, that is just brilliant!
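The core of a Chaos-Monkey-style bot is surprisingly small: enumerate running instances and randomly terminate one. The sketch below is my own illustration, not Netflix's implementation – in a real deployment the fleet inventory and the termination would go through the cloud provider's API, and the hypothetical `pick_victim` helper would be wrapped in scheduling, opt-out lists, and business-hours guards:

```python
import random

def pick_victim(instances):
    """Randomly choose one running instance to kill, chaos-monkey style.

    Returns None if nothing is running (nothing to break).
    """
    running = [i for i in instances if i["state"] == "running"]
    return random.choice(running) if running else None

# Hypothetical inventory; a real bot would query the cloud API instead.
fleet = [
    {"id": "i-001", "state": "running"},
    {"id": "i-002", "state": "running"},
    {"id": "i-003", "state": "stopped"},
]

victim = pick_victim(fleet)
if victim is not None:
    # A real bot would call the provider's terminate-instance API here.
    print(f"Terminating {victim['id']}")
```

The point of running this in production, on a schedule, is exactly the argument above: the failure arrives unannounced, so the auto-recovery machinery and the on-call team get exercised under realistic conditions rather than in a rehearsed 3 AM drill.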