Getting to know the Chaos Monkey

Moving your infrastructure to the cloud changes the way you think about a lot of things. With the attitude of abundance that comes with having unlimited instances at your command, you can do all sorts of cool things that would be prohibitive with actual hardware: elastic scaling of infrastructure, transient environments, blue/green deployments, etc. Some things that were just plain bad ideas with real servers have become best practices in the cloud – like just randomly turning off your production servers to see what happens.
One of the major concepts of working in the cloud is the idea of “designing for failure.” It’s mentioned in AWS’s Cloud Best Practices and in myriad blog entries. The main idea behind designing for failure is accepting that things are going to go wrong, and making sure your infrastructure is set up to handle that. But it’s one thing to say that your infrastructure is resilient; it’s quite another to prove it by running tools whose sole purpose is to tear your infrastructure apart.
There are a bunch of different tools out there that do this (including Stelligent’s Havoc), but probably the best known is Netflix’s Chaos Monkey. It’s available for free and is open source. On the downside, it’s not the easiest tool to get going, but hopefully this post can alleviate some of that.
Chaos Monkey is available on Netflix’s Simian Army GitHub page. Once targeted at an Auto Scaling Group (ASG), Chaos Monkey will randomly terminate EC2 instances, challenging your application to recover. Chaos Monkey is initially configured to operate only during business hours, letting you see how resilient your architecture is under controlled conditions, when you’re in the office, rather than finding out in the wild, when you’re asleep in bed.
The Chaos Monkey quick start guide shows you how to set up Launch Configs, Auto Scaling Groups, and SimpleDB domains using the AWS CLI tools. Depending on your amount of patience and free time, you might be able to make it through those. However, Netflix has another tool, Asgard, which makes setting up all those things a cinch, and we have a blog post that makes setting up Asgard a cinch, so for the purposes of this explanation, we’re going to assume you’re using Asgard.
As Chaos Monkey will be going in and killing EC2 instances, we highly recommend working with it in a contained environment until you figure out how you’d like to leverage it in your organization. At a minimum, set up a new Auto Scaling group; ideally, start out in an account that doesn’t host your production instances.
The first thing you need to do once you have Asgard set up is define an Application for it to use. Select the Apps menu and choose Create New Application. Create a new Standalone Application called MonkeyApp, enter your name and email address, and click Create New Application.
With your new application set up, you’ll need to create an auto-scaling group by going to the Cluster menu and selecting Auto Scaling Groups, and then hitting the Create New Auto Scaling Group button. Select monkeyapp from the application dropdown, then enter 3 for all the instance count fields (desired, min, max). The defaults are fine for everything else, so click Create New Auto Scaling Group at the bottom of the page.
Once the auto-scaling group is running, you’ll see it spin up EC2 instances to match your ASG sizing. If you were to terminate one of these instances manually, another instance would spin up in its place within a few minutes. In this way, you can be your own Chaos Monkey, inflicting targeted strikes against your application’s infrastructure.
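If you’d rather script that manual strike than click through a console, here is a minimal sketch, assuming the boto3 library and credentials for a non-production account. The function name terminate_random_instance is made up for illustration and is not part of Chaos Monkey or Asgard; the clients are passed in as arguments so the selection logic can be tried out without touching AWS.

```python
import random


def terminate_random_instance(autoscaling, ec2, group_name, rng=random):
    """Terminate one randomly chosen in-service instance of an Auto Scaling group.

    `autoscaling` and `ec2` are expected to behave like boto3 clients.
    Returns the terminated instance ID, or None if no instance was eligible.
    """
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    # Only consider instances that are currently serving traffic.
    instance_ids = [
        instance["InstanceId"]
        for group in response["AutoScalingGroups"]
        for instance in group["Instances"]
        if instance["LifecycleState"] == "InService"
    ]
    if not instance_ids:
        return None
    victim = rng.choice(instance_ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```

With real clients, the call would look like terminate_random_instance(boto3.client("autoscaling"), boto3.client("ec2"), "monkeyapp"), where monkeyapp is the group created above.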
Feel free to go give that a shot. Of course, why do anything yourself if you can just make the computer do it for you?
To set up Chaos Monkey, the first thing you’ll need to do is set up an Amazon SimpleDB domain for Chaos Monkey to use. In Asgard, it’s a cinch: just go to SDB and hit Create New SimpleDB Domain. Call it SIMIAN_ARMY and hit the Create button.
Now comes the finicky part of setting up Chaos Monkey on an EC2 instance. Chaos Monkey has a history of not playing well with OpenJDK, and overall getting it installed is more of an exercise in server administration than applying cloud concepts, so we’ve provided a CloudFormation template which will fast forward you to the point where you can just play around.
Once you have Chaos Monkey installed, you’ll need to make a few changes to the configuration to make it work:
vi src/main/resources/client.properties
Enter your AWS account and secret keys, and change the AWS region if necessary.
vi src/main/resources/simianarmy.properties
Uncomment the isMonkeyTime key and set it to true. By default, Chaos Monkey runs only during business hours; this setting overrides that restriction, which is handy because you may not be playing around with Chaos Monkey during business hours.
vi src/main/resources/chaos.properties
set simianarmy.chaos.leashed=false
set simianarmy.chaos.ASG.enabled=true
set simianarmy.chaos.ASG.maxTerminationsPerDay=100
set simianarmy.chaos.ASG.<monkey-target>.enabled=true
set simianarmy.chaos.ASG.<monkey-target>.probability=6.0
(Replace <monkey-target> with the name of your auto-scaling group, likely monkeyapp if you’ve been following the directions outlined above.) This is the fun part of the Chaos Monkey config. Setting leashed to false unleashes the Chaos Monkey; otherwise it would just say that it thought about taking down an instance, instead of actually doing it. The probability is the daily probability that it’ll kill an instance: 1.0 means an instance will definitely be killed at some point today, and 6.0 means an instance will be killed on the first run. And let’s knock up the max number of terminations per day so we can see Chaos Monkey going nuts.
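The 6.0 figure is easier to see with the arithmetic written out. This sketch assumes the daily probability is split evenly across the day’s runs, and that there are six runs a day (hourly runs over a six-hour window). Both of those are assumptions for illustration, not values taken from this post, so check the calendar and scheduler settings in simianarmy.properties for what your install actually uses.

```python
def per_run_probability(daily_probability, runs_per_day):
    """Chance of a termination on a single run, capped at certainty.

    Assumes the daily probability is divided evenly across all runs.
    """
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    return min(1.0, daily_probability / runs_per_day)


RUNS_PER_DAY = 6  # assumption: one run per hour over a six-hour window

for daily in (1.0, 6.0):
    print(f"daily={daily} per-run={per_run_probability(daily, RUNS_PER_DAY):.3f}")
```

Under those assumptions, a daily probability of 6.0 works out to a per-run probability of 1.0, which is why the very first run takes down an instance.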
It’s also probably a good idea to turn off Janitor and VolumeTaggingMonkey, since otherwise they’ll just clutter up the logs with messages saying they’re not doing anything at all.
vi src/main/resources/janitor.properties
vi src/main/resources/volumeTagging.properties
and set simianarmy.janitor.enabled and simianarmy.volumeTagging.enabled to false in the respective files.
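If you end up rebuilding your test environment a few times, editing these files by hand gets old. The sketch below shows one way to apply the same property changes from a script. The helper apply_overrides is a hypothetical name of our own, not something that ships with SimianArmy, and it only does simple key = value rewriting (uncommenting a key if it finds it commented out, appending it if it doesn’t find it at all).

```python
import re

# Matches "key = value" lines, including ones commented out with a leading '#'.
KEY_LINE = re.compile(r"^\s*#?\s*([\w.\-]+)\s*=")


def apply_overrides(text, overrides):
    """Return properties text with every key in `overrides` set to its new value."""
    seen = set()
    out = []
    for line in text.splitlines():
        match = KEY_LINE.match(line)
        key = match.group(1) if match else None
        if key in overrides:
            # Emit the override once; drop any later duplicates of the same key.
            if key not in seen:
                out.append(f"{key} = {overrides[key]}")
                seen.add(key)
            continue
        out.append(line)
    # Keys that never appeared in the file are appended at the end.
    out.extend(f"{k} = {v}" for k, v in overrides.items() if k not in seen)
    return "\n".join(out) + "\n"


# The chaos.properties changes from above, assuming the group is named monkeyapp.
CHAOS_OVERRIDES = {
    "simianarmy.chaos.leashed": "false",
    "simianarmy.chaos.ASG.enabled": "true",
    "simianarmy.chaos.ASG.maxTerminationsPerDay": "100",
    "simianarmy.chaos.ASG.monkeyapp.enabled": "true",
    "simianarmy.chaos.ASG.monkeyapp.probability": "6.0",
}

sample = "#simianarmy.chaos.leashed=true\nsimianarmy.chaos.ASG.enabled = false\n"
print(apply_overrides(sample, CHAOS_OVERRIDES), end="")
```

To use it on the real file, read src/main/resources/chaos.properties, pass its contents through apply_overrides, and write the result back. The janitor and volumeTagging settings can be handled the same way with their own dictionaries.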
Once you’ve configured everything, the following command will kick off the SimianArmy application:
./gradlew jettyRun
After bootstrapping itself, it should identify your auto-scaling group, pick a random instance in it, and terminate it. Your auto-scaling group will respond by spinning up a new instance.
But then what? Did your customers lose all the data on the form they just filled out, or were they sent over to another instance? Did their streaming video cut out entirely, or did quality just degrade momentarily? Did your application respond to the outage seamlessly, or was your customer impacted?
These are the kinds of issues Chaos Monkey will bring to the surface, showing you where you haven’t been designing for failure.
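One way to put a number on “was your customer impacted?” is to poll your application while Chaos Monkey is running and count how often it fails to answer. The sketch below is a minimal example of that idea and is not part of Chaos Monkey; the function names, the two-second timeout, and the one-second polling interval are all arbitrary choices of ours.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, and timeouts.
        return False


def measure_availability(probe, attempts, interval=1.0, sleep=time.sleep):
    """Call `probe` `attempts` times and return the fraction of successes."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for attempt in range(attempts):
        if probe():
            successes += 1
        if attempt < attempts - 1:
            sleep(interval)  # wait between probes, but not after the last one
    return successes / attempts
```

You would point it at your own endpoint, for example measure_availability(lambda: http_probe("http://your-app.example.com/"), attempts=300), where the URL is a placeholder. A result below 1.0 while instances are being terminated tells you that some requests were dropped rather than rerouted.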
(NOTE: When you’re all done playing around with Chaos Monkey, you’ll need to change your monkeyapp Auto Scaling Group instance counts to 0, otherwise AWS will keep those instances up, which could result in higher usage fees than you’re used to seeing. In Asgard, select Cluster > Auto Scaling Groups > monkeyapp > Edit and set all instance counts to zero, and AWS will terminate your test instances. If you’d like to come back later and play around, you can just shut down your Chaos Monkey and Asgard instances and turn them back on when you’re ready; otherwise you can just delete the CloudFormation stacks entirely and that’ll clean up everything for you.)
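If you’d rather not click through Asgard for the cleanup, the same change can be scripted. This is a minimal sketch, assuming boto3 and the monkeyapp group name used in this post; scale_to_zero is a hypothetical helper name, and the client is passed in so you can see exactly which call gets made.

```python
def scale_to_zero(autoscaling, group_name):
    """Set the min, max, and desired instance counts of an Auto Scaling group to 0.

    `autoscaling` is expected to behave like a boto3 Auto Scaling client.
    """
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=0,
        MaxSize=0,
        DesiredCapacity=0,
    )
```

With a real client, that would be scale_to_zero(boto3.client("autoscaling"), "monkeyapp"), after which AWS terminates the remaining test instances.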