Netflix Chaos Monkey and Traditional vs. Cloud Operations Mindset
Netflix has written a lot about how it uses Amazon Web Services to operate its infrastructure. I've found its development and use of the Chaos Monkey (and its proposed vision of a "Simian Army") particularly interesting. The basic premise is that all systems fail eventually, so the Chaos Monkey is an automated tool that intentionally disrupts the infrastructure on a regular basis by terminating instances and, in general, wreaking havoc. The philosophy is that you should always treat your environments as "disposable" because they will always fail…eventually. In practice, this is a new mindset, but it shouldn't be. It highlights the difference between a Traditional Operations mindset and a Cloud Operations mindset.
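To make the premise concrete, here is a minimal Python sketch of the idea. It is a toy simulation, not Netflix's tool: the `Fleet` class, the `unleash_monkey` function and the instance ids are all hypothetical, and nothing here talks to AWS. It only shows the loop of "terminate something at random, then let the environment heal itself".

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of cloud instances."""
    desired_count: int
    instances: set = field(default_factory=set)
    _next_id: int = 0

    def launch(self) -> str:
        """Start a fresh instance with a new, never-reused id."""
        instance_id = f"i-{self._next_id:04d}"
        self._next_id += 1
        self.instances.add(instance_id)
        return instance_id

    def heal(self) -> None:
        """Replace missing instances until the desired count is restored."""
        while len(self.instances) < self.desired_count:
            self.launch()


def unleash_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim


fleet = Fleet(desired_count=3)
fleet.heal()                      # initial launch of three instances
rng = random.Random(42)           # seeded so runs are repeatable
for _ in range(5):
    killed = unleash_monkey(fleet, rng)
    fleet.heal()                  # the fleet recovers without manual work
    print(f"terminated {killed}, fleet is now {sorted(fleet.instances)}")
```

If the fleet can only be restored by hand, the sketch breaks down at `heal()`, which is exactly the weakness a tool like the Chaos Monkey is meant to expose.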
We architect and operate Continuous Delivery systems in the Cloud and help companies migrate their operational infrastructure to the Cloud. We've found that what prevents teams from getting the most benefit from the cloud is a traditional operations mindset.
The traditional operations mindset posits that hardware environments are not ephemeral and are something to nurture and maintain for weeks, months or even years. An informed cloud mindset assumes that anything and everything will fail eventually, and therefore treats all environments as "disposable". I emphasize an informed Cloud mindset because many teams bring the traditional Operations mindset with them when moving to the Cloud.
Since we do a lot of work with AWS environments, we've noticed some interesting antipatterns when working with traditional Development and Operations teams who aren't used to working with a cloud mindset. I've listed some of these antipatterns below.
Environment Lease Time Policies
The traditional operations mindset believes that environment lease times are perpetual. You can spot this on a project that uses the cloud when development and QA lease times are continually extended. The cloud mindset treats all environments as ephemeral. Reasonable lease times on a cloud project could be as long as 14 days or as short as a couple of hours. There are obviously steady-state runtime environments in the cloud, but even for these you should be able to move the entire environment to other instances at a moment's notice. In AWS, tools like the Elastic Load Balancer, CloudWatch and Auto Scaling support failover architectures such as this.
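A lease policy like this is easy to enforce with a small script. The following Python sketch assumes a hypothetical `Environment` record with a launch time and a lease; the environment names and the fixed `now` timestamp are made up for illustration. A real version would read the launch time from your cloud provider and terminate the instances instead of printing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Environment:
    """Hypothetical record of a leased environment."""
    name: str
    launched_at: datetime
    lease: timedelta

    def expires_at(self) -> datetime:
        return self.launched_at + self.lease


def expired(environments, now):
    """Return the environments whose lease has run out as of `now`."""
    return [env for env in environments if env.expires_at() <= now]


# Arbitrary reference time so the example is repeatable.
now = datetime(2020, 1, 15, 12, 0)
environments = [
    Environment("qa-smoke", now - timedelta(hours=3), timedelta(hours=2)),
    Environment("dev-feature", now - timedelta(days=2), timedelta(days=14)),
    Environment("qa-regression", now - timedelta(days=15), timedelta(days=14)),
]

for env in expired(environments, now):
    # A real reaper would terminate the environment here.
    print(f"lease expired, terminate {env.name}")
```

Run on a schedule, a check like this makes expiry the default and an extension the exception, rather than the other way around.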
Centralized Control
The traditional operations mindset is built around control. This is because traditionally, it's the Ops team that's responsible for ensuring the applications are always up and running. Bottom line: their ass is on the line. This means that whenever you request a resource from an Ops team, such as a virtual environment or a database, the request is put into a queue and you must wait your turn based on the priority and request load of the Ops team. In a cloud operations mindset, control over requesting resources can be more decentralized. This is coupled with fully versioned assets. Traditional operations teams typically fear decentralized control because configuration assets are not managed or versioned. When these assets are managed and versioned, it's much easier to allow anyone to request any resource, particularly in non-production environments, because they can be easily re-provisioned or reconfigured at any point. In cloud operations, resource requests can be asynchronous through the use of fully automated configuration of environments and other resources.
Lack of Configuration Management
In a traditional operations mindset, configuration is typically hidden on someone's machine, embedded within a tool managed by the Ops team or simply in the heads of one or a few people on the Ops team. The reason is that the Ops team must control and secure the information, particularly in Staging and Production environments. However, the problem with this approach is that the information is locked away in a few people's heads, which creates a significant process bottleneck that slows down the entire software delivery process.
In a cloud operations mindset, all configuration is managed in a database or in configuration files accessible to any tool the software team uses. This doesn't mean that everyone has access to all configuration values in all environments (such as, say, Production), but it does mean that any team member can perform a self-service deployment without going through a separate Operations team.
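As a rough sketch of what "managed but not wide open" can look like, the Python example below keeps configuration in one shared structure and checks a role's access per environment before returning values. The roles, host names and settings are invented for illustration. In practice the data would live in a database or in versioned files, and the access check would be enforced by that store rather than by the caller.

```python
# Hypothetical configuration store, keyed by environment.
CONFIG = {
    "development": {"db_host": "dev-db.internal", "pool_size": 5},
    "production": {"db_host": "prod-db.internal", "pool_size": 50},
}

# Hypothetical access rules: which role may read which environment.
READ_ACCESS = {
    "developer": {"development"},
    "release-pipeline": {"development", "production"},
}


def get_config(role, environment):
    """Return a copy of an environment's configuration if the role may read it."""
    if environment not in READ_ACCESS.get(role, set()):
        raise PermissionError(f"{role} may not read {environment} configuration")
    return dict(CONFIG[environment])


print(get_config("developer", "development"))
print(get_config("release-pipeline", "production"))

try:
    get_config("developer", "production")
except PermissionError as error:
    print(f"denied: {error}")
```

The developer never sees Production values, yet can still trigger a deployment that the pipeline carries out with its own access.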
Golden Images
Golden images are particularly insidious because it can seem like you're doing the right thing, but you're not. Having a golden image is better than having nothing at all. A golden image is an antipattern in which you have a snapshot of an instance/environment at a particular point in time. Some teams might even regularly snapshot their images, which is a good practice. However, the installation and configuration it took to create the image is lost. When employing the golden image antipattern, there's no way that anyone can recreate the environments in the exact same manner every single time. Moreover, the steps it might take are either locked in team members' heads or captured statically at a particular point in time through documentation. Having documentation to manually configure the environment is definitely better than no documentation, but compared with scripted automation it significantly reduces the reliability and repeatability of environments. The cloud operations mindset says that all of the steps in creating environments are scripted and versioned in a version-control system. And any engineer on the team should be capable of recreating these environments by typing a single command, by clicking a button, or headlessly through a Continuous Integration tool.
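The contrast with a golden image can be sketched in a few lines of Python. Here the environment is described by a small spec that would live in version control, and the build steps are derived from it, so every run produces the same sequence. The spec contents, `build_steps` and `fingerprint` are hypothetical, and the steps are only printed rather than executed against a real instance.

```python
import hashlib
import json

# Hypothetical environment definition, checked into version control.
ENVIRONMENT_SPEC = {
    "base_image": "linux-base",
    "packages": ["openjdk", "tomcat"],
    "settings": {"http_port": 8080},
}


def build_steps(spec):
    """Derive the ordered provisioning steps from the spec."""
    steps = [f"launch instance from {spec['base_image']}"]
    steps += [f"install {package}" for package in spec["packages"]]
    steps += [f"set {key}={value}" for key, value in sorted(spec["settings"].items())]
    return steps


def fingerprint(spec):
    """Hash the spec so two builds can be traced to the same definition."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(f"building environment from spec {fingerprint(ENVIRONMENT_SPEC)[:12]}")
for step in build_steps(ENVIRONMENT_SPEC):
    # A real script would run each step here instead of printing it.
    print(step)
```

Because the steps come from the spec and not from a snapshot, changing the environment means changing the spec, and the history of those changes is preserved alongside the code.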
This touches on only a few of the antipatterns that occur when applying a traditional Operations mindset to the Cloud. Teams won't realize the myriad benefits when moving to the Cloud until they change their mindset.
Start thinking like the Chaos Monkey and employing a Cloud Operations mindset!