Do More with Less Today

DevOps can mean different things to different people. At Stelligent we consider DevOps to be a collection of principles and practices for delivering software faster, more efficiently, and more securely.

When implemented effectively, these principles and practices will make your team more efficient and effective, allowing you to do more with less, thereby saving money. This is supported by evidence, with research for the State of DevOps Report indicating that “highest performers are twice as likely to meet or exceed their organizational performance goals”, including profitability and customer satisfaction.

In this post we’ll cover what some of those principles and practices are, why they’re important, and how they reduce cost and increase efficiency. We’ll also cover metrics you can use to measure your current DevOps capabilities and monitor progress. Finally we’ll look at how you can implement these ideas to reduce cost and increase efficiency.

DevOps Principles and Practices

People and Culture Matter

It isn’t an accident that people and culture are at the top of this list. Having the right people, and employing a DevOps culture, will make more of a difference to your long term success than anything else on this list.

What do we mean by DevOps culture? First and foremost a DevOps culture is about shared responsibility for delivering software.

In a traditional organization software development teams and infrastructure operations teams are siloed and given different objectives. Software teams are tasked with delivering new features to customers as quickly as possible. Operations teams are tasked with ensuring the stability of the system and minimizing downtime.

At first glance this probably doesn’t seem too bad. Everyone specializes in what they’re good at and is tasked with achieving an important organizational goal. So what’s the problem?

There are a couple.

First, you have a software development team that is not incentivized to ensure the security and stability of the software they write. That’s the operations team’s problem. Your developers will just create the feature as quickly as they can and get it out the door.

Second, you have an operations team that is only incentivized to maximize system stability, and has no interest in facilitating the addition of new features. That’s problematic because the best way to ensure system stability is to prevent changes.

And that’s the problem: you now have two teams with objectives that are in direct conflict, and that will never end well.

The solution to this problem is to tear down the silo walls, align incentives, and have your teams share responsibility for delivering new features and ensuring system stability. If you can do that, you’ll be well on your way to improving the efficiency of your teams.

Share Knowledge

This is part of having a DevOps culture, but it’s important enough that it deserves its own section.

Engineers have an unfortunate tendency not to share knowledge actively. The reasons are often benign: a lack of incentive to share, other priorities, not enough time, and so on. Sometimes the reason is less benign, and engineers hoard key knowledge to make themselves indispensable and protect their job security.

Whatever the underlying motives, the result is the same: an organization that is siloed, secretive, and slow to change. Then when team members are hit by the proverbial bus and leave your organization, their accumulated knowledge is lost and damage is done to the organization.

Losing this knowledge often translates into slower development and decreased stability of your software systems, both of which cost money to remedy.

Having a culture that encourages and rewards the sharing of information eliminates many of these problems, and makes your organization less susceptible to losing key knowledge when team members head for greener pastures. Incentivize cross-training by setting up regular sharing sessions for engineers, and encouraging engineers to present and participate.

Automate Everything

At Stelligent a fundamental goal is to “automate everything”.

Why do we think automation is so important? Because automated processes are reliable, predictable, and repeatable, while manual processes are usually anything but. You also can’t lose process knowledge to brain drain when the process is automated and stored as code.

Whole books have been written on CI/CD and automation (Continuous Integration, Continuous Delivery, Infrastructure as Code), so we won’t get into too much detail here. At a high level, automated processes are better than manual processes because they are standardized, repeatable, and self-documenting. If you want more details, we have plenty of blog posts on automation that you can check out (Continuous Compliance on AWS, Deployment Automation, Infrastructure Automation).

What Metrics To Use

We know DevOps can help us reduce cost and increase efficiency. How do we go about measuring our progress and whether or not we’re actually improving? How do we know that we’re measuring what actually matters?

Thanks to research by Forsgren, et al. for the State of DevOps Report and the book Accelerate, we know that the four key metrics for measuring DevOps success are deployment frequency, lead time for changes, change failure rate, and mean time to restore service.

We’ll briefly cover each of these metrics and why they’re important below, but if you’d like to learn more about these metrics you can start by reviewing Stelligent CTO Paul Duvall’s post on Measuring DevOps Success with Four Key Metrics. We also highly recommend the aforementioned State of DevOps Report and Accelerate.

Deployment Frequency

Deployment frequency is a measure of how often you are deploying new code or features to production. Deployment frequency is important because it gives you an indication of how often you’re able to deliver something of value to your customers.

Lead Time for Changes

Lead time for changes measures how long it takes for new code and features to make it into production after being committed to your code repository. Being able to make changes quickly means you can respond quickly to changing market conditions and customer needs, which helps keep you competitive.

Change Failure Rate

Change failure rate is a measure of how often changes to production need immediate remedy, usually by rolling back those changes.

These occurrences are bad, and we want to keep their rate as low as possible. Successfully doing so indicates we have strong processes in place, including robust test automation, which allows us to keep the change failure rate low while simultaneously deploying code quickly and frequently.

Typical causes of a high change failure rate include insufficient test automation and change batches that are too large.

Mean Time to Restore Service

Mean time to restore service (often abbreviated MTTR) is a metric indicating the average amount of time it takes to restore service when an outage or other incident occurs. Minimizing this amount of time is important. When services are down you aren’t providing services to your customers and you’re harming their trust in your ability to provide those services.

Using DevOps To Reduce Cost and Increase Efficiency

Now that we’ve covered DevOps principles, practices, and metrics, let’s look at how to apply these ideas to reduce cost and increase efficiency.

Gather Metrics

We’ve established what metrics you should be using, so now let’s discuss how you can actually collect them, and how to use them to judge your organization’s performance.

Deployment Frequency

Deployment frequency is fairly straightforward to measure, especially if you’re already doing automated deployments. To measure it we simply need to count the number of times we’re deploying code to production over a given period of time. We can do so by adding a step to our deployment stage that updates a database with information about when the deployment occurred. We can then query that database to calculate our daily (or weekly or monthly) deployment frequency.
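As a rough illustration of the approach described above, here is a minimal sketch in Python. It assumes a SQLite database stands in for "a database", and the table name (deployments) and function names are hypothetical, chosen only for this example. Your pipeline tooling and data store will likely differ.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) a hypothetical deployment log backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deployments ("
        "id INTEGER PRIMARY KEY, deployed_at TEXT NOT NULL)"
    )
    return conn


def record_deployment(conn, deployed_at=None):
    """Intended to be called by the last step of a production deployment."""
    deployed_at = deployed_at or datetime.now(timezone.utc)
    # Store ISO 8601 text so the date is always the first 10 characters.
    conn.execute(
        "INSERT INTO deployments (deployed_at) VALUES (?)",
        (deployed_at.isoformat(),),
    )
    conn.commit()


def deployments_per_day(conn):
    """Return {YYYY-MM-DD: count} for each day with at least one deployment."""
    rows = conn.execute(
        "SELECT substr(deployed_at, 1, 10) AS day, COUNT(*) "
        "FROM deployments GROUP BY day ORDER BY day"
    )
    return dict(rows.fetchall())
```

A deployment stage could call record_deployment(conn) once the release succeeds, and a reporting job could call deployments_per_day(conn) to get the daily counts. Weekly or monthly figures can be derived by grouping on a different date prefix.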

Once you have that information, you need a benchmark to compare yourself against. The aforementioned State of DevOps Report breaks organizations into four performance categories: Low, Medium, High, and Elite. Low performers deploy between once per month and once every six months. Medium performers deploy between once per week and once per month. High performers deploy between once per day and once per week. Elite performers deploy on demand, multiple times per day.
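To make the comparison concrete, the sketch below maps a deployment count over a period onto the four categories above. The function name is hypothetical, and the exact cut-offs are our assumptions for illustration: a month is treated as 30 days, boundary values fall into the higher tier, and anything slower than once per month is reported as Low.

```python
def deployment_frequency_tier(deployments, days):
    """Classify an average deployment rate into a performance category.

    Assumptions (not from the report): a month is 30 days, and a rate
    that sits exactly on a boundary is placed in the higher tier.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    per_day = deployments / days
    if per_day > 1:          # multiple deployments per day
        return "Elite"
    if per_day >= 1 / 7:     # between once per day and once per week
        return "High"
    if per_day >= 1 / 30:    # between once per week and once per month
        return "Medium"
    return "Low"             # once per month or slower
```

For example, 12 deployments over 30 days averages out to more than one per week but less than one per day, so the function returns "High".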

Lead Time for Changes

Lead time for changes is slightly more complex to measure than deployment frequency, but it can still be measured relatively easily. To measure lead time for changes we need to record when changes to code are made, and when those changes are deployed to production. As with deployment frequency, we can add steps to our deployment pipelines that record the time at which those changes occur, and when the changes complete our deployment process and make it into production.
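Once both timestamps are recorded, the calculation itself is small. The following is a minimal sketch, assuming each change is available as a (committed_at, deployed_at) pair of datetimes. The function name and the data shape are hypothetical and would depend on how your pipeline stores its records.

```python
from datetime import datetime, timedelta


def average_lead_time(changes):
    """Average time from commit to production.

    changes: iterable of (committed_at, deployed_at) datetime pairs.
    """
    lead_times = [deployed - committed for committed, deployed in changes]
    if not lead_times:
        raise ValueError("no changes recorded")
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(lead_times, timedelta()) / len(lead_times)
```

The result is a timedelta, so it can be compared directly against thresholds such as timedelta(days=1) or timedelta(weeks=1).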

For Low performers, average lead time is between one month and six months. For Medium performers the average time is between one week and one month. High performers have an average lead time of between one day and one week. Elite performers have an average lead time of less than one day.

Change Failure Rate

To measure change failure rate we need to be able to record the number of deployments we make, and the number of these deployments that fail or have issues that require immediate rectification. The most straightforward way to measure this is to record the number of rollbacks we perform, and divide that by the total number of deployments performed.
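Expressed as code, the rollback-based calculation looks roughly like this. The function name is hypothetical, and the sketch assumes every failed change results in exactly one recorded rollback, which may undercount failures that are fixed forward instead.

```python
def change_failure_rate(rollbacks, deployments):
    """Fraction of deployments that had to be rolled back (0.0 to 1.0)."""
    if deployments <= 0:
        raise ValueError("at least one deployment is required")
    if not 0 <= rollbacks <= deployments:
        raise ValueError("rollbacks must be between 0 and deployments")
    return rollbacks / deployments
```

With 3 rollbacks across 40 deployments, the rate is 3 / 40, or 7.5%.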

Low performers have a change failure rate that averages between 46% and 60%. Medium, High, and Elite performers all have a change failure rate between 0% and 15%.

Mean Time to Restore Service

Measuring mean time to restore service requires you to record when service disruptions begin and when they end. How you go about doing that will depend on how you monitor your services, but this metric is just the average amount of time that each outage lasts.
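The averaging step can be sketched as follows, assuming your monitoring can export each incident as a (started_at, restored_at) pair. The function name and data shape are hypothetical, and skipping incidents that are still open (restored_at is None) is our assumption for this example.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average outage duration.

    incidents: iterable of (started_at, restored_at) pairs; incidents
    that are still open (restored_at is None) are left out of the average.
    """
    durations = [
        restored - started
        for started, restored in incidents
        if restored is not None
    ]
    if not durations:
        raise ValueError("no resolved incidents recorded")
    return sum(durations, timedelta()) / len(durations)
```

Two resolved outages of 30 and 90 minutes give a mean time to restore of one hour.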

Low performers have an average MTTR between one week and one month. Medium and High performers average less than one day, and Elite performers average less than one hour.

Change Culture

Employing a DevOps culture, as covered previously, ensures your teams are properly aligned and incentivized. That will allow them to achieve better outcomes with the same or fewer resources, which means money saved and efficiency increased.

Sharing knowledge unfortunately won’t reduce cost in the short term, but in the long term it will pay off in spades. Knowledge sharing is an investment in your human capital, making your team more knowledgeable and skilled. In the long term it’ll also reduce your bus factor and mitigate negative effects from turnover, which will help prevent service disruptions.

An automated process replaces a manual process, and manual processes are carried out by people. Every automated process is a process you no longer have to pay someone to perform. That frees them up to engage in activities that are more productive and valuable to the company, which is the definition of increased efficiency.

Once you’re able to start implementing these changes, you will see improvements in your organization’s performance, and those improvements will show up in the metrics you’ve been gathering.

Reap the Rewards

Implementing these changes and monitoring these metrics will give you the means to make positive change, and a concrete way to measure your progress. This progress isn’t for its own sake; the evidence suggests that it will have a drastic, positive impact on your organization’s ability to meet its goals.

According to the State of DevOps Report, organizations in these higher performing tiers are “twice as likely to meet or exceed their organizational performance goals”, including “profitability, productivity, and customer satisfaction”.

Next Steps

What steps can you take to move forward on your DevOps journey? If you aren’t already, begin implementing the principles and practices we’ve outlined here. You can start by automating a few small processes and go from there. Begin measuring the key metrics, so you can monitor and measure your progress. Most importantly, keep at it. Building good automation takes time, and your automation will evolve as your organization and processes evolve. If you’re able to continuously improve, your organization will be able to become an Elite performer over time.

If you think your company could benefit from Stelligent’s knowledge and experience implementing DevOps principles and practices for some of the largest companies in the world, reach out to us. We’d be happy to discuss how we can help you start your DevOps journey.
