Continuous delivery seeks to release software to its users faster and more often. The AWS CodePipeline service is an extremely valuable tool for implementing continuous delivery methodologies. However, simply having a continuous delivery pipeline does not tell you whether you are realizing the benefits that continuous delivery promises. In this post, I will demonstrate how to collect metrics and generate a dashboard to assess the health of your continuous delivery pipelines. Additionally, you will be able to run the same dashboard in your own account by clicking a “Launch Stack” button and going through the AWS CloudFormation steps to launch the solution stack.
As demonstrated in a prior post, we can now write code that reacts to the events that CodePipeline publishes to Amazon CloudWatch Events any time an action, stage, or pipeline changes state. I will use a similar approach by having a CloudWatch Event Rule Target invoke a Lambda function to create CloudWatch Metrics that capture the activity from CodePipeline, as shown in Figure 1.
    EventHandlerFunction:
      Type: 'AWS::Serverless::Function'
      Properties:
        FunctionName: pipeline-dashboard-event-handler
        Description: Create CloudWatch metrics from CodePipeline events
        Handler: index.handlePipelineEvent
        Runtime: nodejs6.10
        CodeUri: .
        Role: !GetAtt EventHandlerRole.Arn
        Events:
          PipelineEventRule:
            Type: CloudWatchEvent
            Properties:
              Pattern:
                source:
                  - "aws.codepipeline"
                detail-type:
                  - "CodePipeline Pipeline Execution State Change"
                  - "CodePipeline Stage Execution State Change"
                  - "CodePipeline Action Execution State Change"
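
To make the event handler's job more concrete, here is a minimal sketch of how a CodePipeline state-change event could be mapped to the parameters of a CloudWatch `PutMetricData` call. It is not the code from the solution itself: the `Pipeline` namespace, the `SuccessCount` and `FailureCount` metric names, and the `buildMetricData` helper are assumptions made up for illustration, and only pipeline-level events are handled.

```javascript
// Sketch only: the namespace and metric names are hypothetical, not the solution's own.
const NAMESPACE = 'Pipeline';

// Map one CodePipeline event to PutMetricData parameters, or null if it is ignored.
function buildMetricData(event) {
  const detail = event.detail || {};
  if (!detail.pipeline || !detail.state) {
    return null;
  }
  // This sketch only counts pipeline-level state changes.
  if (event['detail-type'] !== 'CodePipeline Pipeline Execution State Change') {
    return null;
  }

  let metricName;
  switch (detail.state) {
    case 'SUCCEEDED': metricName = 'SuccessCount'; break;
    case 'FAILED': metricName = 'FailureCount'; break;
    default: return null; // STARTED and other states are not recorded here
  }

  return {
    Namespace: NAMESPACE,
    MetricData: [{
      MetricName: metricName,
      Dimensions: [{ Name: 'PipelineName', Value: detail.pipeline }],
      Timestamp: new Date(event.time),
      Unit: 'Count',
      Value: 1,
    }],
  };
}

// Example event, trimmed to the fields the sketch reads.
const sampleEvent = {
  'detail-type': 'CodePipeline Pipeline Execution State Change',
  source: 'aws.codepipeline',
  time: '2018-01-01T12:00:00Z',
  detail: { pipeline: 'my-pipeline', state: 'SUCCEEDED' },
};
console.log(JSON.stringify(buildMetricData(sampleEvent), null, 2));
```

In a real handler, the returned object would be passed to the AWS SDK's CloudWatch `putMetricData` call. Keeping the mapping in a pure function makes it easy to test without AWS credentials.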
All that remains is to create the CloudWatch Dashboard to visualize the metrics we are now capturing. Although CloudWatch Dashboards can be created with CloudFormation, we are unable to define dynamic dashboards that list metrics about all the active pipelines in our account. Therefore, we have another Lambda function that is triggered by CloudWatch Scheduled Events to generate the dashboard as shown in Figure 2.
    DashboardGeneratorFunction:
      Type: 'AWS::Serverless::Function'
      Properties:
        FunctionName: pipeline-dashboard-generator
        Description: Build CloudWatch dashboard from CloudWatch metrics
        Handler: index.generateDashboard
        Runtime: nodejs6.10
        CodeUri: .
        Timeout: 60
        Role: !GetAtt DashboardGeneratorRole.Arn
        Events:
          DashboardEventRule:
            Type: Schedule
            Properties:
              Schedule: "cron(*/5 * * * ? *)"
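
As a rough illustration of what the generator has to produce, the sketch below builds a CloudWatch dashboard body with one row of widgets per pipeline. The `buildDashboardBody` helper, the `Pipeline` namespace, the metric names, and the widget sizes are all assumptions for the example rather than the solution's actual layout.

```javascript
// Sketch only: metric names and namespace are hypothetical.
const DASHBOARD_METRICS = ['CycleTime', 'LeadTime', 'MTBF', 'MTTR', 'FeedbackTime'];

// Build the JSON string for a dashboard with one row of widgets per pipeline.
function buildDashboardBody(pipelineNames, region) {
  const widgetWidth = 4;  // the dashboard grid is 24 columns wide, so 5 widgets fit in a row
  const widgetHeight = 3;
  const widgets = [];

  // De-duplicate and sort so the layout stays stable between scheduled runs.
  const names = Array.from(new Set(pipelineNames)).sort();

  names.forEach((name, row) => {
    DASHBOARD_METRICS.forEach((metric, col) => {
      widgets.push({
        type: 'metric',
        x: col * widgetWidth,
        y: row * widgetHeight,
        width: widgetWidth,
        height: widgetHeight,
        properties: {
          title: `${name} ${metric}`,
          view: 'singleValue',
          region: region,
          stat: 'Average',
          period: 86400,
          metrics: [['Pipeline', metric, 'PipelineName', name]],
        },
      });
    });
  });

  return JSON.stringify({ widgets: widgets });
}

console.log(buildDashboardBody(['orders', 'billing', 'orders'], 'us-east-1'));
```

The resulting string is the kind of value the CloudWatch `PutDashboard` API accepts as `DashboardBody`. The list of pipeline names could come from a `ListMetrics` call filtered on the namespace, which is how a scheduled function can pick up new pipelines without a template change.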
Dashboard in Action
The event runs every 5 minutes and rebuilds the list of pipelines in the dashboard based on the distinct pipeline names found on the metrics that have been recorded. The resulting dashboard, shown in Figure 3, displays 5 metrics for each pipeline:
Cycle Time
- What is it? This metric shows how often software is being delivered into production. This is more a measure of throughput than a measure of duration. In fact, it is the inverse of throughput, so if your pipeline is completing 8 deployments per day, then your Cycle Time is 3 hours (or 1/8 of a day).
- How is it calculated? The mean interval of time between two consecutive successful pipeline executions.
- What is it used for? This metric is a good indication of the batch size of production releases. For example, a cycle time of 30 days is likely an indication of a pipeline that is delivering risky deployments to production that include a month’s worth of new software development.
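
The calculation above can be sketched in a few lines. This is an illustration under stated assumptions, not the solution's code: the `meanCycleTimeMs` helper is hypothetical, and it assumes you already have the completion times of successful executions as epoch milliseconds.

```javascript
// Mean interval between consecutive successful pipeline executions.
// `successTimes` holds completion times in epoch milliseconds (an assumed input shape).
function meanCycleTimeMs(successTimes) {
  if (successTimes.length < 2) {
    return null; // at least two successes are needed to form an interval
  }
  const sorted = successTimes.slice().sort((a, b) => a - b);
  // The consecutive intervals sum to (last - first), so one subtraction is enough.
  return (sorted[sorted.length - 1] - sorted[0]) / (sorted.length - 1);
}

// Eight deployments in one day, one every three hours.
const HOUR_MS = 60 * 60 * 1000;
const deployments = [0, 3, 6, 9, 12, 15, 18, 21].map(h => Date.UTC(2018, 0, 1, h));
console.log(meanCycleTimeMs(deployments) / HOUR_MS); // 3
```

This reproduces the example from the definition: 8 deployments per day works out to a Cycle Time of 3 hours.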
Lead Time
- What is it? This metric shows how long it takes for a single commit to get all the way into production. Whereas Cycle Time measures the time between multiple pipeline executions, Lead Time looks at the time it takes for a single pipeline execution to complete.
- How is it calculated? The mean amount of time from commit to production, including rework.
- What is it used for? This is often the number the business cares about most, as it represents how long it takes for a feature to get into the hands of the customer. If this number is too large, there are two areas that could be the root cause. First, the execution time for the pipeline may be too long due to manual steps that should be replaced with automation. Second, look at improving the availability of the pipeline, calculated as MTBF / (MTBF + MTTR), by reviewing the other metrics (MTBF, MTTR, and Feedback Time).
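
A minimal sketch of the Lead Time calculation follows. The record shape (`committedAt`, `finishedAt`, `state`) and the `meanLeadTime` helper are assumptions for illustration. To include rework, `committedAt` would need to be the time of the original commit rather than the commit that finally succeeded, which this sketch leaves to the caller.

```javascript
// Mean time from commit to production across successful executions.
// Each record is assumed to look like { state, committedAt, finishedAt },
// with both times in the same unit (minutes in the example below).
function meanLeadTime(executions) {
  const durations = executions
    .filter(e => e.state === 'SUCCEEDED')
    .map(e => e.finishedAt - e.committedAt);
  if (durations.length === 0) {
    return null; // nothing has reached production yet
  }
  return durations.reduce((sum, d) => sum + d, 0) / durations.length;
}

const leadTimeSample = [
  { state: 'SUCCEEDED', committedAt: 0, finishedAt: 40 },
  { state: 'FAILED', committedAt: 50, finishedAt: 60 },   // never reached production, so ignored
  { state: 'SUCCEEDED', committedAt: 70, finishedAt: 130 },
];
console.log(meanLeadTime(leadTimeSample)); // 50
```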
Cycle Time vs Lead Time
One thing worth highlighting is the difference between Cycle Time and Lead Time, as they are often conflated. I appreciate the simplicity with which Paulo Caroli defined them in his post, Continuous Delivery: lead time and cycle time.
- Lead time: the amount of time a work item takes from the beginning to the end of the workflow
- Cycle time: the interval of time between two consecutive work items leaving the workflow.
To help understand the difference, let’s compare the two metrics in the following scenarios. Notice that Lead Time is the same for the pipelines in both scenarios. However, the Cycle Time is much smaller in Figure 5 because the second commit happened sooner and the pipelines are running in parallel.
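
The same comparison can be worked through with numbers. The timings below are invented for illustration and are not taken from the figures; the `summarize` helper is likewise hypothetical.

```javascript
// Each run is { commit, done } in minutes since an arbitrary start (made-up values).
function summarize(runs) {
  const leadTimes = runs.map(r => r.done - r.commit);
  const leadTime = leadTimes.reduce((a, b) => a + b, 0) / leadTimes.length;

  const finishes = runs.map(r => r.done).sort((a, b) => a - b);
  const cycleTime = finishes.length < 2
    ? null
    : (finishes[finishes.length - 1] - finishes[0]) / (finishes.length - 1);

  return { leadTime: leadTime, cycleTime: cycleTime };
}

// Second commit arrives long after the first run has finished.
const sequentialRuns = [{ commit: 0, done: 60 }, { commit: 120, done: 180 }];
// Second commit arrives while the first run is still in progress.
const overlappingRuns = [{ commit: 0, done: 60 }, { commit: 30, done: 90 }];

console.log(summarize(sequentialRuns));  // { leadTime: 60, cycleTime: 120 }
console.log(summarize(overlappingRuns)); // { leadTime: 60, cycleTime: 30 }
```

Every run takes 60 minutes from commit to production in both cases, so Lead Time is unchanged, while Cycle Time drops from 120 to 30 minutes once runs overlap.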
To understand the significance of the additional metrics (MTBF, MTTR, and Feedback Time), it helps to visualize the metrics in the context of the continuous delivery pipeline itself:
MTBF
- What is it? This metric shows how often the pipeline fails, whether due to defective commits or to external system dependencies like test data.
- How is it calculated? The mean interval of time between the start of a successful pipeline execution and the start of a failed pipeline execution.
- What is it used for? This number should be high in comparison to MTTR to ensure the pipeline has a high availability, calculated as MTBF / (MTBF + MTTR). If this number is low, then consider improving the reliability of the pipeline by first researching whether the root cause is the quality of new code being committed or the repeatability of the infrastructure and test automation.
MTTR
- What is it? This metric shows how long it takes to fix the pipeline.
- How is it calculated? The mean interval of time between the start of a failed pipeline execution and the start of a successful pipeline execution.
- What is it used for? This number should be low, as it is a measure of a team’s ability to “stop the line” when a build fails and swarm on resolving it. If the Feedback Time is high, then consider addressing that first; otherwise, the issue is with the team’s responsiveness to failures.
Feedback Time
- What is it? This metric shows how quickly we can identify failures in an automated manner.
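
MTBF, MTTR, and availability can all be derived from the same ordered list of executions. The sketch below is one possible reading of the definitions above, not the solution's implementation: it measures from the start of the first execution in a run of successes to the start of the failure that ends it (and the reverse for MTTR). The record shape and the `reliability` helper are assumptions.

```javascript
// Each record is assumed to look like { state, startedAt }, with startedAt in hours.
function reliability(executions) {
  const sorted = executions.slice().sort((a, b) => a.startedAt - b.startedAt);
  const uptimes = [];   // success start -> next failure start
  const downtimes = []; // failure start -> next success start
  let streakState = null;
  let streakStart = null;

  for (const e of sorted) {
    if (e.state !== 'SUCCEEDED' && e.state !== 'FAILED') continue;
    if (streakState === null) {
      streakState = e.state;
      streakStart = e.startedAt;
    } else if (e.state !== streakState) {
      // The pipeline flipped between healthy and broken: record the interval.
      const interval = e.startedAt - streakStart;
      (streakState === 'SUCCEEDED' ? uptimes : downtimes).push(interval);
      streakState = e.state;
      streakStart = e.startedAt;
    }
  }

  const mean = xs => (xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : null);
  const mtbf = mean(uptimes);
  const mttr = mean(downtimes);
  const availability = mtbf !== null && mttr !== null ? mtbf / (mtbf + mttr) : null;
  return { mtbf: mtbf, mttr: mttr, availability: availability };
}

const history = [
  { state: 'SUCCEEDED', startedAt: 0 },
  { state: 'SUCCEEDED', startedAt: 2 },
  { state: 'FAILED', startedAt: 9 },
  { state: 'SUCCEEDED', startedAt: 10 },
  { state: 'FAILED', startedAt: 19 },
  { state: 'SUCCEEDED', startedAt: 20 },
];
console.log(reliability(history)); // mtbf 9 hours, mttr 1 hour, availability about 0.9
```

Note the parentheses in the availability formula: it is MTBF / (MTBF + MTTR), so a pipeline that runs 9 hours between failures and takes 1 hour to fix is available about 90% of the time.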
- How is it calculated? The mean amount of time from commit to failure of a pipeline execution.
- What is it used for? This number should be low since it affects MTTR. Ideally, failures would be detected as early as possible in the pipeline, rather than finding them farther along in the pipeline.
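
Because the goal is to catch failures early, it can help to look at Feedback Time per stage as well as overall. The sketch below assumes each failure record carries the failing stage name plus commit and failure times; the `feedbackTime` helper, the field names, and the stage names are all hypothetical.

```javascript
// Each record is assumed to look like { stage, committedAt, failedAt }, times in minutes.
function feedbackTime(failures) {
  if (failures.length === 0) {
    return { mean: null, byStage: {} };
  }
  const sums = new Map(); // stage -> { count, total }
  let total = 0;

  failures.forEach(f => {
    const delay = f.failedAt - f.committedAt;
    total += delay;
    const entry = sums.get(f.stage) || { count: 0, total: 0 };
    entry.count += 1;
    entry.total += delay;
    sums.set(f.stage, entry);
  });

  const byStage = {};
  sums.forEach((entry, stage) => {
    byStage[stage] = entry.total / entry.count;
  });
  return { mean: total / failures.length, byStage: byStage };
}

const failureSample = [
  { stage: 'Build', committedAt: 0, failedAt: 5 },
  { stage: 'Build', committedAt: 100, failedAt: 107 },
  { stage: 'Acceptance', committedAt: 200, failedAt: 230 },
];
console.log(feedbackTime(failureSample));
// mean is 14 minutes; Build averages 6 minutes, Acceptance 30 minutes
```

A high overall mean driven by late stages suggests moving checks earlier in the pipeline, which in turn helps keep MTTR low.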
When you are ready to launch the solution in your own account, click the “Launch Stack” button for the region you want to launch the stack into and follow the prompts.
Once the CloudFormation stack is in the CREATE_COMPLETE state, go ahead and trigger an execution of one of your existing CodePipelines. Upon completion, navigate to CloudWatch Dashboards and click on the Pipelines dashboard to view the results. The longer you allow the dashboard to collect data, the more interesting the dashboard will become.
Remember, it is never enough to just build solutions that ought to benefit your organization. You must also measure results and make corrections to ensure the solution does realize the benefits you expected!
Here are some of the resources referenced in this post:
- Source code for this post
- Get Notified on AWS CodePipeline Errors
- AWS Serverless Application Model
- Continuous Delivery: lead time and cycle time