This article covers one method of automatically creating CloudWatch Dashboards for several resource types, while supporting arbitrary grouping.  Working knowledge of Terraform 0.12.x is advised.

Here at Stelligent, we are all about shortening and otherwise improving the feedback loop between developers and users.  We have spent a lot of time showing you how to automate your builds, your tests and other quality controls, and your deployment. In this article, we’re going to focus on the second half of the loop, which is monitoring. AWS CloudWatch is a collection of several great native monitoring solutions.  CloudWatch Dashboards is just one of them.  I’ll show you how to make sure your CloudWatch Dashboards always include all of your resources in their graphs.

The code samples are taken from the repository accompanying this article:
https://github.com/stelligent/dynamic-cloudwatch-dashboards-by-tag

Anatomy of a Dashboard

Examples:

A quick Google search for “CloudWatch Dashboard Example” shows at a glance what is possible without installing any extra software.

boto3 put_dashboard() call

Like many software engineers, I am usually suspicious of slick presentations, and want to see more behind-the-scenes detail.  So let’s look at what the underlying put_dashboard() call in the Boto documentation actually requires.
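It boils down to two parameters, DashboardName and DashboardBody.  Here is a minimal sketch (the dashboard name and the placeholder text widget are mine, not from the docs or the repo):

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# The entire request: a name, plus the dashboard body serialized to a string.
# A single "text" widget keeps this sketch valid without graphing anything yet.
body = {
    "widgets": [
        {
            "type": "text",
            "x": 0, "y": 0, "width": 24, "height": 2,
            "properties": {"markdown": "# Placeholder dashboard"},
        }
    ]
}

response = cloudwatch.put_dashboard(
    DashboardName="my-sample-dashboard",
    DashboardBody=json.dumps(body),
)

# Non-fatal problems with the body are reported here rather than raised.
print(response.get("DashboardValidationMessages", []))
```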

That’s it.  The DashboardBody is a JSON object, converted to a string.  This JSON object is where all the fun is, and AWS has full documentation for it.  Since AWS is all about the API, you can do the same thing from the AWS CLI program, or the AWS SDK for other languages like Go, Node or Java.  We’ll come back to this foundational API in a later section, “Programmatic Creation of a Dashboard”.

Each Dashboard is composed of an array of widgets, each with an (x,y) coordinate pair, and a height/width pair.  The width of a dashboard is divided into 24 equally-sized units. This means you can easily go for a 2-column grid (2×12), a 3-column grid (3×8), a 4-column (4×6), or any other variation.  The combined height of the widgets can be as long as you need, because the console will scroll vertically as needed.

In this example, we’re going to go for a simple grid of identically-sized graphs with a single static/markup tile.

A single tile is represented by a single JSON object in an array called “widgets”.
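For illustration, here is what a single metric-graph tile might look like, rendered as a Python dict for readability (the bucket name, region, and title are placeholders):

```python
# One tile: a position and size on the 24-unit-wide grid, plus a
# "properties" object describing what to graph.
widget = {
    "type": "metric",
    "x": 0,
    "y": 0,
    "width": 8,
    "height": 6,
    "properties": {
        "view": "timeSeries",
        "stacked": False,
        "region": "us-east-1",
        "period": 86400,
        "title": "S3 bucket size",
        "metrics": [
            ["AWS/S3", "BucketSizeBytes", "StorageType", "StandardStorage",
             "BucketName", "example-bucket-1"],
        ],
    },
}
```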

With each tile, you specify its position within this 24-wide, unlimited-height grid, and its size.  Obviously, you need to take care that the widgets do not overlap.  The console is a big help with that, but when you generate the x/y/width/height values programmatically, avoiding overlaps is up to you.
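When you generate those values yourself, a little arithmetic is enough to keep a fixed grid tidy.  A small illustrative helper (not from the repo), here laying out a 3-column grid of 8×6 tiles:

```python
def grid_position(index, columns=3, width=8, height=6):
    """Return the x/y/width/height of the index-th tile in a fixed grid.

    Three columns of width 8 exactly fill the 24-unit-wide dashboard canvas.
    """
    return {
        "x": (index % columns) * width,
        "y": (index // columns) * height,
        "width": width,
        "height": height,
    }

# Tile 0 lands at (0, 0); tile 3 wraps to the second row at (0, 6); and so on.
positions = [grid_position(i) for i in range(6)]
```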

The Properties object is where you get to specify which kind of graph, which service, which metrics, and which resources to graph.  The Terraform documentation for a CloudWatch dashboard also has a good example.

Preview of our destination

What we’re going to produce in this article are large, service-appropriate, and repetitive JSON dashboard bodies that include dozens of metrics per resource (far too long to embed as a gist here).  And we want to regenerate these dashboards each time a resource gets created, deleted, or tagged.

Definitely a job for automation, wouldn’t you say?  Yes, but back to baby steps first.

Iteratively Creating A Dashboard

Fortunately, the CloudWatch Dashboard console lets you interactively compose a dashboard, and then shows you the JSON body behind what you created.  This is similar to the way the IAM console lets you compose policy documents and then gives you a small JSON snippet that you can paste into your CloudFormation templates, Terraform configurations, or boto3 programs.

To illustrate, go to the AWS CloudWatch console, click on the Dashboards menu on the left, and then click the blue “Create dashboard” button.

Give the new dashboard a temporary name.

Choose a graph type (line chart), and click on Continue.

Then choose a service name, such as S3.

Then choose a metric group (Storage Metrics).

 

Then choose any metric, any resources (preferably more than one), and then click on the Source tab.

And the Source tab reveals what the `properties` object will look like.  The “.” values you see in the second element of the “metrics” array just mean “repeat what is in the previous element”. This shorthand keeps the size of put_dashboard() API requests down when specifying a large number of resources.
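For example, graphing two buckets on one widget produces a metrics array roughly like this (bucket names are placeholders):

```python
properties = {
    "metrics": [
        ["AWS/S3", "BucketSizeBytes", "StorageType", "StandardStorage",
         "BucketName", "example-bucket-1"],
        # "." repeats the value in the same position from the row above,
        # so only the changed dimension value (the bucket name) is spelled out.
        [".", ".", ".", ".", ".", "example-bucket-2"],
    ],
}
```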

You can repeat the above steps as many times as needed to add more widgets.

When you’re done creating widgets, you can view the JSON required for all the widgets you have on the dashboard:

 

In this window, you’ll see all of your widgets together as an array.  This represents the entirety of the JSON payload that goes into the put_dashboard() API call.

Programmatic Creation of a Dashboard

So now that we understand the JSON argument to put_dashboard(), let’s examine the Python logic that applies the metrics of your choice to all of the resources matching the naming criteria, using tag-driven arbitrary grouping.

Repository Overview

The contents of the GitHub repo are arranged as follows.
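Roughly, based on the files referenced throughout this article (see the repo itself for the authoritative layout):

  • deploy.sh and destroy.sh – scripts that create and remove the dashboard-updater Lambda
  • the Python sources – update_dashboards.py, the service-specific helper modules, and cloudwatch_dashboards/cloudwatch_dashboards.py
  • tests/ – sample CloudWatch events, such as bucket_put_event.json
  • terraform/ – the Lambda function, its permissions, and the CloudWatch Event rules (the cloudwatch_*.tf files)
  • terraform-* directories – small sample stacks used later in the article to exercise the Lambda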

Keeping your dashboards up to date in real-time

Hosting programmatic dashboard-updater code in Lambda

The overall solution diagram is rather simple:
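In words: a supported resource is created, deleted, or re-tagged; a CloudWatch Event rule matches that API call and invokes the update_dashboards Lambda; the Lambda re-inventories that resource type, groups the resources by tag, and rewrites (or deletes) the affected dashboards.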

CloudWatch Events to trigger the Lambda

In order to keep your dashboards up to date, the updater needs to be notified each time a supported resource is created, deleted, or has its tags updated or removed.  CloudWatch Events does exactly that, and the code to launch the Lambda into AWS, with all of the CloudWatch Event triggers attached to it, is in the “./terraform” directory.
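As an illustration of what one of those rules expresses, here is a boto3 sketch of an equivalent rule for S3 bucket events (the rule name and the exact event-name list are illustrative; the repo defines the real rules in Terraform):

```python
import json
import boto3

events = boto3.client("events")

# Match the CloudTrail-recorded API calls that create, delete, or re-tag
# S3 buckets. A separate put_targets() call would point the rule at the
# updater Lambda; the repo wires all of this up in terraform/cloudwatch_*.tf.
events.put_rule(
    Name="s3-bucket-changes-for-dashboards",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["CreateBucket", "DeleteBucket",
                          "PutBucketTagging", "DeleteBucketTagging"],
        },
    }),
)
```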

 

The five resource types supported by this sample Lambda are:

  • S3 buckets
  • Lambda functions
  • Rest APIs (from API Gateway)
  • DynamoDB Tables
  • CloudFront distributions

Understanding the Input: CloudWatch Event
In the repo’s tests directory, there are some sample CloudWatch events; tests/bucket_put_event.json is a good example.
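The literal file is in the repo; in rough shape, the important parts of such an event look like this (an approximation with a placeholder bucket name, not the file’s exact contents):

```python
# Approximate shape of a CloudTrail-sourced S3 "CreateBucket" event as the
# Lambda receives it; the real tests/bucket_put_event.json carries more fields.
sample_event = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "region": "us-east-1",
    "detail": {
        "eventSource": "s3.amazonaws.com",
        "eventName": "CreateBucket",
        "requestParameters": {"bucketName": "example-bucket-1"},
    },
}
```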

Because the rules in the terraform/cloudwatch_*.tf files only forward the specific event types for creating, deleting, and tagging these resources, the Lambda does not have to inspect much of the event.  In fact, because of the arbitrary grouping by the CloudwatchDashboardBasenames tag, we don’t even need to pay attention to the specific resource name; all we really need to know is which resource type is affected.  For example, if we know an S3 bucket got created, we retrieve a list of all the buckets, group them by that tag’s value, and call put_dashboard() for each group of buckets.
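A heavily simplified sketch of that flow for S3 (the real helper also applies the {prefix}-{env} naming filter and graphs many more metrics, but the grouping idea is the same):

```python
import json
from collections import defaultdict

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

GROUPING_TAG = "CloudwatchDashboardBasenames"

def buckets_by_basename():
    """Group every tagged bucket under its dashboard basename."""
    groups = defaultdict(list)
    for bucket in s3.list_buckets()["Buckets"]:
        try:
            tags = s3.get_bucket_tagging(Bucket=bucket["Name"])["TagSet"]
        except ClientError:
            continue  # untagged buckets are simply skipped
        for tag in tags:
            if tag["Key"] == GROUPING_TAG:
                groups[tag["Value"]].append(bucket["Name"])
    return groups

# One dashboard per basename, with a single widget graphing every bucket
# in that group (the repo's helpers emit much richer widget lists).
for basename, bucket_names in buckets_by_basename().items():
    metrics = [["AWS/S3", "BucketSizeBytes", "StorageType", "StandardStorage",
                "BucketName", name] for name in bucket_names]
    body = {"widgets": [{
        "type": "metric", "x": 0, "y": 0, "width": 24, "height": 6,
        "properties": {"view": "timeSeries", "region": "us-east-1",
                       "period": 86400, "title": basename, "metrics": metrics},
    }]}
    cloudwatch.put_dashboard(DashboardName=basename,
                             DashboardBody=json.dumps(body))
```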

The same logic applies for the other four resource types.

Layout of the update_dashboards() Lambda:

Lambda Code Walkthrough

The main logic is in update_dashboards.py.  It imports a set of service-specific `get_dashboard_<service>` functions from helper libraries, maps each one to a CloudWatch service name, and invokes the correct helper for the incoming event (a condensed sketch follows the list below).  The helper functions identify the resources of that type that match the {prefix}-{env} naming conventions, and generate the dashboard bodies for all matching resources, grouped by the CloudwatchDashboardBasenames tag.  There are several reasons why there is one helper per service:

  • The method and parameter names of the various boto3 clients differ greatly (list, get, tagging).
  • The property names in those boto calls’ responses are similarly inconsistent.
  • Each helper function has an array of metrics that apply to that service.
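In shape (not the literal repo code), that dispatch looks something like the following; the helper name follows the `get_dashboard_<service>` pattern described above and is stubbed out here:

```python
def get_dashboard_s3():
    """Stub standing in for the repo's S3 helper: the real one lists the
    tagged buckets and returns one dashboard body per tag group."""
    return {}

# One entry per supported service; the other four helpers look analogous.
HELPERS = {
    "aws.s3": get_dashboard_s3,
    # "aws.lambda", "aws.apigateway", "aws.dynamodb", "aws.cloudfront" ...
}

def dashboards_for_event(event):
    """Pick the helper for the service that fired the event and let it
    compute every affected dashboard body."""
    return HELPERS[event["source"]]()
```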

The above is the read-only part of the logic.

Fortunately, once we have those JSON bodies, the put_dashboard() and delete_dashboards() calls are identical, regardless of service.

The Lambda even cleans up dashboards that no longer have any resources to track (because of a tag change or resource deletion).
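A sketch of that write-or-clean-up decision (illustrative, not the repo’s exact code; delete_dashboards() is the actual boto3 call, and it takes a list of names):

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

def write_or_remove_dashboard(name, body):
    """Publish the dashboard if it still has widgets; otherwise retire it."""
    if body.get("widgets"):
        cloudwatch.put_dashboard(DashboardName=name,
                                 DashboardBody=json.dumps(body))
    else:
        # No matching resources remain (tag removed or resources deleted),
        # so drop the now-empty dashboard instead of leaving it behind.
        cloudwatch.delete_dashboards(DashboardNames=[name])
```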

Please refer to the helper function source files if you like.   These are the files you want to modify if you want to change the metrics, time periods, layout or graph types.

Deploying the update_dashboards() Lambda

Prerequisites:

  • Terraform installed (at least 0.12.24)
  • A Linux-like execution environment with Python 3.7 or 3.8 installed
  • An active AWS account, with permissions to create roles/policies, Lambda functions and permissions, CloudWatch Event rules, etc.

 

Run ./deploy.sh

 

Testing the update_dashboards() Lambda

Navigate to the CloudWatch Dashboards page in the AWS Console.

 

For each of the terraform-* directories (one at a time),

  • Change to that directory
  • Inspect the resources and tags in those Terraform files, and tweak them if desired
    (e.g. add or remove resources and/or change the values for the CloudwatchDashboardBasenames tags)
  • Run: terraform apply -auto-approve
  • Refresh the CloudWatch Dashboards page, and note the Dashboards that potentially track multiple resources.
  • When you’re done with a terraform-* directory, you can destroy those test resources just as easily: terraform destroy -auto-approve

 

Undeploying the update_dashboards() Lambda

When you’re finished playing with the terraform-* sample directories, you can undeploy the update_dashboards Lambda itself:

Run ./destroy.sh

So, that’s the idea, the implementation, and the instructions.

I hope you learned something and are inspired to build upon these ideas.

But before you go, I have two fun tips to share: one about Terraform, and the other about Python.  I learned them both while writing this article, and as we say at Stelligent: “Sharing is Caring”.


Extra Terraform Tip: One-shot Lambda deployment

AWS CloudFormation famously cannot access your local filesystem, as it runs entirely on AWS’s servers.  The AWS CLI only helps by sending your template files and parameter overrides via the boto3 library; it doesn’t send your Lambda ZIP files, which have to be copied to an already-created S3 bucket first.  What we need, but CloudFormation cannot do on its own, is this:

  1. Create an S3 bucket
  2. Upload Lambda ZIP file to bucket
  3. Deploy Lambda ZIP into AWS Lambda service

However, because Terraform runs on your system, it can read your local ZIP file. In fact, it can create your ZIP file too, via the archive_file data source. Even better, the aws_lambda_function resource in Terraform takes a filename argument directly, so you don’t even need to create an S3 bucket explicitly. The ./terraform directory in the accompanying repository is an example of this.

Extra Python Tip: ZIP packaging for multiple entry points

I randomly stumbled across this:

Python itself provides this feature (packaged up as the `zipapp` module in the standard library), but the same effect can easily be achieved with any program that can create ZIP files, such as the archive_file data source provided by the core Terraform product (not the AWS provider).

To speed up my dev cycle, I decided to use this feature to provide a CLI interface to the usual lambda_handler() function.

Java developers will recognize this idea, of having a particular piece of code be the entrypoint of an archive file.  After all, jar files are just ZIP files.

So python3, when passed a ZIP filename, finds and runs the __main__.py file inside the archive.
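The repo’s actual __main__.py is not reproduced here; a minimal illustration of the idea looks like this (the import path, the function signature, and the ZIP name are placeholders, so adjust them to the real module layout):

```python
# __main__.py  (illustrative; the repo's version differs)
# Running "python3 update_dashboards.zip tests/bucket_put_event.json"
# exercises the same code path locally that Lambda invokes in AWS.
import json
import sys

# Placeholder import: point this at wherever update_dashboards() really lives.
from cloudwatch_dashboards.cloudwatch_dashboards import update_dashboards

if __name__ == "__main__":
    # Optionally pass a sample event file; otherwise send an empty event.
    event = json.load(open(sys.argv[1])) if len(sys.argv) > 1 else {}
    update_dashboards(event)
```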

From cloudwatch_dashboards/cloudwatch_dashboards.py

And of course, update_dashboards() is the main logic we covered near the middle of this article.

 
