Stelligent

Abort, Rollback…Retry? Upcoming updates to CloudFormation


Upcoming updates to CloudFormation to enable faster and more productive development. 


CloudFormation templates are incredibly expressive, providing the ability to automate resource creation and configuration for many AWS services and to create custom resources that accomplish almost any task. CloudFormation, by default, treats a stack as a single unit while it is being created. If any portion of a stack fails to be created, the entire stack must be brought down before another attempt at creating the resources can be made. Since stack deletion can take some time, it's not hard to spend more time waiting on rollbacks than it took to write the initial template or stack changes. Many engineers, including Stelligent engineers, found it useful to create an empty stack first so that when resources failed, CloudFormation rolled back to the empty stack rather than deleting the entire stack. See our blog post on that topic here.

Recognizing the issue, AWS has just introduced a change in how failed resource provisioning is handled in the AWS CloudFormation Console and API, and we got to help them test it. This new behavior provides greater flexibility in dealing with failures for both new stack creation and stack updates. You can now choose to retry or update without having to tear down any successfully created resources. This new approach will make template development more efficient, productive, and rewarding. As for Stelligent, we love it.

What tools are available to fix stack creation failures now?

To get a really clear understanding of this new approach, let's look at the changes using the AWS CloudFormation Console. With the old behavior, the only option for handling stack creation failures was in the Advanced options menu (Figure 1), where you could only choose whether or not to roll the stack back in the case of a failure. The choice was all or nothing, either enabled or disabled.

Figure 1. Previous Rollback on failure options

To emphasize the impact of rolling back on stack creation failure, let's look at an example. In this example, we'll use a fairly simple template that provisions an S3 bucket, a Lambda function, and a log group along with the necessary permissions and configurations. The template contains an error that causes stack creation to fail, producing the Console Events output shown in Figure 2.

Figure 2. Console results of a failed stack creation

Reviewing the events reveals that three resources were successfully created and provisioned before the error was encountered. However, because of the previous rollback behavior, all resources are deleted once a stack creation fails. In this case, we had to wait for the resources to be deleted before we could start to troubleshoot, and when we trigger the creation again, we'll have to wait for the exact same resources to be created all over again. The cycle of create, fail, delete will continue until the template is successfully debugged.

What is changing?

So, what’s different with the new functionality around failed stack creation and resource provisioning?

As already mentioned, this new approach to handling stack provisioning failures is quite different and powerful for the template developer, so let's take a closer look. First off, the new functionality has been moved to the main Configure stack options menu, in a new option box called Stack failure options (Figure 3).

Figure 3. New Stack Failure options

The two available choices are pretty self-explanatory. If you select the first option (the default), it behaves similarly to the previous default behavior of rolling back all stack resources if there is a failure. The second option is the new one, and it offers big benefits. With this one selected, any resources that were successfully created will remain provisioned. Did you catch that? No more waiting around while successfully provisioned resources are deleted and recreated. Keeping those provisioned resources allows the developer to focus on debugging only the failed resources, significantly reducing the debug cycle time.

As before, let’s take a look at what a stack failure looks like. We’ll use the same example stack as before and select the Preserve successfully provisioned resources option. After loading the example template and attempting to create a stack, we can easily see the differences in the Console. First off, you’ll notice a new dialogue stating that the Stack rollback is paused along with different options for us to take next (Figure 4).

Figure 4. Stack rollback paused notification with option buttons

Secondly, looking at the Events list (Figure 5), you can see that any successfully provisioned resources up until the error was encountered have not been deleted.

Figure 5. Console results of a failed stack creation with the new rollback behavior

At this point, we get to choose what action to pursue, and our choice can depend on the error encountered. For example, in the situation above where the stack creation failed due to an error in the template, we can debug the failure, make a change to the template, and then select Update to use the fixed template and continue with the stack creation. However, in another situation where resource creation failed because AWS temporarily did not have enough capacity, the better option could be to choose Retry with the existing template. With this new behavior, you now have new development approaches which save time and offer more flexibility.
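For readers who script their stacks instead of clicking through the Console, the two choices above could be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes that both Console actions correspond to an `update_stack` call with rollback disabled, with Update supplying a fixed template and Retry reusing the previous one. The helper names `continue_kwargs` and `continue_failed_stack` are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Optional


def continue_kwargs(stack_name: str,
                    fixed_template_body: Optional[str] = None) -> Dict[str, Any]:
    """Build update_stack arguments for continuing a failed stack.

    Assumption: "Update" maps to passing a corrected template, and
    "Retry" maps to reusing the template already attached to the stack.
    """
    kwargs: Dict[str, Any] = {
        "StackName": stack_name,
        # Keep successfully provisioned resources if this attempt fails too.
        "DisableRollback": True,
    }
    if fixed_template_body is None:
        kwargs["UsePreviousTemplate"] = True  # Retry with the existing template
    else:
        kwargs["TemplateBody"] = fixed_template_body  # Update with the fix
    return kwargs


def continue_failed_stack(stack_name: str,
                          fixed_template_body: Optional[str] = None,
                          client: Any = None) -> Any:
    """Call update_stack on a stack whose rollback is paused."""
    if client is None:
        import boto3  # imported lazily so the helpers above work without it
        client = boto3.client("cloudformation")
    return client.update_stack(**continue_kwargs(stack_name, fixed_template_body))
```

Calling `continue_failed_stack("my-stack")` would attempt a retry, while `continue_failed_stack("my-stack", fixed_template_body=new_template)` would continue with a corrected template. The stack name and template variable here are placeholders.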

Updates also coming to CLI, default behavior not changed

And what's even better? Not only does this new behavior apply to change sets, but it is also available via the AWS CLI and the CloudFormation API. When using the create-stack, update-stack, or execute-change-set commands, include the --disable-rollback parameter to preserve successfully provisioned resources on failure. There is also an additional command, rollback-stack, which rolls a failed stack back to its last stable state.
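As a rough illustration of the same options through the API, here is a minimal boto3 sketch. It assumes boto3's CloudFormation client, where the `DisableRollback` argument of `create_stack` plays the role of the --disable-rollback parameter and `rollback_stack` corresponds to the rollback-stack command. The helper names and the `client` argument are hypothetical conveniences, not part of any AWS tooling.

```python
from typing import Any, Dict


def create_stack_kwargs(stack_name: str, template_body: str,
                        preserve_on_failure: bool = True) -> Dict[str, Any]:
    """Build create_stack arguments.

    preserve_on_failure=True keeps successfully provisioned resources
    when creation fails; False keeps the default roll-back behavior.
    """
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "DisableRollback": preserve_on_failure,
    }


def _default_client() -> Any:
    import boto3  # imported lazily; requires AWS credentials at call time
    return boto3.client("cloudformation")


def create_stack(stack_name: str, template_body: str,
                 preserve_on_failure: bool = True, client: Any = None) -> Any:
    """Create a stack, optionally preserving resources on failure."""
    client = client or _default_client()
    return client.create_stack(
        **create_stack_kwargs(stack_name, template_body, preserve_on_failure)
    )


def rollback_stack(stack_name: str, client: Any = None) -> Any:
    """Give up on a failed stack and return it to its last stable state."""
    client = client or _default_client()
    return client.rollback_stack(StackName=stack_name)
```

A development loop might call `create_stack(...)` with the default `preserve_on_failure=True`, fix the template after a failure, and fall back to `rollback_stack(...)` only when abandoning the attempt. Templates that create IAM resources would additionally need the `Capabilities` argument, which this sketch omits.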

It is important to remember that the default behavior of rolling a stack back to its last known good state will not change. This is significant because it is the recommended configuration for production stacks, and many developers have configured their pipelines based on it. It is also a good fit for other environments where 'all or nothing' updates are needed. For updates that cannot be retried, other approaches, such as starting from an empty (null) stack, may still be helpful.

In conclusion, with the new Behavior on provisioning failure changes, dealing with failures when creating new stacks or updating existing stacks is simplified and much more efficient. Developers can focus on only the failed resources and no longer wait for successfully created resources to be torn down and then reprovisioned.  This means template development issues, ranging from small typos to permissions and resource limits, no longer result in long periods of switching context away from the task at hand. We expect that this much more consolidated workflow will enable all teams to accomplish more with their existing resources, and we can’t wait to be amazed by the next innovations AWS comes up with.

 

