Security Response and Remediation Automation on AWS

Consider the following questions when it comes to the typical approach to security error detection and remediation in many enterprises today:

How does security error detection occur?
When and how often does it occur?
Who is usually involved in fixing these security issues?
What is the approach to fixing these issues?
How long does it usually take to fix the security issues?
What if security issues are never discovered or acted upon?

In many enterprises, security detection is performed through passive logging and monitoring. That is, it relies on humans to notice alerts and then act on them. These might be alerts for things like unencrypted data, a user created without MFA enabled, a security service that was never turned on, or other noncompliant resources.

While detection can happen in near real time, that doesn’t mean an engineer notices the problem immediately, and it says nothing about if or when someone will intervene to remedy it. This is because these systems often rely on engineers to notice the alerts and make fixing the errors a priority. These alerts (which are often aggregated into a Security Information and Event Management (SIEM) system) might first be flagged by members of the Security Operations Center (SOC) or other team members. After an alert is seen, it might get prioritized into a queue through a ticketing system to be resolved.

The remediation is often managed through these same ticketing systems. Engineers will manually fix the issue or apply a code fix after the detection and resolve the ticket. Many times these are “one-time fixes” because they aren’t codified and/or made part of the canonical environment to prevent the issue from happening again.

In most organizations, it can take hours, days, or more to detect a security anomaly. What’s more, it might take days or even weeks to fix the problem. Even worse, the problem might never be detected or fixed at all!

In this blog post, I’ll be describing an approach to solving many of the problems I just presented. The solution leverages event-driven architectures for security response automation at scale.

DevSecOps

Before getting into event-driven architectures for security response automation, consider a traditional development process in which you build, test, and release software to customers. On these traditional teams, getting feedback from customers is usually a slow and arduous process. This is often the result of organizational, process, cultural, and tooling barriers that throttle this feedback. This isn’t necessarily intentional; it is more the result of organizational inertia that has built up over time, with no focus on speeding up effective feedback between customers and developers.

When you apply DevOps practices effectively, you compress the time it takes developers to get this feedback from customers while increasing quality. You do this by breaking down organizational silos, treating everything as code, and creating fully automated workflows that build, test, deploy, and release software to production whenever there’s a business need to do so. By getting regular, effective feedback from customers, developers are more likely to build features the customers want.

Figure 1 – DevSecOps is about accelerating feedback loops in the safest manner

How fast you’re able to get through this feedback loop determines how responsive you can be to customers and how innovative you are. From your customer’s perspective, you are only delivering value when you’re spending time on developing high-quality features they want. Ultimately, DevOps is any organization, process, culture, or tooling changes that help speed up these effective feedback loops between customers and developers.

DevSecOps is about accelerating these feedback loops in the safest manner, one that reduces risk and increases quality and confidence. Keep in mind that the speed of quality feedback matters to building better and more secure products for customers.

Event-Driven Architectures

There are three parts to an event-driven architecture: an event producer, an event router, and an event consumer. In the context of a security response, there may be many producers of security events including logs, threat analysis, inspections, configuration changes, and more. These producers emit events that are often described in JSON objects. What could be thousands or even millions of events get asynchronously pushed to an event router. This router ingests, filters, and routes events to the appropriate consumers. The events received from the event router trigger actions to occur such as running code that automatically fixes noncompliant resources or begins a remediation workflow. An example of this architecture and related AWS services is shown in Figure 2.
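To make the three roles concrete, here is a minimal in-memory sketch of a producer, a router, and a consumer. It is an illustration only, not how EventBridge is implemented; the `EventRouter` class, the `remediate` consumer, and the event fields (`resource`, `compliance`) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass
class Route:
    predicate: Callable[[Event], bool]  # filter: should this consumer see the event?
    consumer: Callable[[Event], None]   # action to run when the filter matches


@dataclass
class EventRouter:
    """Hypothetical router: ingests events, filters them, and forwards matches."""
    routes: List[Route] = field(default_factory=list)

    def add_route(self, predicate: Callable[[Event], bool],
                  consumer: Callable[[Event], None]) -> None:
        self.routes.append(Route(predicate, consumer))

    def publish(self, event: Event) -> int:
        """Deliver the event to every matching consumer; return the delivery count."""
        delivered = 0
        for route in self.routes:
            if route.predicate(event):
                route.consumer(event)
                delivered += 1
        return delivered


remediated: List[str] = []


def remediate(event: Event) -> None:
    # Consumer: a real one would fix the resource; this one only records it.
    remediated.append(str(event["resource"]))


router = EventRouter()
router.add_route(lambda e: e.get("compliance") == "NON_COMPLIANT", remediate)

# Producer side: two events are emitted, only one is noncompliant.
router.publish({"resource": "bucket-a", "compliance": "COMPLIANT"})
router.publish({"resource": "bucket-b", "compliance": "NON_COMPLIANT"})
print(remediated)  # ['bucket-b']
```

The point of the separation is that producers never need to know which consumers exist; adding a new remediation is a matter of registering another route.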

Figure 2 – Event-Driven Architecture for Security

Event-Driven Security Detection and Remediation on AWS

In the context of security response automation on AWS, a security event is triggered by a change. For example, AWS Config detects a configuration change and runs AWS Config Rules. These config rule events are passed to an event router like Amazon EventBridge Rules. EventBridge pushes the event to a responsive control such as AWS Lambda, AWS Systems Manager, or AWS Step Functions. These responsive controls perform the automated remediation to fix the problem. In this scenario, you can go from detection to remediation in minutes.

AWS Config produces events based on configuration state changes. With Config, you can see changes to resources and their relationships with other AWS resources. You can write Config Rules that compare the current resource state to the desired state. Figure 3 shows a noncompliant Config rule detected in the AWS Config console. These Config rules can integrate with other services for alerting and remediation.
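The core idea of a rule, comparing current state to desired state, can be sketched in a few lines. This is a simplified illustration rather than a real Config rule: the `evaluate_compliance` function and the `encryption_enabled` field are assumptions made for the example, while `COMPLIANT` and `NON_COMPLIANT` are the compliance values Config itself reports.

```python
from typing import Any, Dict


def evaluate_compliance(current: Dict[str, Any], desired: Dict[str, Any]) -> str:
    """Return NON_COMPLIANT if any desired setting differs from the current state."""
    for key, expected in desired.items():
        if current.get(key) != expected:
            return "NON_COMPLIANT"
    return "COMPLIANT"


# Hypothetical desired state: every bucket must have encryption enabled.
desired_state = {"encryption_enabled": True}

unencrypted = evaluate_compliance(
    {"name": "logs", "encryption_enabled": False}, desired_state
)
encrypted = evaluate_compliance(
    {"name": "data", "encryption_enabled": True}, desired_state
)
print(unencrypted, encrypted)  # NON_COMPLIANT COMPLIANT
```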

Figure 3 – Noncompliant resource in AWS Config Console

Amazon EventBridge provides real-time access to data changes in AWS services and other sources. You choose an event source, define an event pattern to match, and select the targets to run when an event matches. EventBridge integrates with many AWS services: at the time of writing, it can analyze more than 40 incoming event types, run up to five targets per event rule, and offers 15 target types to choose from.

For example, imagine someone is creating a new S3 bucket without enabling encryption. Config Rules runs code in a Lambda function to determine if this S3 resource is noncompliant. If it’s noncompliant, it produces noncompliant events. EventBridge Rules filter on noncompliant Config Rules; in this case, s3-bucket-server-side-encryption-enabled. When it discovers this noncompliant rule, EventBridge pushes the event to a Lambda function target that automatically remediates the noncompliant resource. Figure 4 shows an example workflow and code snippets.
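As a rough sketch of the filtering step, the event pattern below matches compliance change events from AWS Config for this rule. The `matches` function is my own simplified stand-in for EventBridge's matching logic (exact values only, none of the more advanced operators), and the sample event is trimmed to a few fields with a made-up bucket name.

```python
from typing import Any, Dict

# Pattern for an EventBridge rule: each leaf lists the values that are accepted.
EVENT_PATTERN: Dict[str, Any] = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {
        "configRuleName": ["s3-bucket-server-side-encryption-enabled"],
        "newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]},
    },
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: nested dicts recurse, leaf lists mean 'any of these'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Trimmed sample event; real events carry more fields than shown here.
sample_event = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "s3-bucket-server-side-encryption-enabled",
        "resourceType": "AWS::S3::Bucket",
        "resourceId": "example-bucket",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

print(matches(EVENT_PATTERN, sample_event))  # True
```

Because the pattern names both the rule and the compliance type, compliant evaluations and events from other rules never reach the remediation target.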

Figure 4 – Security Response Automation Example

With AWS Lambda, you create functions that run your code in response to events. Lambda automatically manages the underlying compute resources (i.e., servers and their configuration). In this example, Lambda is used to detect whether a new S3 bucket has encryption enabled via Config Rules; if not, a Lambda function is run to automatically enable encryption on this S3 bucket.
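A remediation function for this case might look something like the sketch below. It assumes the event arrives from EventBridge with the bucket name in `detail.resourceId`, and it enables default SSE-S3 (AES256) encryption through the boto3 `put_bucket_encryption` call. The optional `s3_client` parameter and the `RecordingS3Client` stub are additions of mine so the example can run without AWS credentials; they are not part of any AWS API.

```python
from typing import Any, Dict, List, Optional


def build_encryption_config() -> Dict[str, Any]:
    """Default encryption settings: SSE-S3 with AES256 (an assumed choice)."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }


def lambda_handler(event: Dict[str, Any], context: Any,
                   s3_client: Optional[Any] = None) -> Dict[str, Any]:
    detail = event["detail"]
    if detail.get("resourceType") != "AWS::S3::Bucket":
        return {"remediated": False, "reason": "not an S3 bucket"}
    if s3_client is None:
        import boto3  # imported lazily so the sketch loads without the AWS SDK
        s3_client = boto3.client("s3")
    bucket = detail["resourceId"]
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=build_encryption_config(),
    )
    return {"remediated": True, "bucket": bucket}


class RecordingS3Client:
    """Stand-in for the real S3 client; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def put_bucket_encryption(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


# Local demonstration with a made-up bucket name.
fake = RecordingS3Client()
sample_event = {
    "detail": {"resourceType": "AWS::S3::Bucket", "resourceId": "example-bucket"}
}
result = lambda_handler(sample_event, None, s3_client=fake)
print(result)  # {'remediated': True, 'bucket': 'example-bucket'}
```

In a deployed function you would drop the stub and the demonstration lines, and the function’s execution role would need permission to change bucket encryption.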

Other Target Type Examples

AWS Systems Manager provides visibility and control of your infrastructure on AWS. In the context of security response automation, you can write automated actions using Systems Manager Documents that remediate security deviations.

AWS Step Functions is a serverless function orchestrator that sequences AWS Lambda functions, multiple AWS services, and manual actions into an integrated workflow. You can think of Step Functions as a “State Machine as a Service”. For example, you can use Step Functions to track and resolve a security incident by combining automated actions with manual ones, such as someone updating a security ticket, writing some code, or changing configuration. With Step Functions, you can define an integrated workflow that runs until the security incident is resolved.

For example, in a security response automation workflow, AWS Config is producing events based on configuration state changes. Config Rules runs code in a Lambda function to determine if a resource is noncompliant. If it’s noncompliant, it produces noncompliant events. EventBridge Rules filter on specified noncompliant Config Rules. If it discovers a noncompliant rule, it runs a Step Functions workflow that remediates an IAM Policy and, separately, allows an engineer to undo the change. Figure 5 shows a visualization of a Step Functions workflow for this particular security use case.
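A workflow along these lines could be expressed in Amazon States Language roughly as follows, written here as a Python dictionary that serializes to the JSON definition. This is a sketch, not the workflow behind Figure 5: the state names, the `$.undo` input field, and the Lambda ARNs (with `REGION` and `ACCOUNT_ID` left as placeholders) are all hypothetical. The small helper checks that every transition points at a state that exists.

```python
import json
from typing import Any, Dict, Set

STATE_MACHINE: Dict[str, Any] = {
    "Comment": "Hypothetical IAM policy remediation with an undo path",
    "StartAt": "RemediatePolicy",
    "States": {
        "RemediatePolicy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:remediate-policy",
            "Next": "AskEngineer",
        },
        "AskEngineer": {
            # Placeholder task that records whether the engineer wants to undo.
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ask-engineer",
            "Next": "UndoRequested",
        },
        "UndoRequested": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.undo", "BooleanEquals": True, "Next": "RestorePolicy"}
            ],
            "Default": "Done",
        },
        "RestorePolicy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:restore-policy",
            "End": True,
        },
        "Done": {"Type": "Succeed"},
    },
}


def referenced_states(definition: Dict[str, Any]) -> Set[str]:
    """Collect every state name used as a transition target."""
    targets = {definition["StartAt"]}
    for state in definition["States"].values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets


missing = referenced_states(STATE_MACHINE) - set(STATE_MACHINE["States"])
print(sorted(missing))  # []
definition_json = json.dumps(STATE_MACHINE, indent=2)
```

Keeping the definition as data makes it easy to lint in a pipeline before the state machine is ever deployed.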

Figure 5 – Step Functions Security Remediation Workflow

Continuous Security on AWS

Continuous Delivery enables teams to release recent changes whenever there’s a business need to do so. With Continuous Delivery, the build, test, deploy, and release processes are automated for any application, configuration, infrastructure, data, or other changes. Continuous Security applies this Continuous Delivery approach to security tests, analysis, and the provisioning of security services as well.

AWS CloudFormation can be used to provision all of the deployment pipeline resources as code. This includes AWS CodePipeline and AWS CodeBuild. AWS CodePipeline is a managed service for orchestrating release workflows for continuous delivery. AWS CodeBuild is a managed service for running build and test actions. An example CodePipeline deployment pipeline is shown in Figure 6.

As part of the deployment pipeline, it launches a separate CloudFormation stack that provisions all of the services that make up the security response automation solution. Since it’s defined in a CloudFormation template, it’s code that can be versioned, tested, and so forth.

In the context of event-driven security, this pipeline would provision the AWS security and related services such as EventBridge, Config Rules, and the Lambda functions that run the detection and remediation.
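To give a feel for what such a stack might contain, the sketch below builds a minimal CloudFormation template in Python with an EventBridge rule and the permission that lets it invoke a remediation function. The logical IDs (`NoncompliantRule`, `AllowEventBridgeInvoke`), the `RemediationFunctionArn` parameter, and the `build_template` helper are names I made up; the function itself is assumed to be defined elsewhere.

```python
import json
from typing import Any, Dict


def build_template(config_rule_name: str) -> Dict[str, Any]:
    """Return a minimal template that routes noncompliant events to a Lambda function."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: route noncompliant Config events to a remediation function",
        "Parameters": {"RemediationFunctionArn": {"Type": "String"}},
        "Resources": {
            "NoncompliantRule": {
                "Type": "AWS::Events::Rule",
                "Properties": {
                    "State": "ENABLED",
                    "EventPattern": {
                        "source": ["aws.config"],
                        "detail-type": ["Config Rules Compliance Change"],
                        "detail": {
                            "configRuleName": [config_rule_name],
                            "newEvaluationResult": {
                                "complianceType": ["NON_COMPLIANT"]
                            },
                        },
                    },
                    "Targets": [
                        {"Arn": {"Ref": "RemediationFunctionArn"},
                         "Id": "RemediationFunction"}
                    ],
                },
            },
            "AllowEventBridgeInvoke": {
                "Type": "AWS::Lambda::Permission",
                "Properties": {
                    "Action": "lambda:InvokeFunction",
                    "FunctionName": {"Ref": "RemediationFunctionArn"},
                    "Principal": "events.amazonaws.com",
                    "SourceArn": {"Fn::GetAtt": ["NoncompliantRule", "Arn"]},
                },
            },
        },
    }


template = build_template("s3-bucket-server-side-encryption-enabled")
template_json = json.dumps(template, indent=2)
print(sorted(template["Resources"]))  # ['AllowEventBridgeInvoke', 'NoncompliantRule']
```

Generating the template from code is only one option; a hand-written YAML or JSON template checked into the same repository serves the same purpose.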

Once defined, CodePipeline gets the latest changes from your source control repository and runs a build that builds the Lambda functions. It also runs preventative checks against your CloudFormation templates, such as whether you have secrets in your code. All of these checks can run before launching the infrastructure itself. You could use a pipeline like this to provision other security services such as Amazon GuardDuty, Amazon Macie, AWS Config, AWS Systems Manager, or AWS Step Functions. You can define all of that as code, deploy it as part of your pipeline, and then run production tests and canaries once the services have been deployed.
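A preventative check for secrets can be as simple as a pattern scan that fails the build when it finds something suspicious. The sketch below is deliberately naive and is no substitute for a dedicated scanning tool: the two patterns, the `scan_template` function, and the sample YAML template are my own illustrations. The access key shown is the well-known example key from AWS documentation, not a real credential.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Illustrative patterns only; real scanners cover far more cases.
SECRET_PATTERNS: Dict[str, Pattern[str]] = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # Flags YAML keys ending in "password" with a literal value (not !Ref or {{...}}).
    "hardcoded_password": re.compile(r"(?i)password\s*:\s*['\"]?[^\s'\"{!]+"),
}


def scan_template(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings: List[Tuple[int, str]] = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


TEMPLATE = """\
Resources:
  Db:
    Type: AWS::RDS::DBInstance
    Properties:
      MasterUserPassword: hunter2
      Tags:
        - Key: note
          Value: AKIAIOSFODNN7EXAMPLE
"""

findings = scan_template(TEMPLATE)
print(findings)  # [(5, 'hardcoded_password'), (8, 'aws_access_key_id')]
```

In a CodeBuild action, a non-empty findings list would translate into a non-zero exit code so the pipeline stops before any infrastructure is launched.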

Figure 6 – Deployment Pipeline for Security Resource in AWS CodePipeline

Event-Driven Security Cultural Practices

You don’t want to incorporate this type of automation without having conversations across different teams in your organization. This is especially true in enterprises. When done well, everyone has a part to play in applying security as code or event-based security – at scale. Below, I’ve listed some key practices to consider when applying security as code.

Codify All The Things Including Security – This provides engineers the ability to recreate systems and quickly fix problems.
Commit Often to a Single Source of Truth – Engineers can recreate systems from a single command, which leads to less complex troubleshooting.
Single Path to Production – Reduce the risks of insecure code. Get feedback early in the development lifecycle. Reduce assumptions. Get regular feedback from end users.
Stop the Line – When a failure occurs, the top priority becomes fixing the error. Reduce complexity and the cost of failures. Prevent bottlenecks.
Security is Everyone’s Responsibility – Communicate practices across teams. Enable application teams to apply security while ensuring governance across an enterprise.
Security Tests in Pipelines – Identify security issues early in the development process.
Least Privilege – Provide the minimum permissions required to perform a function.
Defense in Depth – Apply multiple layers of security controls as code.
Visible Feedback – Make faster, better decisions.
Self-Service Everything – Reduce engineer bottlenecks and frustration.

Summary

I discussed how, in many enterprises, security detection and response requires human intervention. This stretches the time between detection and remediation to days or weeks, and sometimes remediation never happens at all. With security response automation, you can leverage event-driven architectures to quickly detect security events and fix them with code. You saw examples of services on the AWS platform that are used as security event producers, routers, and consumers. AWS IAM, S3, and AWS Config Rules are examples of event producers. Amazon EventBridge Rules acts as a security event router, and AWS Lambda, Systems Manager, and Step Functions consume these filtered security events in order to perform automated remediations.
