Application Auto Scaling with Amazon ECS

In this blog post, you’ll see an example of Application Auto Scaling for Amazon ECS (EC2 Container Service). Automatic scaling of the container instances in your ECS cluster has been available for quite some time, but until recently you could not scale the tasks in your ECS service with built-in AWS functionality. In May of 2016, Automatic Scaling with Amazon ECS was announced, which allows us to configure elasticity into our deployed container services in Amazon’s cloud.

Developer Note: Jump to the “CloudFormation Examples” section to get right to the code!

Why should you auto scale your container services?

Efficient and effective scaling of your microservices is the reason to auto scale your containers. If your primary goals include fault tolerance or elastic workloads, then leveraging a combination of cloud autoscaling and infrastructure as code is the key to success. With AWS Application Auto Scaling, you can quickly configure elasticity into your architecture in a repeatable and testable way.

Introducing CloudFormation Support

For the first few months after this feature launched, it was not available in AWS CloudFormation. Configuration was either a manual process in the AWS Console or a series of API calls made from the CLI or one of Amazon’s SDKs. Since August of 2016, we can finally manage this configuration easily using CloudFormation.

The resource types you’re going to need to work with are:

  • AWS::ApplicationAutoScaling::ScalableTarget
  • AWS::ApplicationAutoScaling::ScalingPolicy
  • AWS::CloudWatch::Alarm
  • AWS::IAM::Role

The ScalableTarget and ScalingPolicy are the new resources that configure how your ECS Service behaves when an Alarm is triggered. In addition, you will need to create a new Role that gives the Application Auto Scaling service access to describe your CloudWatch Alarms and to modify your ECS Service, such as increasing your Desired Count.

CloudFormation Examples

The examples below were written for AWS CloudFormation in YAML format. You can plug these snippets directly into your existing templates with minimal adjustment. Enjoy!

Step 1: Implement a Role

These permissions were gathered from various sources in the AWS documentation.

ApplicationAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Statement:
      - Effect: Allow
        Principal:
          Service:
          - application-autoscaling.amazonaws.com
        Action:
        - sts:AssumeRole
    Path: "/"
    Policies:
    - PolicyName: ECSBlogScalingRole
      PolicyDocument:
        Statement:
        - Effect: Allow
          Action:
          - ecs:UpdateService
          - ecs:DescribeServices
          - application-autoscaling:*
          - cloudwatch:DescribeAlarms
          - cloudwatch:GetMetricStatistics
          Resource: "*"

Step 2: Implement some alarms

The below alarm will initiate scaling based on container CPU Utilization.

AutoScalingCPUAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Containers CPU Utilization High
    MetricName: CPUUtilization
    Namespace: AWS/ECS
    Statistic: Average
    Period: '300'
    EvaluationPeriods: '1'
    Threshold: '80'
    AlarmActions:
    - Ref: AutoScalingPolicy
    Dimensions:
    - Name: ServiceName
      Value:
        Fn::GetAtt:
        - YourECSServiceResource
        - Name
    - Name: ClusterName
      Value:
        Ref: YourECSClusterName
    ComparisonOperator: GreaterThanOrEqualToThreshold

Step 3: Implement the ScalableTarget

This resource attaches Application Auto Scaling to your ECS Service and sets limits on how far it can scale. Other than your MinCapacity and MaxCapacity, these settings are quite fixed when used with ECS.

AutoScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MaxCapacity: 20
    MinCapacity: 1
    ResourceId:
      Fn::Join:
      - "/"
      - - service
        - Ref: YourECSClusterName
        - Fn::GetAtt:
          - YourECSServiceResource
          - Name
    RoleARN:
      Fn::GetAtt:
      - ApplicationAutoScalingRole
      - Arn
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs

Step 4: Implement the ScalingPolicy

This resource defines your exact scaling behavior: when to scale up or down, and by how much. Pay close attention to the StepAdjustments in the StepScalingPolicyConfiguration, as the documentation on this is very vague.

In the below example, we are scaling up by 2 containers when the alarm is greater than the Metric Threshold and scaling down by 1 container when below the Metric Threshold. Take special note of how MetricIntervalLowerBound and MetricIntervalUpperBound work together. When unspecified, they are effectively infinity for the upper bound and negative infinity for the lower bound. Finally, note that these thresholds are computed based on aggregated metrics — meaning the Average, Minimum or Maximum of your combined fleet of containers.

AutoScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: ECSScalingBlogPolicy
    PolicyType: StepScaling
    ScalingTargetId:
      Ref: AutoScalingTarget
    ScalableDimension: ecs:service:DesiredCount
    ServiceNamespace: ecs
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      Cooldown: 60
      MetricAggregationType: Average
      StepAdjustments:
      - MetricIntervalLowerBound: 0
        ScalingAdjustment: 2
      - MetricIntervalUpperBound: 0
        ScalingAdjustment: -1

Wrapping It Up

Amazon Web Services continues to provide excellent resources for automation, elasticity and virtually unlimited scalability. As you can see, with a couple of solid examples in hand you can very quickly build in that on-demand elasticity and inherent fault tolerance. After you have your tasks auto scaling, I recommend you check out the documentation on how to scale your container instances as well, to provide the same benefits to your ECS cluster itself.

Deploying Microservices? Let mu help!

Support for ECS Application Auto Scaling is coming soon to Stelligent mu, the fastest and most comprehensive platform for deploying microservices as containers.

Want to learn more about mu from its creators? Check out the DevOps in AWS Radio podcast or find more posts on our blog.

Additional Resources

Here are some of the supporting resources discussed in this post.

We’re Hiring!

Like what you’ve read? Would you like to join a team on the cutting edge of DevOps and Amazon Web Services? We’re hiring talented engineers like you. Click here to visit our careers page.


Enforcing Compliance with AWS Organizations

You have a large organization with several development teams that work on various software projects that support your business. A year ago, you brought in a consultant who told you to use multiple AWS accounts because there were benefits to be gained. For example, using multiple accounts we can contain the damage from a possible security breach and isolate work by teams so that others don’t inadvertently disrupt that work. But there are also issues that we must deal with.

When a company has more than one AWS account, and especially many AWS accounts, it becomes difficult to manage those accounts. How do we know that all teams are using good security policies? How do we take advantage of billing incentives for using more and more of an AWS resource? How do we manage the billing in general for all of those accounts? And if a company is in a business that requires it to comply with a set of standards such as PCI or HIPAA, how can we guarantee that teams are using only services that are certified compliant? And how can we automate the creation of new accounts so that they are properly configured to begin with?

What Are AWS Organizations?

AWS Organizations allows companies with multiple AWS accounts to manage those accounts from a billing and administrative perspective from a single root account. Why is this important? Until Organizations came along, I like to think of having multiple accounts as being like the Wild West. Each account was on its own and there was no way to manage all of them from one place. Users had no way to apply policies, manage permissions, or manage billing from a “company” perspective. AWS Organizations gives us the tools we need to bring these accounts together and control them all in a predictable way.

Service Control Policies (SCPs)

Service Control Policies allow us to define the services that an account can access. In our case, we know that we want to allow access to only the services that are HIPAA compliant. Any service that isn’t compliant should not be allowed to be used by the teams. Using the root account, we can push this policy out to all accounts that we have within our organization.

Organizational Units (OUs)

Most organizations have accounts that have different requirements. Using the example above, some accounts may have to be HIPAA compliant while others may be used for other purposes and do not have to follow any guidelines. AWS Organizations gives us the ability to group accounts into Organizational Units.

Organizational Units allow us to split our accounts into separate groups and apply different policies to those groups. Continuing with the example from above, we can have an OU for all accounts that must be HIPAA compliant and an OU for accounts that are general purpose. All accounts in the HIPAA OU will be restricted to only the services that are HIPAA certified, while the accounts in the general purpose OU have access to all AWS services. The rules applied to an OU even overrule account administrators. If an admin logs into an account and explicitly sets permissions in that account to allow access to a service that has been restricted at the OU level, the OU rule applied to the account will still block that access.

OUs can be nested up to 5 levels deep; you can have multiple OUs inside of an OU. This allows even more granular control over accounts. As an example, let’s assume that some of our HIPAA accounts also handle patient transactional data. This means that we are dealing with both PCI and HIPAA data in those accounts. We can create an OU inside of our HIPAA OU that restricts access to only services that are PCI compliant. The result is that at the first level we have accounts that can only access HIPAA compliant services, while in the PCI OU under the HIPAA OU we have accounts that can only access services which are both HIPAA compliant and PCI compliant.

One thing that must be remembered is that the root or “master” account cannot be restricted. Even if it is placed within an OU, none of the AWS services will be restricted for this account. Therefore, it is essential that the root account is not used by anyone other than the administrator of all accounts.

Account Creation Automation

It is often the case that a company will grow and will add teams as they are needed. These new teams will sometimes need their own set of accounts to work in to avoid disrupting the work of other teams. AWS Organizations provides the ability to automate this task. We can create an account, attach policies to it, and add it to the appropriate group, all through the Organizations API. Not only is this useful for new teams, but it is also useful when developers need test accounts that can be created quickly and then deleted when the work within them is finished.
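
As a rough sketch (the email address and account name here are placeholders), automated account creation through the CLI looks like this:

# Request a new member account; the call is asynchronous and returns a CreateAccountStatus
aws organizations create-account --email dev-team@example.com --account-name "Dev Team Sandbox"

# Poll for completion of the request
aws organizations list-create-account-status --states IN_PROGRESS

# Once the account exists, move it into the appropriate OU (covered later in this post)
aws organizations move-account --account-id NEW_ACCOUNT_ID --source-parent-id PARENT_ORG_ID --destination-parent-id OU_ID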

How Does All of That Help Me?

Let’s take a look at an example and apply the tools above to solve the problems that companies with multiple accounts face. Let’s assume we have a health care company with a wide range of systems under its control. Some systems house identifiable patient data, which requires those systems to be HIPAA compliant, and some systems simply house generic data that can be used to generate high-level reports. The latter systems do not require any special treatment. One other platform the company has allows patients to log in and make payments. This platform allows users to store their credit card data for future transactions, which means the services it uses must be PCI compliant.

Where Do We Start?

Before we begin, we need to gather our requirements. We know that our company must be both HIPAA and PCI compliant, so we can start by breaking the teams down into groups based on the standards they must follow.

Compliance                                             Number of Teams
HIPAA                                                               9
PCI                                                                 7
HIPAA and PCI (these overlap with the groups above)                 4
None                                                                3

Once we have our teams broken out into groups, we need to know how many accounts each team has. For this example, we are going to assume each team has 4 accounts: Dev, Test, QA, and Prod. Note that we have a group of 4 accounts that overlap in service restriction requirements. Unfortunately, Organizations will not allow an account to belong to 2 Organizational Units at the same hierarchical level. We will discuss how to achieve this later when we create our OUs and begin adding accounts to them.

Once we have our accounts grouped we are ready to start planning our organization. The resulting Organization will have this overall structure:

[Diagram: the resulting AWS Organizations structure]

LIMITATION ALERT:

It’s worth noting at this point that AWS Organizations treats accounts differently depending on how they were originally created. The Organizations API provides the ability to remove an account from the Organization, but only if that account was invited to join the organization. If the account was created by the organization, that account cannot be removed without deleting it entirely. The Organizations API also does not provide the ability to delete an account, no matter how it was created. To delete an account, you must log into that account and do so manually. These limitations may influence how companies want to handle bringing accounts into an organization.

One other important fact we need to know is that the account that owns the user we use to create the Organization will become the master account. Make sure never to create an Organization from an account that needs to have policies applied to it. A master account will always have “root” access, even if it is moved to an Organizational Unit that restricts services. The services of the master account cannot be restricted and the wide-open policies will always override anything that is more restrictive.

Once we have our account information, let’s move on to creating the organization.

Creating an Organization

Before we begin, we need to make sure we have the AWS command line tools installed on the OS of our choice. Organizations can also be managed using the AWS SDK for your language of choice, but we’re going to use the command line tools for this example. Again, make sure you are using a user from the account you want to be the master, and that this user is configured with your CLI tools. Once our configs have been verified, we can issue the following command:

Minimum permissions for your user:

  • organizations:CreateOrganization
aws organizations create-organization --feature-set ALL

Notice that we are passing in a parameter to the create-organization command called “feature-set”. This tells AWS what control the organization will have over our accounts. There are 2 options we can pass in here:  ALL, CONSOLIDATED_BILLING. The ALL parameter value enables consolidated billing and also allows the organization to put policies in place that can restrict the services the account can access. This is the default value if this parameter is omitted. A value of CONSOLIDATED_BILLING will allow the new organization to consolidate the billing of all accounts under the master account. The Organization will not be allowed to restrict the services each account has access to. For our company, we need ALL functionality so we retain the ability to control access for some accounts to only HIPAA and PCI compliant services.

After running this command, we get back a response from AWS:

{
  "Organization": {
    "AvailablePolicyTypes": [{
      "Status": "ENABLED",
      "Type": "SERVICE_CONTROL_POLICY"
    }],
    "MasterAccountId": "111111111111",
    "MasterAccountArn": "arn:aws:organizations::111111111111:account/o-exampleorgid/111111111111",
    "MasterAccountEmail": "bill@example.com",
    "FeatureSet": "ALL",
    "Id": "o-exampleorgid",
    "Arn": "arn:aws:organizations::111111111111:organization/o-exampleorgid"
  }
}

We need to capture the “Id” value and keep that for future use.

Let’s Add Some Accounts

Inviting Accounts

Now that we have a newly created Organization, we can start adding accounts to it. As mentioned above, there are 2 ways to add an account to an Organization. The first method, and the one we’ll use primarily in this example, is to send an invitation to accounts that already exist.

It’s important to reiterate that any account we invite to our Organization can be removed at any time. If we want our accounts tied to this Organization without the option to be removed (as a way of ensuring our policies are always in place), we need to create those accounts from within the Organization, and any resources would have to be migrated from the existing accounts to the new ones.

To send an invitation to an existing account, we can issue the following command:

Minimum permissions for your users:

  • organizations:DescribeOrganization
  • organizations:InviteAccountToOrganization
aws organizations invite-account-to-organization --target '{"Type": "ACCOUNT", "Id": "ACCOUNT_ID_NUMBER"}'

We are passing in a data structure to the target parameter of the command. In this example, we are passing in the account ID. The key Type can also have values of EMAIL or ORGANIZATION. In those cases, we would set the Id to the appropriate value.

Another optional parameter we could have passed is “notes”. If we want to include additional information in the email that is auto-generated by Organizations, we can pass that information using the “notes” parameter.

The response from this command should look like this:

{
  "Handshake": {
    "Action": "INVITE",
    "Arn": "arn:aws:organizations::111111111111:handshake/o-exampleorgid/invite/h-examplehandshakeid111",
    "ExpirationTimestamp": 1482952459.257,
    "Id": "h-examplehandshakeid111",
    "Parties": [{
      "Id": "o-exampleorgid",
      "Type": "ORGANIZATION"
    },
    {
      "Id": "juan@example.com",
      "Type": "EMAIL"
    }],
    "RequestedTimestamp": 1481656459.257,
    "Resources": [{
      "Type": "MASTER_EMAIL",
      "Value": "bill@amazon.com"
    },
    {
      "Type": "MASTER_NAME",
      "Value": "Org Master Account"
    },
    {
      "Type": "ORGANIZATION_FEATURE_SET",
      "Value": "FULL"
    },
    {
      "Type": "ORGANIZATION",
      "Value": "o-exampleorgid"
    },
    {
      "Type": "EMAIL",
      "Value": "juan@example.com"
    }],
    "State": "OPEN"
  }
}

Once again, we are interested in the “Id” value of the “Handshake” object. Each time we run the command to invite an account, we will receive this “Id” back in the response. We need to record that value for each account we invite so we can use it in the next step to accept the invitation.

Accepting Invitations

The process of inviting and adding an account to an organization is a “handshake” transaction. An invitation is sent to the account we want to add to our organization and the “owner” of that account must log in and accept that invitation. Fortunately for us, this can also be accomplished through the CLI. Again, we need to make sure our CLI is configured with a principal user that has the IAM permissions to accept that handshake. Once we have the CLI configured, we can issue the following command:

Minimum permissions for your user:

  • organizations:ListHandshakesForAccount
  • organizations:AcceptHandshake
  • organizations:DeclineHandshake
aws organizations accept-handshake --handshake-id HANDSHAKE_ID

The handshake ID that is being passed into this command was given to us in the response of the command to send the invitation.

Remember that we can also send and accept invitations through the console. For users with a few accounts, this may be acceptable. But if you are dealing with more than a few accounts you are definitely going to want to automate this process.

LIMITATION ALERT:

AWS limits the number of invitations that can be sent per day to 20. If you need to send more than that, contact customer support and they will raise your limit.

Using Organizational Units

Here’s where the real power of Organizations starts to show. Now that we have our accounts added to the Organization we need to group them into OUs and restrict the services that can be used within those accounts. Before we started creating the Organization, we took the time to group our accounts by the compliance standard they needed to adhere to. We can use that information to help us create our OUs to move our accounts into. Looking at our chart we can see that we have four different types of accounts. We have HIPAA compliant, PCI compliant, HIPAA and PCI compliant, and accounts that require no restrictions at all. We are going to create three top-level OUs and one OU that is within either the PCI or the HIPAA OU. Because we are simply overlapping 2 sets of compliance standards, it really doesn’t matter which OU we use as a parent.

We’ll start by creating the three top-level OUs. We can issue the following commands to create those:

Minimum permissions for your user:

  • organizations:CreateOrganizationalUnit
aws organizations create-organizational-unit --parent-id PARENT_ORG_ID --name HipaaOU
aws organizations create-organizational-unit --parent-id PARENT_ORG_ID --name PciOU
aws organizations create-organizational-unit --parent-id PARENT_ORG_ID --name GeneralOU

We now have three top-level Organizational Units that we can add accounts to. We have already invited all existing accounts to our Organization. They reside at the top-level of our Org. To place those accounts into the proper OU we need to issue the “move” command on each account.

Minimum permissions for your user:

  • organizations:MoveAccount
aws organizations move-account --account-id ACCOUNT_ID --source-parent-id PARENT_ORG_ID --destination-parent-id OU_ID

We will need to issue this command for each account we need to move to an OU. We need to make sure we are using the correct destination ID to place the account into the proper OU.

We need to repeat the last 2 steps to create the sub OU for our overlapping HIPAA and PCI accounts. This time around the PARENT_ORG_ID will be changed from the ID of the organization itself to the ID of the organizational unit we want to create this sub OU in. We will create this OU within the HipaaOU that we created in the previous step.

And we can move those accounts that require both HIPAA and PCI compliance into this new OU using the same command we used to move the other accounts.
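
As a sketch (the OU name and IDs are placeholders), creating the nested OU and moving an account into it looks like this:

# Create the sub OU underneath the HipaaOU
aws organizations create-organizational-unit --parent-id HIPAA_OU_ID --name HipaaPciOU

# Move an overlapping account; --source-parent-id is wherever the account currently resides
aws organizations move-account --account-id ACCOUNT_ID --source-parent-id HIPAA_OU_ID --destination-parent-id HIPAA_PCI_OU_ID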

Service Control Policies

Simply moving accounts into OUs accomplishes nothing on its own. In order to take advantage of the power of these new OUs, we need to apply policies that will restrict the services that the accounts within the OU can access. At the time of this writing, Service Control Policies are the only policies that can be applied to an OU.

In order to apply a Service Control Policy to our account, we need to create a policy file that we can pass into the create-policy command. We could place this text within the command itself, but with the number of services we need to include and the fact that we have to escape characters, that approach is error-prone and very messy. Here’s what our policy file will look like:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ec2:*",
      "rds:*",
      "dynamodb:*"
    ],
    "Resource": "*"
  }]
}

In the above policy file, we are explicitly allowing a few services. There are many more HIPAA compliant services, but for the sake of this example, we are going to limit the policy to these three services.

TRAP FOR YOUNG PLAYERS:

It needs to be mentioned here that Service Control Policies applied to an OU do not grant any user any rights. We are not pushing this policy as a way to give each user in the accounts in the OU access to these services. The policy is in place to restrict the permissions that can be applied to a user, and it applies to all users, including administrators.

It’s also worth noting that the policies we are putting in place to restrict services assume that the “Allow *” policies have been removed from the root, OU, and individual accounts. If “Allow *” is still in place in any of these locations, the above policy will have no effect on the account(s) it is applied to.

We need to create two additional policy files, one for each additional OU type. Because we removed the “Allow *” policy from all accounts, OUs, and the root Organization, we will need to create a policy file for our GeneralOU that allows all services for that OU. We will reuse the PCI policy file for the sub OU that allows both HIPAA and PCI services.

Once we have our policy files in place, we can start creating those policies:

Minimum permissions for your user:

  • organizations:CreatePolicy
aws organizations create-policy --content file://allow_hipaa_policy.json --name AllowHipaaServices --type SERVICE_CONTROL_POLICY --description "This policy allows all HIPAA services"
aws organizations create-policy --content file://allow_pci_policy.json --name AllowPCIServices --type SERVICE_CONTROL_POLICY --description "This policy allows all PCI services"
aws organizations create-policy --content file://allow_all_policy.json --name AllowAllServices --type SERVICE_CONTROL_POLICY --description "This policy allows all services"

We have created three new policies that now need to be attached to our OUs.

Minimum permissions for your user:

  • organizations:AttachPolicy
aws organizations attach-policy --policy-id HIPAA_POLICY_ID --target-id HIPAA_OU_ID
aws organizations attach-policy --policy-id PCI_POLICY_ID --target-id PCI_OU_ID
aws organizations attach-policy --policy-id GENERAL_POLICY_ID --target-id GENERAL_OU_ID
aws organizations attach-policy --policy-id PCI_POLICY_ID --target-id HIPAA_PCI_OU_ID

Let’s take the time to examine what is happening here. We know that we have removed all permissions for all services for our root Organization, OUs, and accounts. We created policies that allow services that are compliant with HIPAA and PCI respectively. And we know that when we apply those policies to our OUs, the accounts within those OUs will have access to those services. Remember that SCPs filter permissions rather than add them: the sub OU that holds the overlapping accounts still inherits the restriction to HIPAA services from its parent OU. Applying the AllowPCIServices policy to the sub OU means its accounts can access only the services allowed by both policies, in other words services that are both HIPAA and PCI compliant.

Conclusion

Success! We have created a new Organization, invited our accounts into that organization, and grouped those accounts into OUs so we could ensure each group of accounts is compliant with the required standards. When dealing with a few accounts, working from the command line is fine. For larger numbers of accounts, it is highly recommended to script this process.

AWS Organizations helps companies manage multiple accounts from a billing and policy standpoint. Organizations helps prevent accidental security policies that violate the compliance standards a company must follow. It also reduces the time and effort required to create new accounts by providing an API that allows the auto-creation of new accounts with the correct policies already attached. Users can be restricted to the accounts they need access to and blocked from the accounts they don’t. Any company with multiple accounts can benefit from the features provided by Organizations.

About Stelligent
Stelligent is an APN Advanced Consulting Partner and holds the AWS DevOps Competency. As a technology services company that provides DevOps Automation on Amazon Web Services (AWS) Cloud, we aim for “one-click deployment.” Our reason for being is to help our customers gain the ability to continuously deploy their software, when they want to, and with confidence. We’ve been providing DevOps Automation solutions on AWS since 2009. Follow @Stelligent on Twitter. Learn more at http://www.stelligent.com.

DevOps in AWS Radio: Goss (Episode 9)

In this episode, Paul Duvall and Brian Jakovich cover recent DevOps in AWS news and speak with Ahmed Elsabbahy about Goss, a ServerSpec alternative for testing server configuration.

Here are the show notes:

DevOps on AWS News

Episode Topics

  1. What is Goss?
  2. Why was Goss created?
  3. Why would you use Goss over ServerSpec or other server configuration testing tools?
  4. Where does Goss fit into a continuous delivery pipeline?
  5. How does Goss work with AWS?
  6. How does Goss work with production testing?
  7. Where can we find out more information about Goss?

Additional Resources

About DevOps in AWS Radio

On DevOps in AWS Radio, we cover topics around applying DevOps principles and practices such as Continuous Delivery in the Amazon Web Services cloud. This is what we do at Stelligent for our customers. We’ll bring listeners into our roundtables and speak with engineers who’ve recently published on our blog and we’ll also be reaching out to the wider DevOps in AWS community to get their thoughts and insights.

The overall vision of this podcast is to describe how listeners can create a one-click (or “no click”) implementation of their software systems and infrastructure in the Amazon Web Services cloud so that teams can deliver software to users whenever there’s a business need to do so. The podcast will delve into the cultural, process, tooling, and organizational changes that can make this possible including:

  • Automation of
    • Networks (e.g. VPC)
    • Compute (EC2, Containers, Serverless, etc.)
    • Storage (e.g. S3, EBS, etc.)
    • Database and Data (RDS, DynamoDB, etc.)
  • Organizational and Team Structures and Practices
  • Team and Organization Communication and Collaboration
  • Cultural Indicators
  • Version control systems and processes
  • Deployment Pipelines
    • Orchestration of software delivery workflows
    • Execution of these workflows
  • Application/service Architectures – e.g. Microservices
  • Automation of Build and deployment processes
  • Automation of testing and other verification approaches, tools and systems
  • Automation of security practices and approaches
  • Continuous Feedback systems
  • Many other Topics…

DevOps Benefits of Infrastructure as Code

Infrastructure and operations as code is an essential practice for realizing the advantages of modern clouds.  For enterprises looking to migrate to Amazon Web Services, Azure, or Google Cloud Platform, scripted infrastructure and automation are the key first steps through which other devops practices become accessible.  This post will enumerate some key benefits that become possible once we embrace infrastructure as code practices.

By codifying our infrastructure, we enable better testing and quality control, improved monitoring, more efficient and predictable deployments, and a lower cost of experimentation and innovation. It also decreases the mean time to recovery (MTTR) when issues do occur.

Automate your deployment and recovery processes

With infrastructure automation, reproducible environments become possible.  We can use the same automation scripts to deploy exact copies of production to development, test, and production environments.  With these consistent deployments, we are able to achieve the ever-elusive development-to-prod parity, finally putting an end to the “it worked on my machine!” problems.
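
For example, with CloudFormation the same template can be deployed to each environment, varying only the parameters (the template, stack names, and parameter here are hypothetical):

# Deploy identical infrastructure to dev and prod from the same template
aws cloudformation deploy --template-file infrastructure.yml --stack-name myapp-dev --parameter-overrides Environment=dev
aws cloudformation deploy --template-file infrastructure.yml --stack-name myapp-prod --parameter-overrides Environment=prod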

The pinnacle of infrastructure automation is the Blue/Green deployment strategy.  This strategy enables zero downtime deployments and allows us to run live tests before releasing our changes to our users.  Blue/Green Deployments take advantage of our ability to run exact copies of our environments in parallel.  By controlling when traffic is routed to our new copy, we can defer a release until we are 100% confident that our new environment is ready.

In a Blue/Green deployment, we deploy a new, isolated copy of our environment.  This new, copied environment is named Green.  It is our release candidate.  It contains our new changes and is isolated from the live environment, which we call Blue.  The Green environment is configured for production and is ready to go live, but it is launched darkly – that is, no traffic is routed to Green.  

Next, we run our acceptance tests against the live Green environment.  If we encounter an error, we can simply log the error, remove the Green environment and go back to the drawing board.  No users ever know a difference, as we never routed any live traffic to Green.  

If our acceptance tests do pass, we promote our Green environment to be the new live environment.  This can be done by changing a DNS entry to point at the Green environment or by removing the Blue environment from our load balancer and adding the Green environment to the load balancer.  

The Blue environment does not need to be automatically deleted.  If necessary, we can keep it around for a short grace period in case we need to roll back.  The rollback process would consist of reversing the traffic swap to point back at Blue.
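
As a minimal sketch of the DNS-based swap (the hosted zone ID, record name, and load balancer addresses are hypothetical), promotion and rollback are the same operation pointed at different environments:

# Point the live DNS record at the Green environment's load balancer
aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{"Value": "green-env-123456.us-west-2.elb.amazonaws.com"}]
    }
  }]
}'

# To roll back, re-run the same command with the Blue environment's load balancer address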

This is merely an overview of the Blue/Green deployment strategy.  For an in-depth discussion of Blue/Green techniques in an AWS environment, see the AWS whitepaper on the topic.

Rollback with the same tested processes

Our deployment scripts are also our rollback scripts.  Because our deployments are automated, we can reproduce the state of the infrastructure any number of times by simply re-running the deployment scripts with the same inputs.  With our codified infrastructure, we can reach back in version control to grab any commit since the repository began.  By reverting to the desired commit and re-running our deployment scripts, we can restore the state of the infrastructure as it was on any given day.
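
A sketch of that workflow, assuming the infrastructure code and a deploy script live in the same repository (both names are hypothetical):

# Find the last known-good version of the infrastructure code
git log --oneline -- infrastructure/

# Check out that commit and re-run the same deployment script used for every release
git checkout KNOWN_GOOD_COMMIT
./deploy.sh production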

Don’t Repair, Redeploy

Server time is cheap, but engineer time is expensive.  Further, troubleshooting server performance issues can be very time-consuming.  For these reasons, it no longer makes sense to troubleshoot and repair our servers.  Rather it is now more economical to destroy the old server instance and replace it with a new, working copy.

We can use our automated deployment scripts to deliver working servers to replace broken and impaired servers.  We can now follow an immutable infrastructure pattern, in which nothing ever changes on a server after it is deployed.  This helps avoid the problem of configuration drift and also greatly simplifies our operations.  Now, the only repair operation is to redeploy the service.  A service crashed?  Redeploy.  Having performance issues on a host?  Redeploy.  Lost connectivity to a host?  Redeploy.

Focus on Mean Time To Recovery

They say you can’t fix what you don’t measure, but it’s important to choose the right metrics to measure and improve upon. To traditional IT organizations, the key metric is Mean Time Between Failures (MTBF).   Server uptime is paramount, and this is the metric that gets optimized.  This leads to a reluctance to accept changes, as each change can potentially introduce a failure.  Moreover, configuration changes are generally made manually by administrators.  This leads to long-running  snowflake servers which are virtually impossible to reproduce.  This presents a very nasty challenge in restoring service availability when the inevitable failures do occur.  

Failure of an IT component means the organization is losing money.  But failures do and will happen.  In a cloud-native world, we solve this problem by turning it on its head.  Rather than trying to avoid failures, devops organizations accept that failures are a part of life and design our applications to minimize the impact of those failures by recovering gracefully.  To accomplish this, we focus on Mean Time To Recovery (MTTR) as our key metric.  By minimizing the time it takes to recover from failure, we minimize the impact of each failure.  Optimizing for MTTR necessitates automation of our processes.  Our recovery processes must be consistent and reliable.

Practice makes perfect

If we want to improve at anything, we have to practice.  Recovering from failures is no different.  We do not want the first test of our recovery processes to be during an actual disaster.  Rather, we want to test our recovery process numerous times before we actually need it.  Doing so gives us confidence that our recovery process will work as intended and restore the availability of our service.

Traditionally, creating an isolated environment for disaster recovery was too costly and time-consuming to be a feasible strategy.  The only way to test our process was to actually have a disaster.  However, with modern cloud environments, we no longer have this limitation.  Creating a new environment is an API call away.  Once we’ve codified our infrastructure, we can create a copy of our production environment by running the same code we used to create production.

We create our new copy environment to be totally isolated from our production environment.  We are now free to simulate disasters and test our recovery processes.  This can be done regularly in a low stress environment, allowing our engineering teams to troubleshoot and strategize without the added pressure of an actual outage.

Each time the process fails, we learn a little bit more.  We can then use this information to correct the problem and improve our automated recovery scripts.  At the very least, we document the known issues and add the solutions to common problems in our standard procedures.  

We should practice these failures regularly.  By the time an actual disaster occurs, we should have multiple practice runs of recovering from the disaster, as well as hundreds or even thousands of trial runs from the deployments being run with the same scripts.

Use testing tools to verify your infrastructure

With our infrastructure codified and our restore process automated, the next step is to design a set of automated tests that verify the infrastructure behaves as expected.  Because we now think of our infrastructure as a software application, we should use software testing tools to test it.  By using tools like Python’s Behave or Ruby’s RSpec, we can test that our service is behaving as expected.

These tests don’t have to be complicated, and can start out very simply.  The first test can just be “Is the service up and reachable?”  After all, this is the entire goal of the software project – if it is not up and working, it is of no use.  Then we can start to further refine our tests to include those behaviors we expect a healthy service to exhibit.  A good starting point is to hit each of our service’s endpoints in an automated fashion.  These basic tests give us a high level of certainty that the app is behaving as expected, and we can add more detailed testing to test for specific failure cases.
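
A minimal sketch of that first check, using plain curl against a hypothetical health endpoint rather than a full Behave or RSpec suite:

#!/bin/bash
# Smoke test: fail if the service does not answer with HTTP 200
BASE_URL=${1:-https://myservice.example.com}

status=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/health")
if [ "$status" -ne 200 ]; then
  echo "FAIL: $BASE_URL/health returned HTTP $status"
  exit 1
fi
echo "PASS: service is up and reachable"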

As we practice our failures and recovery process, we will find new issues that can cause our system to not operate correctly.  As these issues are discovered, we test for them and add those tests to our suite of automated tests.  These tests also double as regression tests.  When a new feature is added and a test breaks, we know exactly which change caused the service tests to fail.  As time goes by, we build a more comprehensive test suite and incrementally increase our confidence in our recovery process.  

Hook your tests into your monitoring system

Our automated test suite gives us confidence that our service is behaving correctly during deployment and recovery.  In these situations, the conditions are known and assumptions can be hidden.  But what happens when our service is used in unexpected ways, as is bound to happen when real users start using the application?  We can hook these tests into our monitoring systems and run them on a periodic basis.  In this way, we can be alerted the moment something goes wrong.  Running our tests in this fashion allows us to test against real world scenarios and serves as our first line of defense in detecting real world errors.

Conclusion

You don’t have to be a Netflix or an Airbnb to take advantage of devops practices. Fortune 500 companies and government agencies are adopting these patterns so that they can recover from failure more quickly, deploy more often, and deploy more quickly.  The prerequisite to practicing these modern devops techniques is infrastructure as code.  If your organization is looking to begin capitalizing on the benefits of modern clouds but does not know where to start, codifying infrastructure should be the first step.

Are you looking for guidance transitioning your legacy apps to AWS?  Stelligent can help!  Stelligent Migrate is our service in which we help facilitate the migration of your enterprise workloads to AWS.  If you have any questions or are interested in how Stelligent can help, please reach out to sales@stelligent.com!

Vault: A Tool for Managing Secrets

Over the last few weeks, we’ve been reminded once again of why cyber security is of the utmost importance. Not only can security breaches cause a denial of service, but they can also lead to loss of intellectual property, espionage, and many other embarrassing incidents. Rather than relying solely on the security team, every member of IT staff should be proactive in identifying security gaps.

In this blog post, we will take one small proactive measure that can go a long way: database passwords.

We’ve seen this often: a mission-critical application stores the password to a mission-critical database in a config file, which is copied around on multiple servers and checked into the source code repository. More often than not, the password in this config file is unencrypted. This setup makes access to your mission-critical data vulnerable not only to outside threats but also to a disgruntled employee or other insider threats.

There is a better solution, one that addresses both insider and outsider threats: HashiCorp Vault. Although Vault has many use cases, in this blog we’ll address the specific use case of managing database passwords and best practices for doing so.

Let us first quickly discuss the key features of Vault and how we’re going to leverage them in our solution.

  • Secret Storage: Vault can store secrets in memory (non-persistent) or use a third-party (persistent) store. In either case, secrets are encrypted before storing. In our case, we’ll use Hashicorp Consul for the persistent secret store.
  • Dynamic Secrets: Vault allows us to create on-demand secrets. We will create passwords dynamically to connect to a database; in our case, this will be a MySQL database.
  • Leasing, Renewal and Revocation: Vault allows password rotation based on the lease associated with them. In the case of an intrusion, Vault allows key revocation for system lockdown. For our demonstration, we will renew the lease with every database connection. However, in production one can set the renewal period based on their security and organizational policies.

Now that we understand Vault’s features, let’s look at how we can use it in a scenario where a Java application wants to connect to and read data from a MySQL database. Before we start using Vault we need to set it up and configure it. This is typically done by system administrators. However, with Vault there is one major difference: the almighty “unseal” process.

Vault Setup

Depending on your system, download and install Vault (install instructions) and Consul (install instructions).

Follow the step-by-step instructions provided below for Vault setup. Vault init (step 3) is done only once, when the server is started for the first time with a secret storage (Consul) that has never been used before.

The Vault init step is incredibly important and, as a best practice, it must be done in the presence of a few other stakeholders in your organization. The Vault init step outputs unseal keys and an initial root token, which should be distributed among the stakeholders. Vault splits the root token into multiple unseal keys using Shamir’s Secret Sharing algorithm.

Once initialized, a quorum is needed to unseal Vault and read configuration information from it. The total number of unseal keys generated by Vault is set with the “key-shares” parameter, and the number of unseal keys needed to establish a quorum is set with the “key-threshold” parameter. So in our case, three out of five stakeholders are needed to form a quorum. The root token should be destroyed because, if needed, Vault can issue a one-time root token by using the unseal keys.
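
As a sketch, initializing with five key shares and a quorum of three looks like this (newer Vault versions use vault operator init and vault operator unseal):

# Generate 5 unseal keys, any 3 of which are required to unseal Vault
vault init -key-shares=5 -key-threshold=3

# Three different key holders each supply one unseal key
vault unseal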

In essence, by chunking the root token (aka the root password) into multiple unseal keys and distributing those among multiple individuals, we protect against a single individual hijacking our corporate data.

MySQL Database Staging

We need to create a user in the MySQL database that Vault will use to log in and dynamically create users based on the access policies and lease times we set in Vault. In our case, we’ll create the user “vaultadmin”.
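
A sketch of staging that user (the password is a placeholder, and in production you would grant only the privileges Vault actually needs to create and revoke users):

# Create the user Vault will log in as, with permission to manage dynamic users
mysql -u root -p -e "CREATE USER 'vaultadmin'@'%' IDENTIFIED BY 'SuperSecretPassword'; GRANT ALL PRIVILEGES ON *.* TO 'vaultadmin'@'%' WITH GRANT OPTION;"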

Vault Database (MySQL Secret Backend plugin) Setup

Secret backends help store and generate secrets dynamically. In our case, we’ll use the database secret backend with the MySQL plugin to create database credentials dynamically based on configured access control policies.
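
A sketch of that setup (the hostname, role name, grants, and lease times are placeholders; newer Vault versions use vault secrets enable database instead of vault mount):

# Enable the database secret backend
vault mount database

# Tell Vault how to connect to MySQL using the vaultadmin user created earlier
vault write database/config/mysql \
    plugin_name=mysql-database-plugin \
    connection_url="vaultadmin:SuperSecretPassword@tcp(mysql.example.com:3306)/" \
    allowed_roles="readonly"

# Define a role mapping to the SQL used to create short-lived users
vault write database/roles/readonly \
    db_name=mysql \
    creation_statements="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}'; GRANT SELECT ON *.* TO '{{name}}'@'%';" \
    default_ttl="1h" \
    max_ttl="24h"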

Now that Vault is set up and configured, we need to create a mechanism to let our application authenticate to Vault so that it is able to read the database credentials. For our purposes, we will use token-based authentication:
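
A sketch of issuing that token (the policy name and file are assumptions; newer Vault versions use vault policy write and vault token create):

# Policy that only allows reading dynamic MySQL credentials
cat > mysql-app-policy.hcl <<'EOF'
path "database/creds/readonly" {
  capabilities = ["read"]
}
EOF

vault policy-write mysql-app mysql-app-policy.hcl

# Issue a token bound to that policy for the application to use
vault token-create -policy="mysql-app"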

Spring-Boot Application Setup

We then use the authentication token created above in the Spring Boot bootstrap.properties file.

The source code for the application can be found at https://github.com/stelligent/MySQL-Vault. The source code uses Spring Cloud libraries, specifically the artifacts “spring-cloud-starter-vault-config” and “spring-cloud-vault-config-databases”, to connect to Vault and the MySQL database.

Application to Database Authentication Workflow

[Diagram: application-to-database authentication workflow]

  1. Application uses token and role to authenticate to Vault.
  2. Vault validates the token and role against roles and policies stored in secret storage. If validated, the workflow continues; otherwise access is rejected.
  3. Vault’s secret/database backend connects to the MySQL database using the connection information from step 2 of the Vault Database Setup.
  4. Vault’s secret/database backend issues SQL statements to create a user in the MySQL database based on the configuration in step 3 of the Vault Database Setup.
  5. Vault passes the username/password information to the application. The application can then access the database until the lease expires. You can set the lease to expire per your organization’s password rotation policy. Once a lease expires, the credentials are no longer valid and the application will need to repeat steps 1 through 5.

Using Vault, it’s possible for applications to use dynamically generated, short-lived database passwords with a rotation policy, which is good for security and compliance. Moreover, Vault itself provides protection against a single user accessing corporate secrets. Using HashiCorp Vault is much more secure than passing passwords around in configuration files.

Did you find this post interesting? Are you passionate about working with the latest AWS technologies? If so, Stelligent is hiring and we would love to hear from you!

Service discovery for microservices with mu

mu is a tool that makes it simple and cost-efficient for developers to use AWS as the platform for running their microservices.  In this fourth post of the blog series focused on the mu tool, we will use mu to set up Consul for service discovery between multiple microservices.

Why do I need service discovery?

One of the biggest benefits of a microservices architecture is that the services can be deployed independently of one another.  However, this presents a new challenge in that it becomes difficult for clients to know the list of containers to use when invoking the service.  Here are three different approaches to address this challenge:

  • Load balancer per microservice: Create a load balancer for every microservice and add/remove containers to the load balancer as deployments and scaling events occur.  The endpoint address of the load balancer is then shared with clients through some manual process.

[Diagram: a load balancer per microservice]

There are three concerns with this approach.  First, the endpoint address of the load balancer must never change or else all the clients will be broken and require updates to take the new endpoint address.  This can be addressed via DNS CNAME records, but still requires that the name chosen for the record must not change.  Second, there is the additional cost of a load balancer for every microservice.  Finally, there is additional latency introduced with adding a load balancer between each microservice invocation.

  • Shared load balancer: Create a load balancer that is shared by all microservices in an environment.  The load balancer must have rules for each microservice to route requests by URI patterns.

[Diagram: a shared load balancer for all microservices]

The concern with this approach is that all traffic is now flowing through a single load balancer which can become a constraint in scaling the entire system.  Additionally, the load balancer becomes a shared resource amongst all the microservice teams, potentially impacting a team’s ability to operate independently of other teams.

  • Client load balancer: Load balancing from within the client is an approach in which the client has an awareness of all the containers in-service for a given microservice.  The client can then load balance between the containers when invoking the microservice.  This approach requires a system to provide service registration and service discovery.   

[Diagram: client-side load balancing with service discovery]

The benefit of this approach is that there are no longer load balancers between each microservice request, so all the concerns with the prior approaches are addressed.  However, a new type of microservice, an edge service, will need to be deployed to allow clients outside the microservice environment (that do not have access to service discovery) to invoke the service.

The preferred approach is the third: service discovery and client-side load balancing within the microservice environment, and edge routing with traditional load balancing for clients outside the microservice environment.  This approach provides the lowest latency and most loosely coupled solution for microservice invocation.

Let mu help!

The environment that mu creates for your microservice can manage the provisioning of Consul for service discovery and registration of your microservices.  Consul is a sort of phonebook for microservices.  It provides APIs for services to register their endpoints and for clients to lookup the endpoints.
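
For example, once the services from this post are registered, a client or operator can look up the endpoints for the banana service through Consul’s HTTP API (assuming Consul’s default port of 8500):

# List the registered instances (IP and port) of banana-service
curl -s http://localhost:8500/v1/catalog/service/banana-service | jq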

Let’s demonstrate this by adding a milkshake service that invokes the banana service from the first post.  Additionally, we will create a zuul-router service to provide an edge service via Netflix’s Zuul.  Zuul is a proxy service that serves as the front door for all requests from outside the microservice environment.  Zuul will use Consul for service discovery to determine where best to route the incoming request.  Additionally, Zuul provides an excellent location to enforce policies such as authentication, authorization or logging on all incoming requests.

Enabling Consul and Edge Router

The first thing we will want to do is set up our edge router with Zuul.  This is just a matter of adding the @EnableZuulProxy and @EnableDiscoveryClient annotations to the Spring Boot application:

@SpringBootApplication
@EnableDiscoveryClient
@EnableZuulProxy
public class ZuulRouterApplication {

   public static void main(String[] args) {
     SpringApplication.run(ZuulRouterApplication.class, args);
   }
}

Zuul is configured via the application.yml file in src/main/resources.  For each service that we want exposed via the edge router, we add URI path patterns:

spring:
  application:
    name: zuul-router
zuul:
  routes:
    milkshake-service:
      path: /milkshakes/**
      stripPrefix: false
    banana-service:
      path: /bananas/**
      stripPrefix: false

In order to enable Consul in your environment, you need to update the environment definition in the mu.yml file.  Additionally, you need to configure Spring Cloud Consul to connect to the docker host ip address for service discovery.  We will also want to configure Spring Cloud to not register with Consul, since mu will already configure the Registrator agent on your ECS container instances:

environments:
- name: acceptance
  cluster:
    maxSize: 5
  discovery:
    provider: consul
- name: production

service:
  name: zuul-router
  port: 8080
  pathPatterns:
  - /*
  environment:
    SPRING_CLOUD_CONSUL_HOST: 172.17.0.1
    SPRING_CLOUD_CONSUL_DISCOVERY_REGISTER: 'false'
  pipeline:
    source:
      provider: GitHub
      repo: cplee/zuul-router
    build:
      image: aws/codebuild/java:openjdk-8

Create Milkshake Service

Now we can create a new service to manage the creation of milkshakes.  The service looks very similar to the banana service, with the exception of declaring a Spring RestTemplate annotated with @LoadBalanced to enable client-side load balancing via Ribbon.


@SpringBootApplication
@EnableDiscoveryClient
public class MilkshakeApplication {

  @LoadBalanced
  @Bean
  RestTemplate restTemplate(){
     return new RestTemplate();
  }
}

Now we can use the RestTemplate to make calls directly to the banana service.  Ribbon will do a lookup in Consul for a service named banana-service and replace it in the URL with one of the container’s IP and port:

@Component
public class BananaProvider implements FlavorProvider {

  @Autowired
  private RestTemplate restTemplate;

  private List<Map<String,Object>> getAll() {
    ParameterizedTypeReference<List<Map<String, Object>>> typeRef =
            new ParameterizedTypeReference<List<Map<String, Object>>>() {};

    ResponseEntity<List<Map<String, Object>>> exchange =
            this.restTemplate.exchange("http://banana-service/bananas",HttpMethod.GET,null, typeRef);

    return exchange.getBody();
  }
}

Try it out!

After we have deployed all three services, we can use mu to confirm that all are running as expected.

~ ❯❯❯ mu env show acceptance                                                                                                                                                                                                       

Environment:    acceptance
Cluster Stack:  mu-cluster-dev (UPDATE_COMPLETE)
VPC Stack:      mu-vpc-dev (UPDATE_COMPLETE)
Bastion Host:   35.164.117.25
Base URL:       http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com

Container Instances:
+---------------------+----------+--------------+------------+-----------+--------+---------+-----------+-----------+
|    EC2 INSTANCE     |   TYPE   |     AMI      |     AZ     | CONNECTED | STATUS | # TASKS | CPU AVAIL | MEM AVAIL |
+---------------------+----------+--------------+------------+-----------+--------+---------+-----------+-----------+
| i-08e3edc8c644f0534 | t2.micro | ami-62d35c02 | us-west-2b | true      | ACTIVE |       3 |       604 |       139 |
| i-05bc14a67e53889e1 | t2.micro | ami-62d35c02 | us-west-2a | true      | ACTIVE |       3 |       604 |       139 |
| i-0b56a0d9572531e9e | t2.micro | ami-62d35c02 | us-west-2c | true      | ACTIVE |       3 |       604 |       139 |
| i-05b2188a5c575fbeb | t2.micro | ami-62d35c02 | us-west-2b | true      | ACTIVE |       1 |       624 |       739 |
+---------------------+----------+--------------+------------+-----------+--------+---------+-----------+-----------+

Services:
+-------------------+---------------------------+------------------+---------------------+
|      SERVICE      |         IMAGE             |      STATUS      |     LAST UPDATE     |
+-------------------+---------------------------+------------------+---------------------+
| milkshake-service | milkshake-service:9e4bcd9 | CREATE_COMPLETE  | 2017-05-12 11:33:05 |
| zuul-router       | zuul-router:3d4795c       | UPDATE_COMPLETE  | 2017-05-12 12:09:47 | 
| banana-service    | banana-service:3b62124    | UPDATE_COMPLETE  | 2017-05-12 11:32:55 |
+-------------------+---------------------------+------------------+---------------------+

We can then use curl to get a list of all the bananas available via the banana-service:

curl -s http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com/bananas | jq
[
  {
    "pickedAt": null,
    "peeled": null,
    "links": [
      {
        "rel": "self",
        "href": "http://mu-cl-ecsel-144kxqmiry9wi-1411768500.us-west-2.elb.amazonaws.com/bananas/9"
      }
    ]
  }
]

Next we try to create a milkshake using the milkshake-service:

~ ❯❯❯ curl -s -d "{}" -H "Content-Type: application/json" http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com/milkshakes\?flavor\=Banana | jq                                                                         
{
  "timestamp": "2017-05-15T19:12:56.640+0000",
  "status": 500,
  "error": "Internal Server Error",
  "exception": "org.springframework.web.client.HttpClientErrorException",
  "message": "429 Not enough bananas to make the shake.",
  "path": "/milkshakes"
}

Looks like there aren’t enough bananas to create a milkshake.  Let’s create another banana:

~ ❯❯❯ curl -s -d "{}" -H "Content-Type: application/json" http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com/bananas

~ ❯❯❯ curl -s http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com/bananas | jq                                                                                                                         
[
  {
    "pickedAt": null,
    "peeled": null,
    "links": [
      {
        "rel": "self",
        "href": "http://mu-cl-ecsel-144kxqmiry9wi-1411768500.us-west-2.elb.amazonaws.com/bananas/9"
      }
    ]
  },
  {
    "pickedAt": null,
    "peeled": null,
    "links": [
      {
        "rel": "self",
        "href": "http://mu-cl-ecsel-144kxqmiry9wi-1411768500.us-west-2.elb.amazonaws.com/bananas/10"
      }
    ]
  }
]

Now let’s try creating a milkshake again:

~ ❯❯❯ curl -s -d "{}" -H "Content-Type: application/json" http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com/milkshakes\?flavor\=Banana | jq
{
  "id": 3,
  "flavor": "Banana"
}

This time it worked, and if we query the list of bananas again, we see that 2 have been deleted for the milkshake:

~ ❯❯❯ curl -s http://mu-cl-EcsEl-144KXQMIRY9WI-1411768500.us-west-2.elb.amazonaws.com/bananas | jq                                                                                                                        
[]

Conclusion

Decomposing a monolithic application into microservices presents an interesting challenge in enabling services to invoke one another while still keeping them loosely coupled.  Using a client-side load balancer like Ribbon along with a service discovery tool like Consul provides an excellent solution to this challenge.  As demonstrated in this post, mu makes it simple to enable service discovery in your microservice environment to help achieve this solution.  Head over to stelligent/mu on GitHub and get started!

Additional Resources

Did you find this post interesting? Are you passionate about working with the latest AWS technologies? If so, Stelligent is hiring and we would love to hear from you!

Cloud Custodian Cleans Up Your Cloud Clutter

AWS allows you to build enormous and complex cloud infrastructures in a matter of hours. With the ability to create resources so easily, it can be hard to manage them all. If only there were a simple but powerful tool that could do it for you. Cloud Custodian (a.k.a. c7n) is a Python CLI tool that gives you powerful account management capabilities through a simple policy config file and time-based or event-based Lambdas. The YAML-formatted config files let you define policies for a wide variety of management activities, from tag compliance and backups to garbage collection and encryption.

Cloud computing has made creating and managing web resources insanely easy, quite possibly too easy. You can now spin up quite a few computing, database, and storage resources with the click of a button or the stroke of a return key. However, if you use a company account, you likely spin up those resources often for demonstration and testing purposes, without considering the cost or clutter you might be creating along with it. This was “the problem” at Capital One when they created this very powerful tool for managing the cleanup of your superfluous cloud resources. Capital One started developing Cloud Custodian in July 2015 and open-sourced the tool in April 2016.

Cloud Custodian’s feature set has grown rapidly along with its popularity because the maintainers are very responsive to feature requests. It has now grown to the point where there’s not much in the AWS world you can’t do with it. Here’s a short list of some things you might be surprised it can do:

  • Encryption
  • Backups
  • Garbage Collection
  • Unused Resources
  • Off-hours
  • Tag Compliance
  • SG Compliance

Odds are, though, you’re considering Cloud Custodian for its namesake: cleaning up your AWS account, managing resources and costs during off-hours, and overall garbage collection. True to its name, this is where Cloud Custodian excels. With a relatively simple configuration file you can tidy and trim your AWS account and keep it that way as you grow your business.

Here’s a very basic example custodian.yml file that stops all EC2 instances tagged with Custodian:

policies:
  - name: stop-instances
    resource: ec2
    filters:
      - "tag:Custodian": present
    actions:
      - stop

Cloud Custodian is great for mid-to-large-sized companies that give a large number of their employees full access to a company AWS account. Naturally, such an account quickly becomes cluttered with dozens of CloudFormation stacks, VPCs, old test instances, and Lambda functions. Here at Stelligent we have an AWS Labs account for exploring and testing in AWS. We use Cloud Custodian to clean up old testing resources based on age and resource tags.
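
A rough sketch of that kind of cleanup policy is below; it is for illustration only (not our actual config), and the tag name and age threshold are assumptions:

policies:
  - name: terminate-old-test-instances
    resource: ec2
    filters:
      - type: instance-age          # instances older than the threshold below
        days: 14
      - "tag:NoCustodian": absent   # spare anything explicitly tagged to keep
    actions:
      - terminate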

Your Cloud Custodian Strategy


As of today, there’s not much Custodian can’t do in your AWS account, so it’s worth exploring what Cloud Custodian can do for you before deciding on your overall strategy. Here are four common use cases:

  • Automatic clean-up: Using the mode property, you can run actions in response to a variety of CloudWatch Events, including scheduled ones; a sketch follows this list. Read more
  • Monitoring your environment: This is one of my favorite features. Custodian generates CloudWatch metrics by default so it’s easy to throw together dashboards that give you full visibility into what is being managed by Custodian and what isn’t. It’s hard to get good visibility in a vast system like AWS. Read more
  • Stopping Resources during Off-hours: Custodian makes it very easy to set up Off-hours for your resources based on tags. Below is an example. Read more

    policies:
      - name: offhours-stop
        resource: ec2
        filters:
          - type: offhour
            tag: downtime
            onhour: 8
            offhour: 20
        actions:
          - stop
  • Tag-compliance: One of the most common use cases for Custodian is tag compliance; a sketch combining it with the mode property follows this list. You can manage tag-compliance policies for your entire account in a single config file. You can even check it into version control. And, if you’re really ambitious, you can create a pipeline that watches your version control system and runs Custodian for you, so you don’t have to activate virtualenv on your personal machine every time you want to make a change. Read more
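
As a sketch of how the mode property and a tag-compliance check could work together (the schedule expression, IAM role ARN, required tag, and grace period below are placeholder assumptions, not values from a real account), a scheduled policy that marks untagged instances for a later stop might look like this:

policies:
  - name: mark-untagged-instances
    resource: ec2
    mode:
      type: periodic                                             # deploy as a scheduled Lambda via CloudWatch Events
      schedule: "rate(1 day)"                                    # assumed schedule expression
      role: arn:aws:iam::123456789012:role/CustodianLambdaRole   # placeholder execution role
    filters:
      - "tag:Owner": absent                                      # assumed required tag
    actions:
      - type: mark-for-op                                        # tag the resource so a later policy can act on it
        op: stop
        days: 4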

Prereqs and Pro Tips

Cloud Custodian is very well documented, so if you’re excited to start taking out the digital trash in your AWS account there’s no better place to start than their website and documentation. There are a few things to keep in mind before diving head-first into the cloud equivalent of the custodial arts:

Prereqs

At Stelligent we’re a tiny bit obsessed with one-command solutions. I have to admit, I cringed a little when I saw that Cloud Custodian took three or more commands to install and run. In the Getting Started section of the docs, Custodian requires you to have Python, pip, and virtualenv installed before you can even install Cloud Custodian. Then, once you activate virtualenv to install and run it the first time, you’ll need to re-activate the virtualenv every time you want to run it in the future. That’s why I recommend using a pipeline or CloudFormation template, which brings us directly to our pro tips:

Pro Tip #1 – Minimalist Custodian

The easiest way to get started cleaning up your AWS account with Custodian is to go through your account and tag everything you want to keep with something like “NoCustodian”. Then, set up a policy that stops any instance missing that tag:

policies:
  - name: stop-untagged-instances
    resource: ec2
    filters:
      - "tag:NoCustodian": absent
    actions:
      - stop

Click the button below to launch an example CloudFormation Stack that boots an EC2 instance and then uses Custodian to stop the instance.

Pro Tip #2 – Don’t piss off your co-workers

The first thing you’ll be tempted to do when implementing Cloud Custodian is terminate all the old and un-used resources in your account. Just be sure all the relevant parties in your company know what you’ll be terminating and when.

Pro Tip #3 – Use a Pipeline

Set up a CodePipeline that keeps your custodian.yml file in source control and re-runs it with CodeBuild every time you commit a change.
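
To make that concrete, here is a minimal buildspec sketch that such a CodeBuild project could use; it assumes a build image with Python and pip available, that your policies live in custodian.yml at the repository root, and an output directory name of my choosing:

version: 0.2

phases:
  install:
    commands:
      - pip install c7n                           # installs the custodian CLI
  build:
    commands:
      - custodian validate custodian.yml          # fail fast on an invalid policy file
      - custodian run -s output custodian.yml     # execute the policies; results land in ./output

artifacts:
  files:
    - output/**/*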

In Summary

If you need better visibility and automated management of your AWS account, Cloud Custodian has lots of helpful features that are easy to manage in a single config file. If you aren’t already a Python developer, I recommend setting up a CloudFormation template or automated pipeline to manage changes to your custodian.yml. You can use the launch button in Pro Tip #1 to see an example of Custodian in a CloudFormation template. Keep an eye out for a future blog post with a detailed example of a fully automated Custodian pipeline.

Links
Custodian Website
Custodian Docs
Custodian GitHub

Videos
AWS This Is My Architecture: Cloud Custodian
Cloud Custodian @ AWS re:Invent
Cloud Custodian @ Serverlessconf


Stelligent is hiring! Do you enjoy working on complex problems like figuring out ways to automate all the things as part of a deployment pipeline? Do you believe in the “one-button everything” mantra? If your skills and interests lie at the intersection of DevOps automation and the AWS cloud, check out the careers page on our website.

Microservice databases with mu

mu is a tool that makes it simple and cost-efficient for developers to use AWS as the platform for running their microservices.  In this third post of the blog series focused on the mu tool, we will use mu to manage microservice databases in the pipeline we built in the first post.  

Why should my microservice manage the database?

As discussed in prior posts, adopting a microservice architecture can increase a team’s ability to deliver software faster through decoupling and team autonomy.  By decomposing an application into microservices and then giving teams complete ownership of their microservices, the teams can then make decisions and implement changes independent of other teams and their microservices.

Unless the same approach is taken to decompose the databases that support the microservices, the benefits of microservices will be limited by cross-team dependencies on shared databases. When your microservices share a database, you are in effect using the database as an API between the services.  This type of architecture causes tight coupling between services and will likely require regression testing and even deployment of multiple services at the same time.

Martin Fowler, in his post titled Microservices, says “Microservices prefer letting each service manage its own database.”  By decomposing all the way down into the database, you can realize the benefits of agility that microservices have to offer.

Decentralised data management (source: https://martinfowler.com/articles/microservices.html)

Let mu help!

The continuous delivery pipeline that mu creates for your microservice can manage the provisioning of a database.  Additionally, the details about the database can be injected into your service as environment variables.

Let’s demonstrate this by adding a database to the microservice pipeline we created in the first post for the banana service.

Define the database

Previously, the banana service was using an embedded H2 database.  This won’t work in a production environment, so we need an RDS database instance that the microservice can use.  Adding a database for a service with mu is as simple as adding a couple of lines to your mu.yml file:

service:
  name: banana-service
  port: 8080
  pathPatterns:
  - /bananas
  database:
    name: banana

By default, this will create an RDS database instance of class db.t2.small with the Aurora engine.  Next, we need to reference the database from our microservice.  We can pass the database URL and credentials via environment variables:

service:
  name: banana-service
  port: 8080
  pathPatterns:
  - /bananas
  database:
    name: banana

  environment:
    SPRING_DATASOURCE_USERNAME: ${DatabaseMasterUsername}
    SPRING_DATASOURCE_PASSWORD: ${DatabaseMasterPassword}
    SPRING_DATASOURCE_URL: jdbc:mysql://${DatabaseEndpointAddress}:${DatabaseEndpointPort}/${DatabaseName}

This approach does have the disadvantage of passing database credentials as environment variables.  This presents a security issue, as any IAM user or role with access to the ECS task API would be able to discover the credentials.

AWS has recently announced IAM database authentication that can be utilized to obtain temporary database credentials from the microservice via an AWS API call.  Although we will save the details for a future blog post, for now it’s worth mentioning that mu can configure the database for IAM database authentication to work around this issue of passing credentials as environment variables.  This would be accomplished with a mu.yml like this:

service:
  name: banana-service
  port: 8080
  pathPatterns:
  - /bananas
  database:
    name: banana
    instanceClass: db.t2.medium
    iamAuthentication: true

  environment:
    SPRING_DATASOURCE_URL: jdbc:mysql://${DatabaseEndpointAddress}:${DatabaseEndpointPort}/${DatabaseName}

The configuration of the tables and the data in the database is managed with Liquibase. When the service starts, Liquibase creates or updates the database tables and data. This is accomplished by creating a file named db.changelog-master.yaml in src/main/resources/db/changelog/.
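
As a minimal sketch of what that changelog might contain (the table and column definitions below are illustrative assumptions, not taken from the actual banana-service repository):

databaseChangeLog:
  - changeSet:
      id: create-banana-table
      author: banana-service
      changes:
        - createTable:
            tableName: banana
            columns:
              - column:
                  name: id
                  type: bigint
                  autoIncrement: true
                  constraints:
                    primaryKey: true
                    nullable: false
              - column:
                  name: picked_at
                  type: datetime
              - column:
                  name: peeled
                  type: boolean

On startup, Liquibase applies any change sets in this file that have not yet been run against the database.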

Now we can commit and push our changes to cause a new run of the pipeline to occur:

$ git add --all && git commit -m "add database" && git push

We see our pipeline is green, so we have confidence that the new database is working properly with the microservice.

Conclusion

Realizing the benefits of microservices requires decomposing not just the application, but also the databases that support it.  As demonstrated in this post, mu makes it simple to manage your databases and wire them up to your microservices.  The goal is that mu empowers you to implement microservice best practices in your application.

In the upcoming posts in this blog series, we will look into:

  • Service Discovery – use mu to enable service discovery via `Consul` to allow for inter-service communication
  • Additional Use Cases – deploy applications other than microservices via mu, like a WordPress stack

Until then, head over to stelligent/mu on GitHub and get started!

Additional Resources

Did you find this post interesting? Are you passionate about working with the latest AWS technologies? If so, Stelligent is hiring and we would love to hear from you!

Migrating from ServerSpec to InSpec

Key Benefits

  • OS and CM agnostic
  • InSpec runtime is faster than ServerSpec
  • DRY – Componentization of test suites using shared InSpec Profiles
  • Removes downloading of Gem files on every test execution
  • Quick to Change from ServerSpec to InSpec
  • CLI binary available to run tests from CI/CD
  • Allows usage of community test suites
  • Allows usage of InSpec Profiles from a Chef Compliance server
  • Allows compliance reporting to Chef Automate Visibility server

ServerSpec has been a great automated integration test framework for years, but it has always had some limitations, such as downloading several gems every time tests are run with Test Kitchen and the inability to easily share test suites across multiple cookbooks or delivery solutions. Chef took on the challenge of extending ServerSpec to solve many of these inadequacies with a project named InSpec. It supports the same fundamental syntax as ServerSpec and adds extended capabilities.

InSpec does not require downloading gems during testing. ChefDK comes with everything you need; at a minimum, that is the inspec Rubygem, along with the Test Kitchen extension gem named kitchen-inspec. This alone makes for faster testing, and it also helps in environments that may not have internet access.

InSpec also brings shared capabilities by way of what Chef calls InSpec Profiles. InSpec Profiles can be as simple or as complex as you would like to make them. With Profiles, you can also pass arguments. For example, you can have a shared InSpec Profile hosted in its own Git repo that tests for the version of Chef Client installed. You can set a default Chef Client version to check for, but allow an argument to override the default. So, you can pass an argument string to the test profile with the version you expect it to check for. I have a simple example of that here.

When using InSpec with Test Kitchen, it is possible to call multiple InSpec Profiles, remote and/or local. For example, it’s possible to have, say, a set of standard security/compliance profiles, profiles from other cookbooks that are wrapped, a baseline (bootstrapping) profile, and then a local profile that checks the specific cookbook’s configuration.

Here’s an example Kitchen config from the bonusbits_mediawiki_nginx cookbook. This goes at the root level of a test suite.

    verifier:
      inspec_tests:
        - name: bootstrap
          git: https://github.com/bonusbits/inspec_bootstrap.git
        - name: bonusbits_base
          git: https://github.com/bonusbits/inspec_bonusbits_base.git
        - path: test/inspec/profiles/bonusbits_web/
      attributes:
        chef_version: '12.19.36'

It’s fairly easy and quick to migrate ServerSpec tests to InSpec tests. In fact, you could migrate some local ServerSpec tests to InSpec in a matter of minutes.

To demonstrate how quickly we can convert ServerSpec to InSpec, I have created a short series of videos that walk through this process. Companions to the videos are GitHub reference branches for each part and wiki articles, which I have linked below:

ServerSpec to InSpec – Part 1
Creating a ServerSpec Tested Chef Cookbook


Github Branch – Part 1
Wiki Article – Part 1

ServerSpec to InSpec – Part 2
Converting Local ServerSpec to Local InSpec Tests


Github Branch – Part 2
Wiki Article – Part 2

ServerSpec to InSpec – Part 3
Converting Local InSpec to Shared Remote Tests


Github Branch – Part 3
Github Example Shared InSpec Profile
Wiki Article

Complete Walkthrough Video Playlist

Spoiler!

Ok, so if you don’t have time to watch the videos or read the wiki articles, here’s the basic conversion for local ServerSpec to local InSpec tests. This is from Part 2; Part 3 goes into the best part about InSpec, which is remote/shared tests.

Before

test
└── integration
 ├── default
 │   └── serverspec
 │      ├── nginx_spec.rb
 │      └── phpfpm_spec.rb
 └── helpers
    └── serverspec
       └── spec_helper.rb
test/integration/default/serverspec/nginx_spec.rb
require 'spec_helper'

describe 'Nginx' do
  it 'nginx installed' do
    expect(package('nginx')).to be_installed
  end

  it 'nginx service' do
    expect(service('nginx')).to be_enabled
    expect(service('nginx')).to be_running
  end
end
test/integration/default/serverspec/phpfpm_spec.rb
require 'spec_helper'

describe 'Php FPM' do
  it 'php-fpm installed' do
    expect(package('php70-fpm')).to be_installed
  end

  it 'php-fpm service' do
    expect(service('php-fpm-7.0')).to be_enabled
    expect(service('php-fpm-7.0')).to be_running
  end

  it 'nginx owns /var/log/php-fpm' do
    expect(file('/var/log/php-fpm')).to be_owned_by('nginx')
    expect(file('/var/log/php-fpm/7.0')).to be_owned_by('nginx')
  end

  it 'nginx owns /var/lib/php/7.0' do
    expect(file('/var/lib/php/7.0')).to be_grouped_into('nginx')
  end
end
test/integration/helpers/serverspec/spec_helper.rb
# Encoding: utf-8
require 'serverspec'

if (/cygwin|mswin|mingw|bccwin|wince|emx/ =~ RUBY_PLATFORM).nil?
  set :backend, :exec
  set :path, '/sbin:/usr/local/sbin:/bin:/usr/bin:$PATH'
else
  set :backend, :cmd
  set :os, family: 'windows'
end

After

test
└── integration
    └── default
        └── inspec
            ├── nginx_spec.rb
            └── phpfpm_spec.rb
test/integration/default/inspec/nginx_spec.rb
describe 'Nginx' do
  it 'nginx installed' do
    expect(package('nginx')).to be_installed
  end

  it 'nginx service' do
    expect(service('nginx')).to be_enabled
    expect(service('nginx')).to be_running
  end
end
test/integration/default/inspec/phpfpm_spec.rb
describe 'Php FPM' do
  it 'php-fpm installed' do
    expect(package('php70-fpm')).to be_installed
  end

  it 'php-fpm service' do
    expect(service('php-fpm-7.0')).to be_enabled
    expect(service('php-fpm-7.0')).to be_running
  end

  it 'nginx owns /var/log/php-fpm' do
    expect(file('/var/log/php-fpm')).to be_owned_by('nginx')
    expect(file('/var/log/php-fpm/7.0')).to be_owned_by('nginx')
  end

  it 'nginx owns /var/lib/php/7.0' do
    expect(file('/var/lib/php/7.0')).to be_grouped_into('nginx')
  end
end
.kitchen.yml
verifier:
  name: inspec

Resources

Microservice testing with mu: injecting quality into the pipeline

mu is a tool that makes it simple and cost-efficient for developers to use AWS as the platform for running their microservices.  In this second post of the blog series focused on the mu tool, we will use mu to incorporate automated testing in the microservice pipeline we built in the first post.  

Why should I care about testing?

Most people, when asked why they want to adopt continuous delivery, will reply that they want to “go faster”.  Although continuous delivery will enable teams to get to production quicker, people often overlook the fact that it will also improve the quality of the software…at the same time.

Martin Fowler, in his post titled ContinuousDelivery, says you’re doing continuous delivery when:

  • Your software is deployable throughout its lifecycle
  • Your team prioritizes keeping the software deployable over working on new features
  • Anybody can get fast, automated feedback on the production readiness of their systems any time somebody makes a change to them
  • You can perform push-button deployments of any version of the software to any environment on demand

It’s important to recognize that the first three points are all about quality.  Only when a team focuses on injecting quality throughout the delivery pipeline can they safely “go faster”.  Fowler’s list of continuous delivery characteristics is helpful in assessing when a team is doing it right.  In contrast, here is a list of indicators that show when a team is doing it wrong:

  • Testing is done late in a sprint or after multiple sprints
  • Developers don’t care about quality…that is left to the QA team
  • A limited number of people are able to execute tests and assess production readiness
  • Majority of tests require manual execution

This problem is only compounded with microservices.  By increasing the number of deployable artifacts by a factor of 10x or 100x, you are increasing the complexity of the system and therefore the volume of testing required.  In short, if you are trying to do microservices and continuous delivery without considering test automation, you are doing it wrong.

Let mu help!

The continuous delivery pipeline that mu creates for your microservice will run automated tests that you define on every execution of the pipeline.  This provides quick feedback to all team members as to the production readiness of your microservice.

mu accomplishes this by adding a step to the pipeline that runs a CodeBuild project to execute your tests.  Any tool that you can run from within CodeBuild can be used to test your microservice.

Let’s demonstrate this by adding automated tests to the microservice pipeline we created in the first post for the banana service.

Define tests with Postman

First, we’ll use Postman to define a test collection for our microservice.  Details on how to use Postman are beyond the scope of this post, but here are a few good videos to learn more:

I started by creating a test collection named “Bananas”.  Then I created requests in the collection for the various REST endpoints I have in my microservice.  The requests use a Postman variable named “BASE_URL” in the URL to allow these tests to be run in other environments.  Finally, I defined tests in the JavaScript DSL that is provided by Postman to validate the results match my expectations.

Below, you will find an example of one of the requests in my collection:

(screenshot of an example request from the Bananas collection)

Once we have our collection created and we confirm that our tests pass locally, we can export the collection as a JSON file and save it in our microservice’s repository.  For this example, I’ve exported the collection to “src/test/postman/collection.json”.


Run tests with CodeBuild

Now that we have our end-to-end tests defined in a Postman collection, we can use Newman to run these tests from CodeBuild.  The pipeline that mu creates will check for the existence of a file named buildspec-test.yml and, if it exists, will use it for running the tests.

There are three important aspects of the buildspec:

  • Install the Newman tool via NPM
  • Run our test collection with Newman
  • Keep the results as a pipeline artifact

Here’s the buildspec-test.yml file that was created:

version: 0.1

## Use newman to run a postman collection.  
## The env.json file is created by the pipeline with BASE_URL defined

phases:
  install:
    commands:
      - npm install newman --global
  build:
    commands:
      - newman run -e env.json -r html,json,junit,cli src/test/postman/collection.json

artifacts:
  files:
    - newman/*

The final change we need to make for mu to run our tests in the pipeline is to specify the image for CodeBuild to use in the test stage.  Since the tool we use for testing requires Node.js, we choose an image that has the necessary dependencies available to us.  Our updated mu.yml file now looks like:

environments:
- name: acceptance
- name: production
service:
  name: banana-service
  port: 8080
  pathPatterns:
  - /bananas
  pipeline:
    source:
      provider: GitHub
      repo: myuser/banana-service
    build:
      image: aws/codebuild/java:openjdk-8
    acceptance:
      image: aws/codebuild/eb-nodejs-4.4.6-amazonlinux-64:2.1.3

Apply these updates to our pipeline by running mu:

$ mu pipeline up
Upserting Bucket for CodePipeline
Upserting Pipeline for service 'banana-service' …

Commit and push our changes to cause a new run of the pipeline to occur:

$ git add --all && git commit -m "add test automation" && git push

We can see the results by monitoring the build logs:

$ mu pipeline logs -f
2017/04/19 16:39:33 Running command newman run -e env.json -r html,json,junit,cli src/test/postman/collection.json
2017/04/19 16:39:35 newman
2017/04/19 16:39:35
2017/04/19 16:39:35 Bananas
2017/04/19 16:39:35
2017/04/19 16:39:35  New Banana
2017/04/19 16:39:35   POST http://mu-cl-EcsEl-1K74542METR82-1781937931.us-west-2.elb.amazonaws.com/bananas [200 OK, 354B, 210ms]
2017/04/19 16:39:35     Has picked date
2017/04/19 16:39:35     Not peeled
2017/04/19 16:39:35
2017/04/19 16:39:35  All Bananas
2017/04/19 16:39:35   GET http://mu-cl-EcsEl-1K74542METR82-1781937931.us-west-2.elb.amazonaws.com/bananas [200 OK, 361B, 104ms]
2017/04/19 16:39:35     Status code is 200
2017/04/19 16:39:35     Has bananas
2017/04/19 16:39:35
2017/04/19 16:39:35
2017/04/19 16:39:35                           executed    failed
2017/04/19 16:39:35
2017/04/19 16:39:35               iterations         1         0
2017/04/19 16:39:35
2017/04/19 16:39:35                 requests         2         0
2017/04/19 16:39:35
2017/04/19 16:39:35             test-scripts         2         0
2017/04/19 16:39:35
2017/04/19 16:39:35       prerequest-scripts         0         0
2017/04/19 16:39:35
2017/04/19 16:39:35               assertions         5         0
2017/04/19 16:39:35
2017/04/19 16:39:35  total run duration: 441ms
2017/04/19 16:39:35
2017/04/19 16:39:35  total data received: 331B (approx)
2017/04/19 16:39:35
2017/04/19 16:39:35  average response time: 157ms
2017/04/19 16:39:35

Conclusion

Adopting continuous delivery for microservices demands the injection of test automation into the pipeline.  As demonstrated in this post, mu gives you the freedom to choose whatever test framework you desire and executes those tests for you on every pipeline execution.  Only once your pipeline is doing the work of assessing the microservice’s readiness for production can you achieve the goal of delivering faster while also increasing quality.

In the upcoming posts in this blog series, we will look into:

  • Custom Resources – create custom resources like DynamoDB with mu during our microservice deployment
  • Service Discovery – use mu to enable service discovery via `Consul` to allow for inter-service communication
  • Additional Use Cases – deploy applications other than microservices via mu, like a WordPress stack

Until then, head over to stelligent/mu on GitHub and get started!

Additional Resources

Did you find this post interesting? Are you passionate about working with the latest AWS technologies? If so, Stelligent is hiring and we would love to hear from you!