Screencast: Continuous Delivery for Machine Learning with AWS CodePipeline and Amazon SageMaker

The Amazon SageMaker machine learning service is a full platform that greatly simplifies the process of training and deploying your models at scale. However, there are still major gaps to enabling data scientists to do research and development without having to go through the heavy lifting of provisioning the infrastructure and developing their own continuous delivery practices to obtain quick feedback. In this talk, you will learn how to leverage AWS CodePipeline, CloudFormation, CodeBuild, and SageMaker to create continuous delivery pipelines that allow the data scientist to use a repeatable process to build, train, test and deploy their models.

Below, I’ve included a screencast of the talk I gave at the AWS NYC Summit in July 2018 along with a transcript (generated by Amazon Transcribe – another Machine Learning service – along with lots of human editing). The last six minutes of the talk include two demos on using SageMaker, CodePipeline, and CloudFormation as part of the open source solution we created.

Hi there. It’s Paul Duvall, co-founder and CTO at Stelligent. At Stelligent, we help our customers embrace DevOps practices such as continuous delivery on the AWS platform. We’re a Premier AWS partner, we have both the DevOps and Financial Services competencies, and we’ve been working with AWS for a decade now. All our engineers are AWS certified, and we work exclusively on AWS. We’ve worked with customers such as 3M, Verizon, The Washington Post, Citi, and several other financial services institutions, to name a few. I’m going to be taking you through a solution that we’ve open sourced that allows you to perform continuous delivery with Amazon SageMaker. I gave this presentation at the AWS New York City Summit in July 2018, but since there wasn’t a recording of it, I’m going to make one now. The Amazon SageMaker machine learning service is a full end-to-end platform that really simplifies the process of training and deploying your models at scale. However, there are still some major gaps in enabling data scientists to do research and development without having to go through the heavy lifting of provisioning the infrastructure and developing their own continuous delivery practices to obtain that quick feedback. So in this talk, you’re going to learn how to leverage AWS CodePipeline, CodeBuild, and S3 along with SageMaker to create a continuous delivery pipeline that allows data scientists to use a repeatable process for building, training, testing, and deploying their models. The objective really is that data scientists can spend most of their time on tweaking algorithms and experimentation, not on procuring compute resources, configuring them, and waiting days, weeks, or months before learning the results of their experiments. It’s really all about fast feedback and more experiments.

This is a vision that we often share with our customers when we first start working with them: can any authorized team member have an idea in the morning and have it confidently deployed to production in the afternoon of the same day? The answer with pretty much all of our customers is always no, at least for the applications we’re working with, but that’s the vision we’re seeking. So let’s break this down a little bit. “Any authorized team member” implies a cross-functional team member who’s trusted by the team. The idea could be a feature, a fix, or an experiment, basically any type of change: an application code change, configuration, infrastructure, data, and yes, a change to a machine learning model. So really anything that makes up the software system. Then it gets confidently deployed, which implies that it’s done in a secure manner with reduced risk, and that you have a single path to production, so you know it’s going to go through the same approved process every time on its way to production. The afternoon implies that there’s a very short cycle time as it goes through that single path to production. And since it’s in production, we aren’t talking about a toy or something trivial; we’re talking about something that’s going to be in the hands of users.

I’m going to cover the basics of machine learning in this presentation, along with how the Amazon SageMaker service helps simplify many of the challenges that data scientists face when building, tuning, training, and deploying those models. We’re going to cover some of the basic tenets of continuous delivery and how we can use the continuous delivery service, AWS CodePipeline, to orchestrate and automate the end-to-end lifecycle for machine learning. Finally, I’m going to conclude with a couple of demos. In the first demo, I go through the process of manually using the Amazon SageMaker service to build, tune, train, and deploy models. Then I’m going to show you how to automate this entire process using AWS CloudFormation along with several other tools.

So let’s talk about machine learning. Machine learning is a subset of AI in which you can start making inferences without explicitly programming this behavior into code. Instead, you develop or use algorithms that look for patterns based on, usually, lots and lots of data. If you think about it, there are three types of data-driven development: one is retrospective, another is the right-here, right-now kind of data, and the third is the future, or the inferences you can make based on the data that you’re scouring. With a retrospective look, you’re looking at the past, so you’re providing analysis and reporting. I would say this is probably the most common way to look at data. On AWS, you might be using services such as Redshift, RDS, S3, or EMR to analyze it and then something like QuickSight to view the data. With the right-here, right-now view, what you’re doing is more real-time processing and dashboards, so you might be using Amazon Kinesis, Lambda, or CloudWatch dashboards on AWS to analyze and view the data. Finally, with machine learning, you start to look at the future. You can make inferences in real time to enable smarter applications, so you’re not just reacting, you’re also inferring, and this is really where Amazon Machine Learning and SageMaker come in.

So what does a typical machine learning workflow look like from the perspective of a data scientist, and what are the types of use cases the data scientist is seeking to address? Common use cases we see include ad targeting, credit default prediction, supply chain demand forecasting, click-through prediction, and there are many others. Often the first step in a machine learning workflow is collecting and preparing the training data that the models are going to use. This involves getting the data from disparate sources. In some cases, this means converting from an unstructured to a structured format using ETL services like AWS Glue, categorizing the data, or making columns or data more consistent, things like that. The second step is that the data scientist needs to choose and optimize their machine learning algorithm and their framework. They might base it on a pre-built algorithm, such as Image Classification, XGBoost, or Factorization Machines. When they choose the framework, there are options such as TensorFlow, MXNet, or CNTK. And it’s not just choosing: you also need to install it and configure the drivers for the algorithms, libraries, frameworks, and so on. Next, you need to set up and manage your environments for training. On AWS, this means you need to choose an instance type and configure IAM roles, permissions, encryption, network configuration, hyperparameters, and really all sorts of things. This is much more challenging when you aren’t using a cloud provider. The fourth step is that you need to train and tune your model. This often includes a bunch of trial and error, ensuring that you have the right algorithm and that you’ve tweaked all the hyperparameters and the data so that you can start to make the right inferences. Next, you need to deploy your model into production. This is where you make your model available through an endpoint, so you need to establish your endpoint configuration and your endpoint, ensure it’s secure, and so on. Finally, you need to scale and manage your environments. This is where you need to anticipate and plan for the type of scale your model needs. Here you’re going to be scaling up your compute and configuring it to adapt to that scale, so on AWS you need to configure auto scaling, configure your environments for A/B testing, and all that.

So what’s the theme in most of what you see here? From my perspective, I see a lot of undifferentiated heavy lifting. This is particularly difficult outside of AWS or another cloud provider, but it’s also difficult to do on AWS if you’re only using Amazon Machine Learning, or only using, say, EC2 instances that you’re configuring all of this on. According to a survey of data scientists, they are, as you might suspect, happiest building models, mining data for patterns, and refining their algorithms. Unfortunately, data scientists spend a majority of their time futzing with infrastructure. Look at these activities: they aren’t about refining the algorithms or mining the data. It’s really infrastructure configuration, shared servers, laptops, or whatever other workarounds they might use for training the models.

Amazon SageMaker really simplifies the whole process that I just showed you on the previous slide. With SageMaker, you get a fully managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. SageMaker removes all the barriers that typically slow down developers who want to use machine learning. It allows you to create a hosted Jupyter notebook instance at the click of a button. It provides built-in, optimized support for the top 15 algorithms, including XGBoost, Factorization Machines, and Image Classification. It also supports the top frameworks, including TensorFlow and MXNet, which means you can just choose an algorithm and framework and SageMaker handles the rest. SageMaker helps with the tuning part using something known as hyperparameter optimization, or HPO. HPO actually uses machine learning to improve the machine learning model. Then you can deploy your models with SageMaker to secure endpoints in one click, and it will scale based on your needs, or you can configure how it scales. SageMaker handles the hosting, the auto scaling, and the deployment of your models, and all of this can be performed by providing configuration options and then clicking a button. SageMaker handles all of this under the covers. SageMaker was also built in a modular manner, so you can choose to use it to perform only one of these actions. For example, you can use it just to deploy your model, or just to run your training jobs, or to train and deploy while you have your Jupyter notebooks somewhere else. It’s up to you to choose how you use SageMaker.

Based on what I just went through, it’s pretty clear that SageMaker really simplifies the process of building, training, tuning, and deploying models. But what’s missing from what I’ve shown? The first thing is repeatability: being able to rerun the same configuration that you made to your training jobs, to the deployment of your model, or to the endpoint, and reducing errors by having that repeatability. You can do that through automation, and you can ultimately automate the end-to-end process for your models, your endpoints, your training jobs, even your notebook instances. The next one is how you integrate the work the data scientists are doing, in what might be a silo, with the rest of the software system. Also, how do you ensure that the work is available to others and not just a single data scientist? Then how do you test your configuration, and how do you validate your model? And in what I’ve shown so far, what triggers the training job or the deployment of the model to the endpoint in order to make those inferences? The answer is often the data scientist, so they could be sitting on a model for days or weeks without anything triggering an integration other than them clicking the button. The last part is about security and governance. This often goes back to consistency, everything being in version control, and ensuring there’s a process in place that occurs the same way every single time.

Let’s talk about continuous delivery. AWS describes continuous delivery as “a DevOps software development practice where code changes are automatically built, tested, and prepared for a release to production. It expands on continuous integration by deploying all changes to a testing environment and/or a production environment after the build stage. With continuous delivery implemented properly, developers will always have a deployment-ready artifact that has passed through a standardized test process.” And Jez Humble, the author of a book on continuous delivery, describes it as “making a release a boring push-button activity that could be performed at any time.” I think that’s pretty concise.

What you see in this diagram is that there’s a series of stages and, really, implied actions or steps, and these steps get automated as part of a fully automated workflow on the way to production.

Then here is a bit more of a detailed view of the types of activities that we often get involved in with our customers when we implement continuous delivery, including build, many types of static analysis and testing tools that you’re configuring and running, telemetry from dashboards, and so on.

AWS describes CodePipeline as “a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. CodePipeline builds, tests, and deploys your code every time there is a code change, based on the release process models you define. This enables you to rapidly and reliably deliver features and updates. You can easily build out an end-to-end solution by using pre-built plugins for popular third-party services like GitHub or integrating your own custom plugins into any stage of your release process. With AWS CodePipeline, you only pay for what you use. There are no upfront fees or long-term commitments,” just like most things at AWS. You can integrate CodePipeline into your existing build, test, and deploy process, as AWS mentions, or you can use the entire suite of AWS developer tools: your source with AWS CodeCommit, your build with AWS CodeBuild, and your tests with AWS CodeBuild as well. You also have lots of choices when it comes to deployment, such as AWS CodeDeploy and Elastic Beanstalk, and in the example I’ll show, we use CloudFormation for provisioning the SageMaker models, endpoints, and so forth. One of the nice benefits of CodePipeline is that it’s a fully managed service, meaning you don’t need to worry about provisioning or configuring EC2 instances, containers, or whatever. You just model your release orchestration using the CodePipeline console, the SDK, or CloudFormation, which I’ll be showing in a bit.

Here’s a continuous delivery pipeline for machine learning that leverages SageMaker. The source stage is configured to pull from a GitHub repository that contains all the code for the solution, including the machine learning training code, the testing code, and the infrastructure code that’s provisioned and launched as part of the pipeline. In the build and train stage, we first run cfn_nag, which is an open source linting tool for CloudFormation that we developed at Stelligent. It looks for patterns that may indicate insecure infrastructure. Then we perform model training based on the image classification algorithm in the example that we’re using, but you could use whatever algorithm you like; all you have to do is make some modifications. In this action, we make a call out to the CodeBuild service, which runs a script that calls a Python script we wrote that downloads the training and validation data in order to actually perform the training. We configured it to run quickly for this demo, so we have it calling a pipeline script that compares one image against another. That said, you could potentially use the output from CloudWatch to improve the training performance of the models as well. Really, that’s the intent of something like this: you’re using the pipeline not just to build, provision, and configure, but also to train the model itself, so it could be running for hours depending on what you’re doing.

In the QA stage, we launch the endpoint by making a call to a CloudFormation template, which creates a stack that provisions a SageMaker model, the endpoint configuration, and a secure endpoint. This endpoint is used to make inferences on that model through queries. Then, in the production stage, we have a manual approval process. It’s still part of a fully automated workflow, but someone needs to look at the QA environment and say, “Yeah, we’re good to go.” Once they’ve approved, the pipeline launches the production endpoint and then tests that as well.
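As a rough illustration of the flow just described, the sketch below models the pipeline as an ordered list of stages with a manual approval gate in front of production. It is plain Python with no AWS calls. The stage and action names paraphrase the talk, and `Stage`, `PIPELINE`, `run_pipeline`, and the `approved` flag are hypothetical names made up for this example; they are not part of CodePipeline or of the open source solution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    actions: List[str]
    needs_approval: bool = False  # manual gate checked before the stage runs


# Stage and action names paraphrase the talk; they are illustrative only.
PIPELINE = [
    Stage("Source", ["pull code from GitHub"]),
    Stage("BuildAndTrain", ["lint templates with cfn_nag", "train model"]),
    Stage("QA", ["launch endpoint", "test endpoint"]),
    Stage("Production", ["launch endpoint", "test endpoint"], needs_approval=True),
]


def run_pipeline(stages: List[Stage], approved: bool) -> List[str]:
    """Return the actions that would run, in order, stopping at an unapproved gate."""
    executed = []
    for stage in stages:
        if stage.needs_approval and not approved:
            break  # the pipeline waits here until someone approves
        for action in stage.actions:
            executed.append(f"{stage.name}: {action}")
    return executed


if __name__ == "__main__":
    for step in run_pipeline(PIPELINE, approved=False):
        print(step)
```

Calling `run_pipeline(PIPELINE, approved=False)` stops after the QA actions, which mirrors the point in the talk that every change follows the same path and only the final promotion waits on a person.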

So let’s get into a demo. The first demo I’ll be showing is the manual actions that you would perform in creating your notebook instances, creating training jobs, and then creating your models and endpoints. I’ll go through this really fast, but it’ll give you a flavor for what you might do if you use the SageMaker console to create all these resources. The first thing we do is create the notebook instance. We’ll give it a name and an instance type, and then we can open it and use one of the SageMaker examples. In this case, we’re going to use the image classification algorithm, which is in the Elastic Container Registry. We create a copy of that, and now we’re going to open the Jupyter notebook, and this is the image classification example. You can see the data preparation, the parameters that you can use to configure the actual training of the models, how to make inferences, how to create the endpoint and its configuration, testing, and so forth. Then we can shut down that instance when we’re done using it. You can also apply lifecycle configurations to all your notebooks.

Next, you can create a training job. We’re going to download something from S3 and use that. This is both the training and the validation data that we’re putting into S3 so that it’s available, though that could be coming out of the notebook, for example. When creating the training job, we’re going to give it an IAM role. I’m going to choose one of the pre-built algorithms, in this case image classification. We’re going to configure the volume and also the stopping condition, basically 24 hours. Then we’re going to choose the instance type used for training, and we’re going to configure some of the hyperparameters, which is really tuning our algorithm and the model. We’re going to choose the content type, set up the training data, and point to the location where we’ve uploaded it. Now we’re going to add some validation data. We’re going to choose effectively the same options as we did for training, but it’s just a different set of data for validation. We’re done with this, and now we have training and validation data. We create the training job, and you can see the configuration, the hyperparameters, and so forth. You can also do tuning; that’s the HPO.

Now we’re going to create the model, which is really just the configuration of the model. We’re going to point to the location of the model that was generated by the training job, and then we’re going to create an endpoint configuration. Finally, we’re going to create the endpoint. By creating this endpoint, we can start to make inferences from the data and from the model. Okay, so now we can query the endpoint at any point in time and make inferences based on that data.
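The console steps above map fairly directly onto SageMaker API requests. The sketch below only builds the request dictionaries, so it runs without AWS credentials; with boto3 they could be passed as keyword arguments to the SageMaker client’s `create_training_job`, `create_model`, `create_endpoint_config`, and `create_endpoint` calls. The helper functions are hypothetical, and every name, ARN, bucket, image URI, instance type, and hyperparameter value is a placeholder assumption rather than a value taken from the demo.

```python
def training_job_request(job_name, role_arn, image_uri, bucket, epochs=2):
    """Build keyword arguments for a SageMaker CreateTrainingJob request."""

    def channel(name):
        # One input channel per data set: "train" and "validation".
        return {
            "ChannelName": name,
            "ContentType": "application/x-recordio",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{name}/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [channel("train"), channel("validation")],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.p2.xlarge",  # placeholder instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # The 24-hour stopping condition mentioned in the demo.
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 60 * 60},
        # The API takes hyperparameter values as strings. A real image
        # classification job needs more hyperparameters than shown here.
        "HyperParameters": {"epochs": str(epochs), "learning_rate": "0.01"},
    }


def hosting_requests(name, role_arn, image_uri, model_data_url):
    """Build the model, endpoint configuration, and endpoint requests, in order."""
    model = {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
    }
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InitialInstanceCount": 1,
                "InstanceType": "ml.m4.xlarge",  # placeholder instance type
            }
        ],
    }
    endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": endpoint_config["EndpointConfigName"],
    }
    return model, endpoint_config, endpoint
```

The ordering in `hosting_requests` reflects the dependency chain from the demo: the endpoint configuration refers to the model by name, and the endpoint refers to the endpoint configuration by name.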

Okay, so the last thing we’re going to see, now that we’ve gone through that manual process, is a fully automated implementation of this solution to enable continuous delivery. That means that every time someone makes a commit, it’s going to run through and train your model as part of the process. It’s also going to run some checks against the configuration code, and it’s going to test endpoints and do anything that you would want to do to verify and validate that your model is operating correctly; you could run queries against endpoints and so forth. We’re going to run through this pretty quickly as well, since this is an end-to-end solution for performing continuous delivery with Amazon SageMaker, but it should give you the gist of all the different components that make up the automation. The first thing we do is look at the CloudFormation code. This is the CodeBuild project where we’re running cfn_nag as part of a CodePipeline action. With CodeBuild, we also have a project for actually performing the model training; you can see it’s calling out to a Python script that performs the training job. Then we’re going to launch an endpoint and test that endpoint, so we have a Python script in which we’re testing. Then you’re looking at the actual pipeline itself. You can see the various stages and actions that are configured in the CloudFormation template, which creates the visualization that you saw before. Now we’re going to launch the CloudFormation stack. We’re doing this from the command line, but you could go to the console and run it there as well.

You can see that it’s running right now, and once it’s complete (it might take a few minutes), you can launch the pipeline itself. Now we have all the resources in place, and it’ll run through all these steps every single time. So we’re going to make a code change. In the training Python script, we’re going to change the number of epochs, which is really the number of times it’s going to go through the training data. Once we’ve changed the number of epochs, you’ll see the pipeline pick it up. It’s going to find the changes from GitHub, run cfn_nag via CodeBuild, and then use CodeBuild to run the training.py file, which is the program that we just looked at. That’s going to create the training job, download the data, and so on, and actually train the model. Then we’re going to launch the SageMaker endpoint through a CloudFormation deploy provider, and you can see that it creates the endpoint and the endpoint configuration, so we can query that endpoint. In the next action, we test the endpoint. We have a test.py file that verifies that an image matches another one, and it’s called via CodeBuild. You can see the test that we’re running right here; we’re just comparing two different images. Then we have an approval process, so someone needs to look at the QA environment, review it, and approve it. It’s still part of a fully automated workflow, so every time someone makes a change, it’s going to go through this process. Once they approve, it launches the endpoint in production and then tests that endpoint: the launch goes through CloudFormation, and the endpoint test gets invoked via CodeBuild.

That’s the end-to-end process. With a single CloudFormation command, it provisions all your resources: the CodeBuild projects, IAM, S3, and CodePipeline. Then, as CodePipeline gets invoked, it provisions the SageMaker resources you saw (the model, the endpoint configuration, and the endpoint), and it runs tests. It invokes CodeBuild to create training jobs, run tests, and run static analysis against your CloudFormation templates. So you can do anything that you’d do in a release process, and the key thing here is that it’s all running as part of the machine learning process.
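To make the endpoint test step more concrete, here is one way such a check could look. The built-in image classification algorithm returns a JSON list of class probabilities, so a test can take the most likely class and compare it with the expected one. `predicted_class` and `check_prediction` are hypothetical helpers and the sample response is made up; the project’s actual test.py may work differently. In a pipeline, the response body would come from calling the endpoint (for example with the boto3 `sagemaker-runtime` client’s `invoke_endpoint`), which is left out here so the sketch runs offline.

```python
import json


def predicted_class(response_body: str) -> int:
    """Return the index of the highest probability in a JSON list of class scores."""
    probabilities = json.loads(response_body)
    if not probabilities:
        raise ValueError("endpoint returned no class probabilities")
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def check_prediction(response_body: str, expected_class: int) -> bool:
    """True when the endpoint's top prediction matches the expected class."""
    return predicted_class(response_body) == expected_class


if __name__ == "__main__":
    # Made-up response standing in for the body returned by the endpoint.
    sample = json.dumps([0.05, 0.90, 0.05])
    print(check_prediction(sample, expected_class=1))
```

A check like this could run as a CodeBuild action after each endpoint launch, failing the stage when the prediction does not match.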

Okay, so finally, this is the URL for the open source implementation on GitHub: https://github.com/stelligent/sagemaker-pipeline. You can check out more on our blog at stelligent.com/blog. We have lots of other blog posts all around DevOps and continuous delivery on AWS, so I hope you enjoy this. If you have any questions, you can reach out to us on Twitter @Stelligent or go to our website and select Contact Us. So, thanks a lot for watching.

Acknowledgements

Thanks to Harlen Bains and Marcus Daly for their contributions on the open source solution referenced in this blog post and screencast and to Jessica Giordano for editing what Amazon Transcribe generated.
