Get started with CI/CD for ML projects
Concepts
Software engineers may be familiar with the concepts of continuous integration and continuous delivery. The basic flow consists of pushing code to a Git repository, which triggers a job that tests the code and builds the application in an automated way. One of the best-known open-source tools is Jenkins, but cloud providers also offer their own services, such as Cloud Build for GCP, or CodeBuild/CodePipeline for AWS. One of the main advantages of CI/CD is the automation of all the deployment tasks, which shortens software development iterations.
CI/CD for data science is becoming the norm. Deploying models to production is not easy, and DevOps engineers are bringing their expertise to ML teams to simplify the process. Many of the lessons learnt by software engineering teams can be re-used, with the difference that, in addition to testing code, ML teams also need to test data and evaluate models. A CI/CD workflow for data science could look like this:
In the following sections, we will set up a simple Continuous Integration pipeline that you can re-use across projects. As a first step, it will be very similar to a standard software engineering pipeline, and in future articles we will see how we can enhance it.
What’s next?
In the next sections, we set up a simple continuous integration flow. This covers the first steps of the deployment workflow shown above.
An ML pipeline is composed of a few components for data preparation, training, model validation, inference, etc. A popular pattern is to package each component in a container, as this makes results reproducible without depending on the hardware/OS it runs on. Whether you run your containers on premise or in the cloud, it shouldn't matter, as your code will always be executed in the same Docker environment. Today, most ML frameworks follow this approach.
We will start with a Git repository on GitHub. We will create two dummy components with a corresponding Dockerfile and some unit tests. As with any software product, your code deserves to be tested, and writing unit tests from the beginning is a good habit to have. Then, we will configure a trigger on Git push to start a build job, in both GCP and AWS. In a real-world setting, you may not need to deploy to multiple cloud providers, but this is just for demonstration.
Let’s get started
First, create your own GitHub repository. In this example, we use this repository.
Repository structure
Using notebooks to analyse data is very common, especially during the exploration phase. Adding them to your repository is important as it helps others to quickly understand your experiments. However, when building the end-to-end pipeline, the code must be re-organised to be “production grade”, including unit and integration tests. As a pipeline can have multiple steps (components), it’s a good practice to follow a common repository structure across your projects to make them easy to maintain and easy to deploy.
In this blog, and the following ones, we will follow a structure like this one:
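A sketch of such a layout (the exact files inside each folder are assumptions; see the description below):
components/
  dataprep/
    app/
    components.yaml
    Dockerfile
    requirements.txt
  modeltraining/
    app/
    components.yaml
    Dockerfile
    requirements.txt
deployment/
  aws/
  gcp/
cloudbuild.yaml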
In the “components” folder, we add all the components needed for the ML pipeline. Each component comes with its own testable code (app folder), a component definition (components.yaml — very similar to a Kubeflow component definition), a Dockerfile and a requirements.txt. In the “deployment” folder, we keep the scripts needed for the CI/CD pipeline. In this sample repo, we look at “building” our pipeline in both AWS and GCP, so you will see two subfolders, “aws” and “gcp”, in this deployment folder. For a real-world project, you probably don’t need both.
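A component’s Dockerfile can stay minimal. The following is only a sketch, assuming a Python 3.8 base image and a main.py sitting in the app folder; the actual Dockerfile in the repository may differ:
## Sketch only — not the exact Dockerfile from the demo repository
FROM python:3.8-slim
WORKDIR /component
## Install the component's dependencies first to benefit from layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
## Copy the component code so that main.py sits in the working directory
COPY app/ .
ENTRYPOINT ["python", "main.py"]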
Testing
Before setting up the CI pipeline, let’s confirm everything is running as expected. For each component, try to build the Docker image locally:
## In a component subfolder:
docker build . -t dataprepdemo
docker run --entrypoint=python dataprepdemo main.py --data-location s3://mylocation
Your Docker image should build successfully. You can also run the (very simple) unit tests using pytest:
pytest .
Now that everything is ready, let’s configure the pipeline.
GCP setup
Let’s go through the main steps with GCP before looking at AWS.
When we push changes to our GitHub repository, we want to trigger a Cloud Build job that tests our code and builds our component's Docker images. The first step is to configure this Cloud Build trigger. The best way to proceed is to follow the official documentation. Once it is configured correctly, you should see a new build start every time you push something to your repo.
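If you prefer the command line to the console, creating the trigger could look roughly like this (the trigger name, repository owner, repository name and branch pattern are placeholders, and the exact flags may vary with your gcloud version):
## Sketch: create a Cloud Build trigger for a GitHub repo (values are placeholders)
gcloud builds triggers create github \
  --name="mlops-ci-demo" \
  --repo-owner="YOUR_GITHUB_USERNAME" \
  --repo-name="YOUR_REPO_NAME" \
  --branch-pattern=".*" \
  --build-config="cloudbuild.yaml"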
First pipeline
In the root folder of the repo, there is a file cloudbuild.yaml. This file describes the steps that Cloud Build must execute. Let’s keep our first version simple for now.
We have two components, and when we push to our repo we want to execute the unit tests (1 step each = 2 steps) in a Python environment, then build the Docker images (2 steps) and push them to GCR (2 steps). Our first cloudbuild.yaml looks like this:
steps:
- name: 'python:3.8-slim'
  entrypoint: /bin/sh
  args:
  - -c
  - 'cd components/dataprep/ && pip install -r requirements.txt && pip install pytest pytest-cov && python -m pytest --cov app/'
- name: 'python:3.8-slim'
  entrypoint: /bin/sh
  args:
  - -c
  - 'cd components/modeltraining/ && pip install -r requirements.txt && pip install pytest pytest-cov && python -m pytest --cov app/'
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/$REPO_NAME/dataprep', 'components/dataprep/']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/$REPO_NAME/dataprep']
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/$REPO_NAME/modeltraining', 'components/modeltraining/']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/$REPO_NAME/modeltraining']
Enhancing the pipeline
We have 2 components and our Cloud Build flow already has 6 steps. With 10 components, the build would have 30 steps and would become hard to maintain. Notice that all the commands are very similar and only the component name changes. Since all components live under the same folder, we can automate the build by writing two bash scripts (one for pytest and one for the Docker images) that iterate over our components and execute the commands we want.
steps:
- name: 'python:3.8-slim'
  entrypoint: /bin/sh
  args:
  - deployment/gcp/run-tests.sh
- name: 'gcr.io/cloud-builders/docker'
  entrypoint: /bin/bash
  args:
  - deployment/gcp/build-dockers.sh
  env:
  - 'BRANCH_NAME=$BRANCH_NAME'
  - 'PROJECT_ID=$PROJECT_ID'
  - 'SHORT_SHA=$SHORT_SHA'
  - 'REPO_NAME=$REPO_NAME'
The same cloudbuild definition can now be used by all projects that follow the same folder structure, regardless of the number of components. Note that we kept the same images, python:3.8-slim and docker.
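The run-tests.sh script lives in the repository; as a rough sketch (not the exact script — the loop and variable names are assumptions based on the folder structure above), it could look like this:
#!/bin/sh
## Sketch: run the unit tests of every component under components/
set -e
cd components
for folder in */; do
  component="${folder%?}"   # strip the trailing slash
  echo "Running tests for ${component}"
  (
    cd "${component}"
    pip install -r requirements.txt
    pip install pytest pytest-cov
    python -m pytest --cov app/
  )
done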
If you go through the bash scripts in deployment/gcp/, you will notice that we add two tags to our Docker images before pushing them to GCR:
## Strip the trailing slash from the folder name to get the component name
image_name="${folder%?}"
## Tag the image with both the short commit SHA and a "latest" tag scoped to the branch
docker build -t gcr.io/$PROJECT_ID/demo-mlops/$image_name:$BRANCH_NAME-$SHORT_SHA -t gcr.io/$PROJECT_ID/demo-mlops/$image_name:$BRANCH_NAME-latest .
This is to group the images by branch. With the previous command, we were pushing our “component_name” image to gcr.io/PROJECT_ID/demo-mlops/component_name:latest, and every new build would automatically move the “latest” tag. We need a better way to manage the tags, to avoid a pipeline in production picking up the new latest image without proper testing. Adding the branch name to the tag ensures that the production pipeline only uses an image that has been promoted (tested and reviewed) to the main branch — for example, it would pin gcr.io/PROJECT_ID/demo-mlops/dataprep:main-latest (assuming main is the production branch), while feature branches get their own tags.
One more thing
We have achieved our initial goal: if we follow the folder structure we have defined, we are able to test and build our pipeline components. There is one more thing I would like to add to this first CI pipeline. To know the code coverage or which Docker images have been built, we currently have to dig through the Cloud Build logs. Let’s change that and save these reports to Google Cloud Storage.
## Create your bucket.
gsutil mb -l LOCATION gs://PROJECT_ID-mlops-deployments
Let’s save the artifacts under cloudbuild/REPO_NAME/BRANCH_NAME/DATE/BUILD_ID/ by adding an artifacts section to our cloudbuild.yaml.
artifacts:
  objects:
    location: 'gs://$PROJECT_ID-mlops-deployments/cloudbuild/$REPO_NAME/$BRANCH_NAME/$(date +%Y-%m-%d-%H-%M-%S)/$BUILD_ID/'
    paths: ['_artifacts/*']
Finally, we update the bash scripts to save all the artifacts in the “_artifacts/” folder. Push the changes, and there we are: the artifacts are uploaded and we can easily retrieve them.
Artifacts will be uploaded to gs://$PROJECT_ID-mlops-deployments using gsutil cp
_artifacts/*: Uploading path....
Copying file://_artifacts/dataprep-coverage.json [Content-Type=application/json]...
Copying file://_artifacts/images.txt [Content-Type=text/plain]...
Copying file://_artifacts/modeltraining-coverage.json [Content-Type=application/json]...
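For reference, here is a rough idea of how the scripts can produce these files; the report names and pytest-cov flags below are assumptions chosen to match the listing above, not the exact lines from the repository:
## At the start of the scripts, make sure the artifacts folder exists at the workspace root
mkdir -p _artifacts
## In run-tests.sh, executed from each component folder: write a JSON coverage report into _artifacts/
python -m pytest --cov app/ --cov-report=json:"../../_artifacts/${component}-coverage.json"
## In build-dockers.sh: keep a record of every image that was pushed
echo "gcr.io/$PROJECT_ID/demo-mlops/$image_name:$BRANCH_NAME-$SHORT_SHA" >> ../../_artifacts/images.txt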
AWS setup
For AWS, we will follow the same build logic as for GCP. However, as is often the case with AWS, it requires a bit more setup work.
CloudFormation
We use a CloudFormation template to set up the CodeBuild job triggered on push events. The template is in deployment/aws/.
Assumption: you have already configured the CodeBuild <> GitHub integration. If not, follow this section.
CodeBuild
The CodeBuild definition is in the repository. I will not go through everything here, but feel free to have a look at it and at the IAM permissions. We use three environment variables: the account ID (to build the ECR repository base URL), the project name (i.e. the repo name), and the S3 bucket where we store our artifacts. We define a trigger on PUSH events, with GitHub as the source.
BuildSpec: |
  version: 0.2
  phases:
    install:
      runtime-versions:
        python: 3.8
      commands:
        - pip3 install pipenv
        - . ./deployment/aws/export-branch.sh
    build:
      commands:
        - mkdir _artifacts && mkdir _reports
        - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
        - /bin/bash deployment/aws/run-tests.sh
        - /bin/bash deployment/aws/build-dockers.sh
        - aws s3 cp _artifacts/ s3://$DEPLOYMENT_BUCKET/codebuild/$PROJECT_NAME/$BRANCH_NAME/$(date +%Y-%m-%d_%H:%M:%S)/$CODEBUILD_BUILD_ID/ --recursive
The bash scripts for AWS follow the same logic as the ones for GCP, with only minor changes, mainly to push the images to AWS ECR.
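As a sketch of that difference (the repository naming below is an assumption, and unlike GCR, an ECR repository must exist before you can push to it):
## Sketch: create the ECR repository if needed, then build and push (naming is an assumption)
aws ecr describe-repositories --repository-names "$PROJECT_NAME/$image_name" \
  || aws ecr create-repository --repository-name "$PROJECT_NAME/$image_name"
docker build -t $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$PROJECT_NAME/$image_name:$BRANCH_NAME-latest .
docker push $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$PROJECT_NAME/$image_name:$BRANCH_NAME-latest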
Note the difference with GCP? We have chosen to define the buildspec inline in the CloudFormation template. This is better because we don't have to duplicate the same buildspec across all our ML repositories. Another advantage is that if the DevOps team wants to update the flow, it can do so without modifying the source repository. We also upload the bash scripts to S3 and download them when the job is triggered. Again, this gives the team the flexibility to modify the CI pipeline and apply the changes to all the repositories at once (ultimately, these scripts should be version controlled as well!). It also enables the team to have different scripts for different branches (dev, stage, prod, etc.).
GCP also supports inline cloud build, so feel free to follow the same approach!
Deploy the CloudFormation template.
The parameters should be:
- RepositoryName: Name of your GitHub repository
- RepoCloneUrl: HTTPS Clone URL of your repository
- DeploymentBucket: Bucket used to store the deployment scripts. If you don’t have any, deploy the CloudFormation template bucket.yaml first.
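Deploying the template from the CLI could look roughly like this (the template file name, stack name and parameter values are placeholders):
aws cloudformation deploy \
  --template-file deployment/aws/codebuild.yaml \
  --stack-name my-ml-repo-ci \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides \
      RepositoryName=my-ml-repo \
      RepoCloneUrl=https://github.com/YOUR_USER/my-ml-repo.git \
      DeploymentBucket=my-deployment-bucket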
Once deployed, you should see that a new CodeBuild job has been created. Push a change to your repository and a new build should be triggered.
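If you want to check from the CLI rather than the console, something like this lists the builds of the project (the project name is a placeholder):
aws codebuild list-builds-for-project --project-name my-ml-repo-ci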
Conclusion
We have seen how to create a simple Continuous Integration pipeline for your ML projects. We used GCP Cloud Build and AWS CodeBuild, but there are other tools and products that you can explore. The demo repository is just an example to get you started, and you can enhance it in many ways depending on your own requirements. Keep in mind this covers only CI; in a future article, we will see how to include CD.