Gradual Deployments with Serverless

A standard release practice at Inflection is to launch dark. Dark launches decouple software releases from feature releases, which gives us more control over the user experience. For example, we often release large features to production and make them available only to Inflection employees. Once employees have used a feature without problems, we offer it to select clients and then quickly (typically in less than a day) ramp it up to all customers.

As we moved to Serverless, we had to rethink this capability. A regular deployment of a Serverless application is an all-at-once process: as soon as a new version of any of our functions is released, all traffic is routed to it. This is not what we want. Just as with non-serverless releases, we want to release with confidence and launch dark.

AWS has a feature that can make our deployment process much more reliable and secure: traffic shifting using aliases.

With alias traffic shifting, version weights can be specified on an alias so that only a certain share of traffic is routed to the new version. The Lambda alias then automatically splits requests between the two versions it points to, allowing us to check how the new release behaves before it completely replaces the previous one and minimizing the impact of a possible bug.
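As a rough illustration of what a weighted alias does, the Python sketch below builds the arguments one could pass to Lambda's UpdateAlias API (for example through boto3's `update_alias`) and simulates how requests would split. The helper names, the `hello` function, and the version numbers `1` and `2` are made up for the example, and the simulation is a toy model rather than Lambda's actual routing logic.

```python
import random
from collections import Counter


def alias_update_kwargs(function_name, alias, stable_version, new_version, new_weight):
    """Build keyword arguments for Lambda's UpdateAlias call.

    The alias keeps pointing at stable_version; new_weight (0.0 to 1.0) is the
    share of invocations routed to new_version instead.
    """
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0.0 and 1.0")
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": {new_version: new_weight}},
    }


def pick_version(stable_version, new_version, new_weight, rng=random):
    """Toy model of weighted routing: choose a version for a single request."""
    return new_version if rng.random() < new_weight else stable_version


if __name__ == "__main__":
    kwargs = alias_update_kwargs("hello", "Live", "1", "2", 0.1)
    print(kwargs)
    # With boto3 installed and credentials configured, this could be applied with:
    #   boto3.client("lambda").update_alias(**kwargs)
    rng = random.Random(42)
    print(Counter(pick_version("1", "2", 0.1, rng) for _ in range(10_000)))
```

Changing `new_weight` by hand like this is what CodeDeploy automates in the next section.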

[Figure: Versioning]

Automating the deployment process

With CodeDeploy we can specify how we want traffic to be shifted over time and it automatically adjusts the weights. There are three different types of deployment preferences:

  • Canary: traffic is shifted in two increments. You can choose from predefined canary options. The options specify the percentage of traffic that’s shifted to your updated Lambda function version in the first increment, and the interval, in minutes, before the remaining traffic is shifted in the second increment.
  • Linear: traffic is shifted in equal increments with an equal number of minutes between each increment. You can choose from predefined linear options that specify the percentage of traffic that’s shifted in each increment and the number of minutes between each increment.
  • All-at-once: all traffic is shifted from the original Lambda function to the updated Lambda function version at once.
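To make the difference between the three preferences concrete, here is a small Python sketch that turns a deployment type name into a timeline of how much traffic the new version receives. The `traffic_schedule` helper is hypothetical, and it assumes the first increment is applied at minute 0; treat it as an approximation of the behaviour described above, not as CodeDeploy's implementation.

```python
import re

# Matches names such as Canary10Percent5Minutes or Linear10PercentEvery1Minute.
_NAME = re.compile(r"^(Canary|Linear)(\d+)Percent(?:Every)?(\d+)Minutes?$")


def traffic_schedule(deployment_type):
    """Return [(minute, percent_on_new_version), ...] for a deployment type name."""
    if deployment_type == "AllAtOnce":
        return [(0, 100)]
    match = _NAME.match(deployment_type)
    if match is None:
        raise ValueError(f"unrecognized deployment type: {deployment_type}")
    kind = match.group(1)
    percent, minutes = int(match.group(2)), int(match.group(3))
    if not 0 < percent <= 100:
        raise ValueError("percentage must be between 1 and 100")
    if kind == "Canary":
        # Two increments: the canary share first, everything else after the interval.
        return [(0, percent), (minutes, 100)]
    # Linear: equal increments, equally spaced, capped at 100%.
    schedule, shifted, minute = [], 0, 0
    while shifted < 100:
        shifted = min(100, shifted + percent)
        schedule.append((minute, shifted))
        minute += minutes
    return schedule


if __name__ == "__main__":
    for name in ("Canary10Percent5Minutes", "Linear10PercentEvery1Minute", "AllAtOnce"):
        print(name, traffic_schedule(name))
```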

This way we let CodeDeploy do all the heavy lifting in the deployment process and change alias weights according to our preferences. Traffic will be shifted gradually and, in the meantime, we can check if our new function is behaving correctly and cancel the deployment if we see anything weird. How could we automate the roll-back process, so that we don’t have to manually check how the system is performing? CodeDeploy has thought about that as well.

Rolling back

CodeDeploy allows us to configure pre-traffic and post-traffic hooks, which are Lambda functions triggered before and after the traffic shifting process. They are suited to tasks such as running integration or smoke tests. CodeDeploy expects to be notified of the success or failure of each hook within one hour; otherwise, it assumes the hook failed. If a hook fails, either by not calling CodeDeploy or by explicitly reporting a failure, the deployment is aborted and all traffic is shifted back to the old function version.
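A hook along these lines could look like the Python sketch below. It assumes the hook is invoked with an event carrying `DeploymentId` and `LifecycleEventHookExecutionId`, and reports back through boto3's `put_lifecycle_event_hook_execution_status`. The `run_smoke_tests` function is a hypothetical placeholder for whatever checks you want to run, and the client is passed in as a parameter so the reporting logic can be exercised without AWS access.

```python
def run_smoke_tests():
    """Hypothetical placeholder: replace with real checks against the new version."""
    return True


def report_hook_result(codedeploy, event, checks_passed):
    """Tell CodeDeploy whether the hook succeeded, so it can continue or roll back."""
    status = "Succeeded" if checks_passed else "Failed"
    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )
    return status


def pre_traffic_hook(event, context):
    """Lambda entry point for the pre-traffic hook."""
    import boto3  # imported here so the helpers above can be tested without boto3

    try:
        checks_passed = run_smoke_tests()
    except Exception:
        # Any unexpected error counts as a failed check rather than a silent timeout.
        checks_passed = False
    return report_hook_result(boto3.client("codedeploy"), event, checks_passed)
```

Reporting a failure explicitly, instead of letting the function crash, means CodeDeploy can roll back right away rather than waiting for the timeout.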

Hooks are not the only way to check that our function is behaving as expected: we can also provide CodeDeploy with a list of CloudWatch alarms to monitor during the deployment. As the traffic shifting begins, CodeDeploy tracks those alarms, canceling the deployment and rolling back to the previous function version if any of them is triggered.

[Figure: Deployment process with CodeDeploy and Lambda weighted alias]

Hooks and alarms allow us to monitor the whole deployment process. We can perform some tests before routing any traffic to the new function version, track alarms during the traffic shifting process, and run more tests right after all the traffic is hitting the new version. If CodeDeploy notices something wrong in any of those steps, it will automatically roll back to the old, stable function version.
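The overall flow can be summarised with a toy model in Python. Everything here is illustrative: `simulate_deployment` and its parameters are invented for the example, and the real orchestration happens inside CodeDeploy.

```python
def simulate_deployment(traffic_steps, pre_hook, post_hook, alarm_triggered):
    """Toy model of a gradual deployment guarded by hooks and alarms.

    traffic_steps: percentages routed to the new version, e.g. [10, 20, ..., 100]
    pre_hook, post_hook: callables returning True on success
    alarm_triggered: callable taking the current percentage, True if an alarm fires
    Returns (outcome, percent_on_new_version).
    """
    if not pre_hook():
        return ("rolled_back", 0)  # no traffic ever reached the new version
    for percent in traffic_steps:
        if alarm_triggered(percent):
            return ("rolled_back", 0)  # all traffic returns to the old version
    if not post_hook():
        return ("rolled_back", 0)
    return ("succeeded", 100)


if __name__ == "__main__":
    steps = list(range(10, 101, 10))
    healthy = simulate_deployment(steps, lambda: True, lambda: True, lambda p: False)
    faulty = simulate_deployment(steps, lambda: True, lambda: True, lambda p: p >= 30)
    print(healthy, faulty)
```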

Implementing in the Serverless.com Framework

Integrating CodeDeploy gradual deployment with the Serverless.com framework can be done using the serverless-plugin-canary-deployments plugin. In addition, the plugin serverless-plugin-aws-alerts can be used to set up custom alarms that will trigger rollback and send notifications to our email as well.

Below is an example Serverless.com template:

service: sls-canary-example

provider:
  name: aws
  runtime: dotnetcore3.1

plugins:
  - serverless-plugin-aws-alerts
  - serverless-plugin-canary-deployments

custom:
  alerts:
    dashboards: true

functions:
  hello:
    handler: handler.hello
    events:
      - http: get hello
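    alarms:
      # Defined through serverless-plugin-aws-alerts
      - name: foo
        namespace: 'AWS/Lambda'
        metric: Errors
        threshold: 1
        statistic: Minimum
        period: 60 # seconds
        evaluationPeriods: 1
        comparisonOperator: GreaterThanOrEqualToThreshold
    deploymentSettings:
      # Shift 10% of traffic to the new version every minute
      type: Linear10PercentEvery1Minute
      alias: Live
      alarms:
        # Roll back if this alarm is triggered during the deployment
        - HelloFooAlarm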

Some of these sections deserve an explanation:

Alarms: These are CloudWatch alarms that CodeDeploy watches during the deployment; if any of them is triggered, the deployment is rolled back automatically. For example, an alarm might fire because the updated code is raising errors in the application, or because an AWS Lambda or custom CloudWatch metric you specified has breached its threshold. Alarms are configured through the serverless-plugin-aws-alerts plugin.
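The choice of `statistic` affects how sensitive an alarm is. The Python sketch below is a simplified model of a single evaluation period, assuming the `Errors` metric is recorded as one datapoint per invocation (1 for a failure, 0 for a success). The `alarm_breached` helper is hypothetical and ignores details such as multiple evaluation periods and missing-data handling.

```python
import operator

_STATISTICS = {"Minimum": min, "Maximum": max, "Sum": sum}
_COMPARISONS = {
    "GreaterThanOrEqualToThreshold": operator.ge,
    "GreaterThanThreshold": operator.gt,
}


def alarm_breached(datapoints, statistic, comparison_operator, threshold):
    """Evaluate one period: aggregate the datapoints, then compare with the threshold."""
    if not datapoints:
        return False  # assumption: a period without data is treated as not breaching
    value = _STATISTICS[statistic](datapoints)
    return _COMPARISONS[comparison_operator](value, threshold)


if __name__ == "__main__":
    one_failure_in_four = [0, 0, 1, 0]
    for statistic in ("Minimum", "Sum"):
        breached = alarm_breached(
            one_failure_in_four, statistic, "GreaterThanOrEqualToThreshold", 1
        )
        print(statistic, breached)
```

Under that assumption, `Minimum` with a threshold of 1 breaches only when every invocation in the period fails, whereas `Sum` breaches on the first error. Which one is appropriate depends on how aggressively you want deployments to roll back.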

Hooks: These are pre-traffic and post-traffic test functions that run sanity checks before traffic shifting starts to the new version, and after traffic shifting completes.

  • PreTraffic: Before traffic shifting starts, CodeDeploy invokes the pre-traffic hook Lambda function. This Lambda function must call back to CodeDeploy and indicate success or failure. If the hook fails, CodeDeploy aborts the deployment and reports a failure back to AWS CloudFormation. If the hook succeeds, CodeDeploy proceeds to traffic shifting.
  • PostTraffic: After traffic shifting completes, CodeDeploy invokes the post-traffic hook Lambda function. This is similar to the pre-traffic hook, where the function must call back to CodeDeploy to report success or failure. Use post-traffic hooks to run integration tests or other validation actions.

Deployment Types: CodeDeploy comes with a set of predefined deployment types. For an updated list, refer to the AWS documentation. Below is a summary.

  • Canary: a specific amount of traffic is shifted to the new version during a certain period of time. When that time elapses, all the traffic goes to the new version.
  • Linear: traffic is shifted to the new version incrementally in intervals until it gets all of it.
  • All-at-once: the least gradual of all of them, all the traffic is shifted to the new version straight away.

Example CodeDeploy Options:

  • Canary10Percent30Minutes
  • Canary10Percent5Minutes
  • Canary10Percent10Minutes
  • Canary10Percent15Minutes
  • Linear10PercentEvery10Minutes
  • Linear10PercentEvery1Minute
  • Linear10PercentEvery2Minutes
  • Linear10PercentEvery3Minutes
  • AllAtOnce

IAM permissions for deployer

The following IAM permissions are required for the serverless deployer.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:GetRole",
        "iam:CreateRole",
        "iam:PutRolePolicy",
        "iam:DeleteRolePolicy",
        "iam:DeleteRole",
        "iam:AttachRolePolicy",
        "iam:DetachRolePolicy"
      ],
      "Resource": [
        "arn:aws:iam::*:role/*CodeDeployServiceRole*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "codedeploy:*",
      "Resource": "*"
    }
  ]
}
