Improved bluegreen deployments

Over the past few weeks, we’ve shipped a number of changes that should make bluegreen deployments much more consistent and reliable for everyone.

The most common issues we were seeing were:

  1. The number of machines in an app increasing exponentially on subsequent deploys, because machines from previous failed deployments were still hanging around.
  2. Intermittent downtime during deployments, caused by slow propagation of data across our fleet.
  3. Failed automatic rollbacks, which could end up nuking all of an app’s machines.

We ran into these issues ourselves, once ending up with 700 extra virtual machines because flyctl failed to acquire machine leases and properly roll back failed deployments!

Here’s a log of all the changes we’ve made.

  1. We made deployments extra fast by adding new concurrency controls.
  2. We ease out old machines slowly: traffic stops being forwarded to them, and they shut down gracefully while they wind down.
  3. When a deployment fails, we roll back safely by deleting all the new machines and leaving your old machines untouched.
  4. We now detect if your app is running different image versions (e.g. due to a previously failed deployment) and give you detailed instructions for resolving it manually (see the help message below).
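
To make items 2 and 3 concrete, here is a rough sketch of the cutover and rollback behaviour described above. It is an in-memory toy model, not flyctl's actual code: the Machine class, its fields, and the bluegreen_deploy and is_healthy names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Machine:
    # Hypothetical model of a machine; not a real flyctl or Machines API type.
    id: str
    image: str
    state: str = "started"        # "started", "stopped" or "destroyed"
    accepts_traffic: bool = True


def bluegreen_deploy(
    blue: List[Machine],
    new_image: str,
    is_healthy: Callable[[Machine], bool],
) -> List[Machine]:
    """Return the machines serving the app after the deploy attempt."""
    # Create one green machine per blue machine. Green takes no traffic yet.
    green = [
        Machine(id=f"{m.id}-green", image=new_image, accepts_traffic=False)
        for m in blue
    ]

    if not all(is_healthy(m) for m in green):
        # Rollback: destroy only the new machines, leave blue untouched.
        for m in green:
            m.state = "destroyed"
        return blue

    # Cutover: send traffic to green, then wind blue down.
    for m in green:
        m.accepts_traffic = True
    for m in blue:
        m.accepts_traffic = False   # stop forwarding traffic first
        m.state = "stopped"         # then shut down gracefully
    return green
```

The point of the sketch is that the failure path never touches the old (blue) machines, so a failed deploy leaves the app exactly as it was.
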
Watch your deployment at https://fly.io/apps/cherrypicker-yvonne-90/monitoring

Updating existing machines in 'cherrypicker-yvonne-90' with bluegreen strategy

Verifying if app can be safely deployed
  Found 2 different images in your app (for bluegreen to work, all machines need to run a single image)
    [x] cherrypicker-yvonne-90: deployment-01HVPFF2C2Y5ZES3BR7F8GYVW3 - 4 machines (1857507c6540d8, 48ed67dc0e7528,2865135b54ed38,7811545a535268) 
    [x] cherrypicker-yvonne-90: deployment-01HVPF1P5933VRBAGSGCN9K06V 4 machines (1857e3ef101798, 48ed67dc0e75e8, 683d47dc547358, 683d47dc547368) 

  Here's how to fix your app so deployments can go through:
    1. Find all the unwanted image versions from the list above.
    2. For each old image version, run 'fly machines destroy --image=<insert-image-version>'
    3. Retry the deployment with 'fly deploy'

Deployment failed after error: found multiple image versions 

Error: found multiple image versions 
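
The pre-deploy check in the output above boils down to grouping machines by image and refusing to continue if there is more than one group. A minimal sketch of that idea follows. This is an assumption about the general shape of the check, not flyctl's implementation; group_by_image, verify_single_image, and MultipleImageVersionsError are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class MultipleImageVersionsError(Exception):
    """Raised when an app's machines run more than one image (hypothetical)."""


def group_by_image(machines: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    # machines is an iterable of (machine_id, image) pairs.
    groups: Dict[str, List[str]] = defaultdict(list)
    for machine_id, image in machines:
        groups[image].append(machine_id)
    return dict(groups)


def verify_single_image(machines: Iterable[Tuple[str, str]]) -> None:
    groups = group_by_image(machines)
    if len(groups) > 1:
        # Build a report listing each image and the machines running it.
        lines = [f"Found {len(groups)} different images in your app"]
        for image, ids in groups.items():
            lines.append(f"  {image} - {len(ids)} machines ({', '.join(ids)})")
        raise MultipleImageVersionsError("\n".join(lines))
```

Calling verify_single_image before creating any new machines means a deploy on top of a half-finished earlier one fails early instead of piling up more machines.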

Update flyctl to the latest version and run fly deploy --strategy=bluegreen to see these changes in action. Let us know if you run into any new issues :rocket:


It was a pretty serious issue with our automated deployment workflow. Nice to see it fixed!
