Finally, BlueGreen Deployments for AppsV2! 🚀

We just added a bluegreen deployment strategy for Apps V2 and I’m here to tell you all about it.

:gear: How does it work? (high level)

Assuming your app has 5 machines & you initiate a deployment (there’s a rough code sketch of this flow right after the list):

  1. We boot up 5 new (green) machines running your new release.
  2. Once the green machines have started, we run health checks to ensure they’re ready for production traffic.
  3. If all health checks pass, we then mark the green machines as ready for traffic and tear down the previous (blue) machines.
  4. Profit?
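
If you’d rather read code than a list, here’s a rough Go sketch of that flow. It’s not flyctl’s actual implementation (the machine and health-check helpers, and the 5-minute timeout, are made-up stand-ins), but it shows the shape of the happy path:

package main

import (
    "errors"
    "fmt"
    "time"
)

type Machine struct{ ID string }

// Illustrative stubs, not flyctl internals.
func bootGreenMachines(n int) []Machine {
    greens := make([]Machine, n)
    for i := range greens {
        greens[i] = Machine{ID: fmt.Sprintf("green-%d", i)}
    }
    return greens
}

func allChecksPassing(machines []Machine) bool { return true } // pretend every check passes

func waitForHealthy(machines []Machine, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for !allChecksPassing(machines) {
        if time.Now().After(deadline) {
            return errors.New("green machines never became healthy")
        }
        time.Sleep(time.Second)
    }
    return nil
}

func main() {
    blues := []Machine{{"blue-0"}, {"blue-1"}, {"blue-2"}, {"blue-3"}, {"blue-4"}}

    // 1. Boot one green machine per existing blue machine, running the new release.
    greens := bootGreenMachines(len(blues))

    // 2. Run health checks until every green machine is ready for production traffic.
    if err := waitForHealthy(greens, 5*time.Minute); err != nil {
        fmt.Println("deployment failed:", err) // the blue machines are left untouched
        return
    }

    // 3. Mark the greens as ready for traffic, then tear down the blues.
    fmt.Printf("promoting %d green machines, destroying %d blue machines\n", len(greens), len(blues))

    // 4. Profit.
}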

:question: How do I use it?

It’s easy, but first, you need to update your flyctl to the latest version.

After that, you can either set bluegreen as the deployment strategy in your fly.toml:

[deploy]
  strategy = "bluegreen"

Or, you can just pass it as an argument to your fly deploy command:

fly deploy --strategy=bluegreen

:wrench: What’s actually happening? (for the curious)

We made 0 changes to our proxy for this feature; the proxy has no idea which machine is green or blue. Instead, we’re using 2 new machine APIs we recently added to hide/unhide machines from the proxy. Flyctl, our chief orchestrator, only checks health and toggles switches on or off.
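
If you’re curious what that toggle looks like as a plain HTTP call, here’s a hedged Go sketch against the Machines API. The cordon/uncordon paths, the FLY_API_TOKEN env var, and the app/machine names below are placeholders I’m using for illustration, not a guarantee of the real endpoint names, so check the Machines API docs before relying on them:

package main

import (
    "fmt"
    "net/http"
    "os"
)

// setProxyVisibility hides or unhides a single machine from the proxy.
// The "cordon"/"uncordon" paths stand in for the two new machine APIs
// mentioned above; the real routes may differ.
func setProxyVisibility(app, machineID string, visible bool) error {
    action := "cordon" // hide from the proxy
    if visible {
        action = "uncordon" // expose to the proxy
    }
    url := fmt.Sprintf("https://api.machines.dev/v1/apps/%s/machines/%s/%s", app, machineID, action)

    req, err := http.NewRequest(http.MethodPost, url, nil)
    if err != nil {
        return err
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("%s %s: unexpected status %d", action, machineID, resp.StatusCode)
    }
    return nil
}

func main() {
    // Example: expose a green machine to the proxy once its checks pass.
    if err := setProxyVisibility("my-app", "my-machine-id", true); err != nil {
        fmt.Println("toggle failed:", err)
    }
}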

You’re probably wondering how we roll back safely; I’ve got details for you too. We keep your blue deployment alive till we have 100% confidence in the green deployment. If for any reason we’re unable to get the green machines into a healthy state, we undo everything we’ve done and bail out. This means your old deployment stays intact & untouched. It also means there’s a brief period where both deployments can serve traffic, but it’s very brief.

And so, till the green machines are healthy, they’ll remain hidden from the proxy. Once they’re healthy, we put them in the spotlight. If something goes wrong, we clean up and bail out. Nice & easy.
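
In code terms, the bail-out decision looks roughly like this (again, made-up stand-ins rather than flyctl’s real internals):

package main

import (
    "errors"
    "fmt"
)

// Illustrative stand-ins; in real life these would hit the Machines API.
func greensHealthy() bool  { return false } // pretend a health check failed
func destroyGreens()       { fmt.Println("destroying hidden green machines") }
func exposeGreensToProxy() { fmt.Println("green machines now serving traffic") }
func destroyBlues()        { fmt.Println("destroying blue machines") }

func finishDeploy() error {
    if !greensHealthy() {
        // Bail out: the greens were never visible to the proxy,
        // so the blue deployment keeps serving, intact and untouched.
        destroyGreens()
        return errors.New("could not get all green machines to be healthy")
    }
    // Brief window where both colours can serve traffic.
    exposeGreensToProxy()
    destroyBlues()
    return nil
}

func main() {
    if err := finishDeploy(); err != nil {
        fmt.Println("deployment failed after error:", err)
    }
}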

:scroll: Notes

  • It’s quite fast, so if you have a large deployment, you should absolutely try it out!
  • All the orchestration is conducted entirely by flyctl. You know it, you love it!
  • It’s an all-or-nothing strategy. If any of the steps fail, flyctl will roll back safely to your previous release.
  • You may see extra health checks on your dashboard during a deployment. Once the deployment completes, we’ll clean them up. There’s ongoing work to remove this behaviour.

So, update your flyctl, try it out, and let us know what you think. We are always listening!

24 Likes

This is huge, thank you!

After trying it a few times, I noticed that one or more of the health checks seem to get stuck on unchecked and eventually error out:

Waiting for all green machines to be healthy
  Machine ... [app] - 1/1 passing
  Machine ... [app] - unchecked
  Machine ... [app] - 1/1 passing
Deployment failed after error: could not get all green machines to be healthy: wait for goroutine timeout

Hi, thanks for releasing this.

How does it work with volumes?

Let’s say I have one machine, and I’ve created two volumes, with the intention that one is attached to the existing machine and the other is used for the green machine during deployment.

Unfortunately it doesn’t seem to work that way. The deployment simply fails with an error about the volume being attached to the existing machine, despite there being an unattached volume available. And yes, both volumes were created using the same volume name.

Any ideas?

1 Like

This is awesome!

1 Like

We’re consistently seeing this same issue, where one of the machines (in our case, out of 9) is stuck in unchecked. Our most recent run was the same, except one of the machines was stuck in “created” while the others had all started.

Deployment failed after error: could not get all green machines into started state: timeout reached waiting for machine to started failed to wait for VM e784459b421d58 in started state: Get "https://api.machines.dev/v1/apps/data-app/machines/e784459b421d58/wait?instance_id=01H43W6E3DF6GYJ8206BQSDEP6&state=started&timeout=60": net/http: request canceled

Follow-up: the logs for the “unchecked” machine indicate that it passed its health check.

@doncote @2q31 looking into it now!

This bug should be fixed in Release v0.1.44 · superfly/flyctl · GitHub.
Update your flyctl & retry! (fyi @doncote @2q31)

Thanks for all the extra details @doncote :raised_hands:

2 Likes

Thanks for the heads up. Second attempt was successful. Woot!

First attempt produced the following error. This is from the CLI with 0.1.44.

Waiting for all green machines to be healthy
  Machine 17811092a20089 [app] - unchecked
  Machine 1781109ec09189 [app] - 0/1 passing
  Machine 3d8d9e31f05689 [app] - 0/2 passing
  Machine 3d8ddd3ced9068 [app] - 2/2 passing
  Machine 4d891742f44928 [app] - 1/2 passing
  Machine 4d891747c405d8 [app] - 0/2 passing
  Machine 6e82dd77f67978 [app] - 2/2 passing
  Machine 784e2d1b446948 [app] - 2/2 passing
  Machine e286065f30e986 [app] - 2/2 passing
Deployment failed after error: could not get all green machines to be healthy: failed to get VM 17811092a20089: resource_exhausted: rate limit exceeded

Should be fixed in the next flyctl release!

1 Like

It doesn’t work with volumes yet; we’ll need to think through the UX for this! Will update this thread as we progress.

1 Like

deployed another half dozen times on multiple apps using bluegreen and things are looking good. no issues and super super fast. love that all of the machines are created/started/healthcheck’d in parallel. thx @kwaw and team!

4 Likes

does this work with http service checks? seeing an error in my deploy

@catgirl yes, it works with http_service checks. What you ran into was a unique issue.
We have a fix in place; it’s just pending code review. I’ll let you know once it’s merged!

Should be fixed in 0.1.45! Release v0.1.45 · superfly/flyctl · GitHub

Thank you, this is awesome.

I think I’m having an issue that might be related to this strategy. My app has scaled down to 1 machine (even though min_machines_running=2) due to failing checks. However, when I try to scale up using the fly scale command, the new machine starts with an additional health check (the blue-green one?), and once it’s (automatically) removed, the machine stops.

Hi, I’m getting a rate limit error with the following error message:

Deployment failed after error: could not get all green machines to be healthy: failed to get VM xxx: resource_exhausted: rate limit exceeded

Any ideas why?

@aarroisi what flyctl version are you on? Your version should be >= v0.1.45.

Also, we’re having an incident (https://status.flyio.net) at the moment which directly affects the “state” used by the deployment code. Kindly monitor the status page & retry when it resolves.

This is not only the fastest strategy but also the best strategy for zero-downtime (or rather zero-instability) deployments :ok_hand:

1 Like

Ah, got it. After updating to v0.1.45 it’s resolved. Thanks.

1 Like

I’ve been looking at this for a while & I’m not sure it’s related to blue-green deployments or the extra health check. When you stop a machine that has health checks, those checks will fail. This is not specific to bluegreen; you can try it out with another Fly app using, say, the rolling or canary strategy.

I think we have to review our autoscaling logic. It doesn’t make sense that it’s downscaling to 1 machine if min_machines_running is 2. I’ll report it to someone in networking & let you know what we find.

Plus, we were/are having some issues with Corrosion, which propagates shared state across our cluster, so some components could be seeing stale data. Either way, we’ll get to the bottom of this behaviour.

We just had a deploy using the new bluegreen strategy fail, but flyctl exited with code 0 …

...
--> Pushing image done
image: registry.fly.io/...
image size: ...

Watch your app at https://fly.io/apps/.../monitoring

Running ... release_command: ...
  release_command 148ed434c45748 completed successfully
Updating existing machines in '...' with bluegreen strategy

Creating green machines
  Created machine 4d89626b67d018 [app]
  Created machine 2874445c029208 [app]

Waiting for all green machines to start
  Machine 2874445c029208 [app] - created
  Machine 4d89626b67d018 [app] - started
Deployment failed after error: could not get all green machines into started state: wait for goroutine timeout

Visit your newly deployed app at https://....fly.dev/

Is this a known issue yet? For our CI pipeline, it would be nice to get back a non-zero exit code :grinning: