Smoke checks during an Apps V2 deploy ensure that machines are up and running consistently for 10 seconds after they are updated in fly deploy. flyctl v0.1.86 updates these checks to pass more quickly when deploying to apps with more than a handful of machines!
Previously, the smoke checks started a 10 second timer and ran a series of checks over those 10 seconds. The intent was to ensure that the machine stayed up and didn’t repeatedly exit over that time.
Since we added smoke checks, we started batching machine updates and then running the smoke checks sequentially. This means the smoke check may run multiple minutes after a machine was updated. For example, an app with 36 machines gets updated in batches of 12, and the smoke check is then run on each of those 12 machines sequentially. It takes ~110 seconds to get to the smoke check for the 12th machine in the batch. At that point, we don’t need to wait another 10 seconds observing the machine…
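To make the ~110 second figure concrete, here is a rough sketch of the arithmetic, assuming each sequential smoke check occupies its full 10 second window before the next one starts. The function name and the simplified timing model are illustrative assumptions, not flyctl code.

```python
SMOKE_WINDOW_SECONDS = 10  # observation window per machine, from the post


def seconds_until_check_starts(position_in_batch, window=SMOKE_WINDOW_SECONDS):
    """Estimate how long a machine waits before its smoke check begins.

    Assumes checks run one after another and each one takes the full
    window, so the machine at 1-based position k waits (k - 1) * window.
    """
    if position_in_batch < 1:
        raise ValueError("position_in_batch is 1-based")
    return (position_in_batch - 1) * window


# The 12th machine in a batch of 12 waits for 11 earlier checks.
print(seconds_until_check_starts(12))
```

Under this model the 12th machine has already been running for well over 10 seconds by the time its check starts, which is why a further 10 second observation adds nothing.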
This change updates the smoke check to look for the most recent start event, and then use that to calculate the uptime. We pass the smoke check when the machine has an uptime greater than 10 seconds. For the batch deployment example above, this means the first machine in a batch will wait ~10 seconds for the smoke check to complete, then the 2nd through 12th machines in the batch will pass the smoke checks nearly immediately, assuming they are up and not constantly restarting.
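The uptime-based check could look roughly like the following. This is a minimal sketch, not the flyctl implementation: the event shape (a dict with hypothetical "type" and "timestamp" keys) and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

SMOKE_WINDOW = timedelta(seconds=10)  # required uptime, from the post


def latest_start(events):
    """Return the timestamp of the most recent 'start' event, or None."""
    starts = [e["timestamp"] for e in events if e["type"] == "start"]
    return max(starts) if starts else None


def smoke_check_passes(events, now):
    """Pass when uptime since the most recent start exceeds the window."""
    started = latest_start(events)
    if started is None:
        return False  # the machine has never started
    return now - started > SMOKE_WINDOW


# A machine that restarted 4 seconds after its first start.
t0 = datetime(2023, 9, 1, 12, 0, 0, tzinfo=timezone.utc)
events = [
    {"type": "start", "timestamp": t0},
    {"type": "exit", "timestamp": t0 + timedelta(seconds=3)},
    {"type": "start", "timestamp": t0 + timedelta(seconds=4)},
]
print(smoke_check_passes(events, t0 + timedelta(seconds=12)))  # uptime 8s
print(smoke_check_passes(events, t0 + timedelta(seconds=15)))  # uptime 11s
```

Because uptime is measured from the latest start event, a machine that keeps restarting resets its own clock and never reaches the threshold, while a machine that has been up since the batch was updated passes as soon as it is checked.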
fly deploy dropped from 6.5 minutes to a little under 2 minutes on a 36-machine app I tested. We have several internal services that fly deploy to apps like this, and we’re enjoying the faster deployments already!
Where does this batch value of 12 come from? I thought rolling always deployed one machine at a time, while immediate deployed to all machines at once…
We do rolling deploys for 40+ machines, and it takes 10 minutes… how do I “batch” machines up for deploy?
The way fly deploy works today (since v0.1.43) for rolling and canary deployments:
1. fly deploy updates the first existing machine, then waits for smoke checks and, if available, health checks.
2. Once that succeeds, the rest of the machines are divided into three batches (if they don’t divide evenly, an extra machine is added to earlier batches). In the 36-machine example, this means the three batches will have sizes 12, 12, and 11: (36 - 1)/3 = 11.67, which rounds down to 11, and then we add one machine to each of the first two batches.
3. Then, flyctl runs the update on each of the 12 machines in the first batch.
4. Once the update API calls are done, flyctl sequentially waits for the smoke checks and health checks for each machine.
5. Once the first batch is done, it repeats steps 3-4 for the second batch.
6. After the second batch is done, it repeats steps 3-4 for the third batch.
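The batch sizing described in the steps above can be sketched as follows. This is an illustration of the arithmetic only, not the flyctl source: the function name is hypothetical, and dropping empty batches for very small apps is an assumption.

```python
def batch_sizes(total_machines, batches=3):
    """Split the machines left after the first one into up to `batches` groups.

    The first machine is updated on its own, so only total_machines - 1
    machines are divided. Any remainder is handed out one machine at a
    time to the earliest batches.
    """
    remaining = total_machines - 1
    if remaining <= 0:
        return []
    base, extra = divmod(remaining, batches)
    sizes = [base + 1 if i < extra else base for i in range(batches)]
    # Assumption: batches that would be empty are simply skipped.
    return [size for size in sizes if size > 0]


print(batch_sizes(36))  # [12, 12, 11]
```

For 36 machines, divmod(35, 3) gives a base size of 11 with a remainder of 2, so the first two batches each take one extra machine.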
The batching is enabled by default.
The healthcheck interval can impact deploy times since we wait for those checks to pass in non-immediate deployments (when healthchecks are present). I often reduce the interval down to 5s–15s, so the check status gets refreshed quickly after a machine is updated.
We have discussed internally some options for running healthchecks more frequently automatically after an update so developers don’t have to have a consistent low interval. We’ll make an announcement when we get around to implementing that!
That wait timeout defaults to 2 minutes, and can be set to something else with the --wait-timeout flag. For example, this would set the wait timeout to 10 minutes: fly deploy --wait-timeout 600.
Yep, we run the smoke checks. A lot of apps don’t have health checks, which motivated us to add the smoke checks. We wanted folks to get a basic “is the machine healthy” check before we proceeded to update other machines during non-immediate deployments.
The challenge ahead of us internally is more about the infrastructure and APIs for healthchecks, and the knobs they provide for tuning things like this. For example, today we would need to call the machine update API to change the interval back and forth, which would result in the machine restarting multiple times. That wouldn’t be a great experience for deployments. There are ways for us to do something different… we just need to implement them.
Once we have that hammered out, I expect we’ll adjust the default behavior of fly deploy to speed up health check intervals right after individual machine updates and provide a config option like the one you suggest for developers to adjust that post-deploy interval to their needs.