We fixed a nasty deployment bug for apps with health checks

A short while ago we introduced a bug that caused (under very specific but racy circumstances) machines with health checks to disappear from fly-proxy roster shortly after deployment and no longer being considered by the proxy during routing.

The symptoms of the bug were: not all of the machines in an app served requests, or, if your app had only one machine, the app suddenly became unavailable after deploy until the machine is restarted.

The bug affected only newly created machines or machines with newly added health checks. The health check also had to be configured with very small grace_period to trigger the bug.

Any change to the machine (start → stop, stop → start) or health check status (critical → passing) automagically “fixed” the problem and made the machine available for routing again.

The bug mostly affected blue/green deployments because:

  • they require health checks
  • they always create new machines

Well, this should be fixed now!