What do health checks do when a deployed instance starts failing?

What happens when a health check starts failing after a deploy has completed?

Will a node be taken out of commission?

If a health check is failing but the application is still running, you can configure the number of times it will be restarted before being rescheduled with restart_limit = 6 (or some other number). After the restart_limit is hit, it’s taken out of commission.

If your app’s startup time is getting in the way of your health checks, you can use the grace_period field to have it wait a number of seconds before doing the health check. You can also use restart_limit = 0 to keep it up (this is the fly.toml default).
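As a rough sketch, here is how restart_limit and grace_period might sit together in a fly.toml services block. The internal port (8080) and all the timing values are placeholders for illustration, not recommendations, so adjust them to your app and check the fly.toml reference for your platform version.

```toml
# Illustrative only: the port and timing values below are assumptions.
[[services]]
  internal_port = 8080      # assumed app port
  protocol = "tcp"

  [[services.tcp_checks]]
    interval = "10s"        # how often the check runs
    timeout = "2s"          # how long to wait for each check to respond
    grace_period = "30s"    # wait this long after startup before checking
    restart_limit = 6       # 0 (the default) disables check-based restarts
```

If the app is slow to boot, raising grace_period is usually the first thing to try before touching restart_limit.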

If your app is crashing in a way that doesn’t trigger the health checks, then we’ll restart it with exponential backoff between restarts. Here’s a thread with a great explanation. Note that the backoff behavior may change a bit as we build out Fly Machines for more orchestration work.

My goal was to make sure that a region was automatically disabled the next time a situation like this happens:

We now have a health check that runs every 10 seconds checking upstream DNS, so it sounds like that will be sufficient to mitigate it. Does that sound right to you?
