What do health checks do when a deployed instance starts failing?

eric1 · August 12, 2022, 3:43am

What happens when a health check starts failing once a deploy has completed?

Will a node be taken out of commission?

eli · August 12, 2022, 4:04pm

Will a node be taken out of commission?

If a health check is failing but the application still runs, you can configure the number of times it will be restarted before being rescheduled with restart_limit = 6 (or some other number). After the restart_limit is hit, it’s taken out of commission.

If your app’s startup time is getting in the way of your health checks, you can use the grace_period field to have it wait a number of seconds before doing the first health check. You can also use restart_limit = 0 to keep it up (this is the fly.toml default).
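For reference, here is a rough sketch of where those two fields could sit in fly.toml, assuming a service with a single TCP check. The port and all timing values are placeholders rather than recommendations, so check them against the fly.toml your app already has:

```toml
# Hypothetical service block; internal_port is an assumed value.
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.tcp_checks]]
    # Wait this long after the instance starts before running checks.
    grace_period = "30s"
    # How often the check runs, and how long it may take.
    interval = "10s"
    timeout = "2s"
    # Restarts allowed on failing checks before the instance is rescheduled.
    # 0 keeps the instance up regardless of check failures.
    restart_limit = 6
```

With restart_limit left at 0, a failing check is still reported, but it does not trigger restarts.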

If your app is crashing in a way that the health checks aren’t triggered, then we’ll restart it with exponential backoff between restarts. Here’s a thread with a great explanation. Note that backoff behavior is something that may change a bit as we build out fly machines for more orchestration stuff.

eric1 · August 12, 2022, 6:46pm

My goal was to make sure that a region was automatically disabled the next time a situation like this happens:

We now have a health check that runs every 10 seconds checking upstream DNS, so it sounds like that will be sufficient to mitigate that. Does that sound right to you?
