What happens when a health check starts failing once a deploy has completed?
Will a node be taken out of commission?
If a health check is failing but the application is still running, you can configure the number of times it will be restarted before being rescheduled with restart_limit = 6
(or some other number). After the restart_limit is hit, it’s taken out of commission.
If your app’s startup time is getting in the way of your health checks, you can use the grace_period
field to have it wait a number of seconds before running the health check. You can also use restart_limit = 0
to keep it up (this is the fly.toml default).
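For reference, here's a rough sketch of how those two settings might sit in fly.toml. The port, interval, timeout, and the specific numbers are placeholders picked for illustration, and it assumes a TCP check under a [[services]] block, so adjust it to match your own config.

```toml
# Illustrative values only; internal_port, interval, and timeout are assumptions.
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.tcp_checks]]
    interval = "10s"      # how often the check runs
    timeout = "2s"        # how long to wait for a response
    grace_period = "30s"  # wait this long after boot before checking
    restart_limit = 6     # restart after this many consecutive failures; 0 disables check-based restarts
```

With restart_limit left at 0, a failing check should not trigger restarts on its own.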
If your app is crashing in a way that doesn’t trigger the health checks, then we’ll restart it with exponential backoff between restarts. Here’s a thread with a great explanation. Note that backoff behavior is something that may change a bit as we build out Fly Machines
for more orchestration stuff.
My goal was to make sure that a region was automatically disabled the next time a situation like this happens:
We now have a health check that runs every 10 seconds checking upstream DNS, so it sounds like that will be sufficient to mitigate that. Does that sound right to you?