Oh, great question. This is something we’ve been thinking about, but we don’t have a great answer. We do have a workaround, though: a script check. What I’d probably do is put a script check on your public port like this:
[[services.script_checks]]
interval = 10000
timeout = 1000
command = "/fly/check-nodes.sh"
restart_limit = 0
That command could just be a curl, or a more sophisticated script.
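For the simple curl case, here is a rough sketch of what /fly/check-nodes.sh might contain. The URL, the port 8080, and the CHECK_URL variable are placeholders I made up, not anything Fly defines, and it assumes a non-zero exit code is what marks the check as failing and that the timeout of 1000 above is in milliseconds:

```shell
#!/bin/sh
# Hypothetical /fly/check-nodes.sh: succeeds only if the app answers over HTTP.
# CHECK_URL and the default URL below are assumptions; point them at your own service.

check_url() {
  # -f: treat HTTP errors (status >= 400) as a failure
  # -sS: no progress output, but still print an error message on failure
  # --max-time 1: give up after 1 second (assumes the check timeout is 1000 ms)
  curl -fsS -o /dev/null --max-time 1 "$1"
}

# The exit status of the last command becomes the exit status of the script.
check_url "${CHECK_URL:-http://localhost:8080/}"
```

A more sophisticated script could call check_url once per node and fail if any of them fails.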
Instances restart if there are 5 failures within 25 seconds. You can control the count with restart_limit. If you set it to 3, for example, it’ll restart after 3 failures within 15 seconds.