Instance or service not restarted when I expected it to

davidhodge · July 25, 2022, 8:01pm

Hey all,

Starting Saturday evening, one of my Fly services running a single instance on a shared-cpu-1x with 512MB RAM was killed due to OOM. I was surprised by what happened next and would love some guidance on how to think about how Fly operates here and on what to do in the future. In case it’s relevant, this is essentially a worker where we only want one of them running at a time.

I expected Fly.io to try to boot a new instance or restart the existing instance, but it did not.
If it was not able to restart the old instance or start a new one, I’d have expected to be notified in some manner.
Further, the status of the service in the Fly dashboard seemed to show “pending”, and I think at some point something in green, neither of which makes sense for a service that wasn’t even running.

Mitigations so far:

I’ve upped the instance size to be dedicated and have 4GB of RAM.
We’ve put in better alerting on our existing systems.

Thoughts / ideas? Thanks!

greg · July 25, 2022, 8:17pm

This may not be the case here (as yours shows as pending), but I found that when my app’s instances randomly failed (in my case from a random spike, but it would be the same for memory), they did not get restarted. I too was expecting they would be.

The fix was simple: add a healthcheck to your fly.toml (tcp and/or http, depending on your app) and make sure it has a restart_limit set, since the default is 0, which disables restarts:

restart_limit: The number of consecutive TCP check failures to allow before attempting to restart the VM. The default is 0, which disables restarts based on failed TCP health checks.

See:

Upping the RAM to 4GB will also fix it, just in a more expensive way.
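
For anyone landing here later, a minimal sketch of what such a check might look like in fly.toml. The port, the timings, and the restart_limit value of 6 are illustrative assumptions rather than recommendations, and it assumes the app actually listens on a TCP port (a pure background worker may not):

```toml
# Hypothetical fly.toml fragment; all values are placeholders for illustration.
[[services]]
  internal_port = 8080   # assumed port the app listens on
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "10s" # time to wait after the VM starts before checking
    interval = "15s"     # how often the check runs
    timeout = "2s"       # how long to wait for a connection
    restart_limit = 6    # consecutive failures before a restart; 0 disables restarts
```

The same restart_limit key can also be set under [[services.http_checks]] if an HTTP check suits the app better.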

davidhodge · July 25, 2022, 9:25pm

Really helpful, thanks Greg. I tried to investigate by triggering an OOM myself, and what I found was that the instance was restarted. And I hadn’t yet made a change to restart_limit! So I’m still a bit puzzled.

Perhaps they’ll restart failed instances if they fail health checks within a certain amount of time of boot up?

Additionally, I’m having a bit of trouble making sense of how health checks work when we have 1 instance versus how I would expect them to act when we have many instances as part of an autoscaled service. With a multi-instance service, if there is an instance that is no longer responding, I’d think they’d stop sending traffic to it and replace it with a new instance. So why isn’t that automatic when there’s just one instance? Surely I’m missing something, but this is my mental model at least.

tj1 · July 25, 2022, 9:33pm

Yah, this is a bit confusing. The restart_limit hint is good to know though.

greg · July 25, 2022, 9:46pm

Hmm … Well, that’s surprising. If you haven’t specified a restart_limit, my understanding was that VMs would not be restarted. It’s possible that has been changed and the docs not updated to reflect it, or maybe there are circumstances where the stars align and it does happen regardless. Like you say, perhaps there is a time/age component to it.

Given your VM showed as pending, that may again point to a restart actually happening, or at least being attempted. Not sure.

As regards the load balancer model, I guess again it comes back to what “should” happen and whether people’s expectations may differ. Personally I would say the default should be to replace an instance that is not responding to a healthcheck. And yes, replace it. As an app with one instance that is now not available to handle requests … well, that’s not much use. That would be different to auto-scaling where the instance would be added or removed based on a load metric, rather than based on health. I was trying to think how AWS did it back when I had one instance behind an ELB … I vaguely recall when that instance failed (and hence failed the LB check on it), it too did not automatically get replaced/restarted by AWS. It needed an additional healthcheck to tell it to. Which would explain Fly’s approach.

wjordan · July 26, 2022, 5:19am

Hi @davidhodge,

One of our hosts in the sjc region triggered a bug in Nomad (our VM-orchestration service) that was preventing some instances from transitioning correctly, which may have been why your instance was stuck in ‘pending’. We restarted the service on the affected host, which unblocked the stuck instances. If this was the issue, things should be cleared up by now; let us know if you continue to see any unexpected behavior.

As for the restart policy, note that the restart_limit setting only configures restarts triggered by health-check failures. Application-process crashes (including OOM-triggered exits) are handled by a separate internal (not configurable) restart policy. The current policy <checks notes> will restart any exited processes up to 2 times within 5 minutes, then re-deploys the instance on another host if the process exits again. If the new deploy continues to fail, the instance will continue to get re-deployed indefinitely, with an exponential delay ranging from 15 seconds to 15 minutes, capped at 15 restarts every 2 hours.

This is all very tied to Nomad’s built-in restart behavior, so the exact restart-policy details may change with Machine-based apps.