Instance or service not restarted when I expected it to

Hi @davidhodge,

One of our hosts in the sjc region triggered a bug in Nomad (our VM-orchestration service) that prevented some instances from transitioning correctly, which may be why your instance was stuck in ‘pending’. We restarted the service on the affected host, which unblocked the stuck instances. If this was the issue, things should be cleared up by now. Let us know if you continue to see any unexpected behavior.

As for the restart policy, note that the restart_limit setting only configures restarts triggered by health-check failures. Application-process crashes (including OOM-triggered exits) are handled by a separate internal (not configurable) restart policy. The current policy <checks notes> will restart any exited process up to 2 times within 5 minutes, then re-deploy the instance on another host if the process exits again. If the new deploy continues to fail, the instance will keep getting re-deployed indefinitely, with an exponential delay ranging from 15 seconds to 15 minutes, capped at 15 restarts every 2 hours.
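To make the crash-restart numbers above concrete, here is a rough Python sketch of the policy as described. It is an illustration only, not the actual Nomad or platform code: the function names, the constants' names, and the doubling factor for the backoff are all assumptions (the policy only says "exponential" between 15 seconds and 15 minutes), and the "15 restarts every 2 hours" cap is not modeled.

```python
# Hypothetical model of the crash-restart policy described above.
# All names here are illustrative; none come from Nomad or any real API.

LOCAL_RESTART_LIMIT = 2        # in-place restarts allowed...
LOCAL_WINDOW_SECONDS = 5 * 60  # ...within this window
MIN_DELAY_SECONDS = 15         # shortest re-deploy delay
MAX_DELAY_SECONDS = 15 * 60    # longest re-deploy delay


def next_action(restart_times, now):
    """Decide what happens when the process exits at time `now`.

    `restart_times` holds the times (in seconds) of earlier in-place
    restarts. Returns "restart" while fewer than 2 of them fall inside
    the last 5 minutes, otherwise "redeploy" (move to another host).
    """
    recent = [t for t in restart_times if now - t <= LOCAL_WINDOW_SECONDS]
    return "restart" if len(recent) < LOCAL_RESTART_LIMIT else "redeploy"


def redeploy_delay(attempt):
    """Delay in seconds before re-deploy number `attempt` (0-based).

    Assumes the delay doubles on each attempt, which is a guess; the
    result is clamped to the 15 second - 15 minute range.
    """
    return min(MIN_DELAY_SECONDS * 2 ** attempt, MAX_DELAY_SECONDS)
```

For example, under these assumptions a process that crashes three times within two minutes is restarted in place twice and re-deployed on the third exit, and the re-deploy delay reaches the 15-minute ceiling by the seventh failed re-deploy.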

This is all closely tied to Nomad’s built-in restart behavior, so the exact restart-policy details may change with Machine-based apps.
