15 min gap from OOM kill to restart

hey folks,

On our service nikola-credential-updater, we’ve been experimenting with OOM events recently. BTW, great work on the new email you send out about these! Anyway, yes, we dared to trigger one manually just to see what would happen!

I noticed that after one of our servers was OOM killed, it looks like it took 15 minutes before Fly restarted it. Is this 15-minute delay what we should expect, or is there something we can do to tighten it? Thanks!

Under most circumstances, the restart should be fast. If an app is in a crash loop, though, a couple of things happen:

  1. We will try to restart it in place ~3 times (depending on how fast it crashes).
  2. We will try to reschedule the VM.
  3. Each time we reschedule the VM, we wait a little longer to boot the new one.

If you’re seeing 15 min delays, that makes me think your app might be crashing repeatedly.
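
To get a feel for how "wait a little longer each time" can add up to a long gap, here is a rough sketch of a capped exponential backoff. The function name, the 30 second base delay, the doubling factor, and the 10 minute cap are all made-up values for illustration; the actual delays the platform uses are not stated in this thread.

```python
def reschedule_delays(attempts, base_seconds=30.0, factor=2.0, cap_seconds=600.0):
    """Return the wait before each reschedule attempt.

    All defaults are hypothetical: each wait is `factor` times the
    previous one, starting at `base_seconds` and never exceeding
    `cap_seconds`.
    """
    return [min(base_seconds * factor ** i, cap_seconds) for i in range(attempts)]


delays = reschedule_delays(5)
total_wait = sum(delays)

for attempt, delay in enumerate(delays, start=1):
    print(f"reschedule {attempt}: wait {delay:.0f}s")
print(f"cumulative wait: {total_wait / 60:.1f} min")
```

With these invented numbers, a handful of reschedules is enough for the cumulative wait to reach many minutes, which is why a long gap is a hint to look for repeated crashes.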

Does fly status --all show a bunch of failed VMs? You should be able to run fly vm status <id> on a specific one to see the events/OOM restarts.

Thanks for your super fast and helpful reply, as always, Kurt. I will investigate.

@kurt that looks exactly right. We’ve got some digging to do on our end to figure out why this happened. It appears we were getting OOM killed more frequently than we thought. Thank you!

Instance
ID = (id)
Process =
Version = 44
Region = sjc
Desired = stop
Status = failed
Health Checks = 1 total, 1 passing
Restarts = 4
Created = 42m49s ago

Recent Events
TIMESTAMP            TYPE           MESSAGE
2022-08-13T00:27:50Z Received       Task received by client
2022-08-13T00:27:50Z Task Setup     Building Task Directory
2022-08-13T00:27:53Z Started        Task started by client
2022-08-13T00:30:01Z Terminated     OOM Killed
2022-08-13T00:30:01Z Restarting     Task restarting in 1.241942018s
2022-08-13T00:30:07Z Started        Task started by client
2022-08-13T00:32:15Z Terminated     OOM Killed
2022-08-13T00:32:20Z Restarting     Task restarting in 1.037377917s
2022-08-13T00:32:27Z Started        Task started by client
2022-08-13T00:34:35Z Terminated     OOM Killed
2022-08-13T00:34:40Z Restarting     Task restarting in 1.151552615s
2022-08-13T00:34:47Z Started        Task started by client
2022-08-13T00:36:55Z Terminated     OOM Killed
2022-08-13T00:36:55Z Restarting     Task restarting in 1.218158516s
2022-08-13T00:37:02Z Started        Task started by client
2022-08-13T00:39:10Z Terminated     OOM Killed
2022-08-13T00:39:10Z Not Restarting Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2022-08-13T00:39:11Z Killing        Sent interrupt. Waiting 5s before force killing
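
For anyone else reading events like these, a quick way to see how long each run survived is to pair every "Started" event with the next "Terminated" one. This is only a small sketch: the helper name `uptimes_before_oom` is made up, and it assumes each line looks like the output above (an ISO timestamp ending in Z, then the event type, then the message). The Started/Terminated lines are copied from the events above.

```python
from datetime import datetime

# Started/Terminated lines copied from the event list above.
EVENTS = """\
2022-08-13T00:27:53Z Started Task started by client
2022-08-13T00:30:01Z Terminated OOM Killed
2022-08-13T00:30:07Z Started Task started by client
2022-08-13T00:32:15Z Terminated OOM Killed
2022-08-13T00:32:27Z Started Task started by client
2022-08-13T00:34:35Z Terminated OOM Killed
2022-08-13T00:34:47Z Started Task started by client
2022-08-13T00:36:55Z Terminated OOM Killed
2022-08-13T00:37:02Z Started Task started by client
2022-08-13T00:39:10Z Terminated OOM Killed
"""


def uptimes_before_oom(log_text):
    """Return seconds between each Started event and the next Terminated event."""
    uptimes = []
    started_at = None
    for line in log_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        timestamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%SZ")
        event_type = parts[1]
        if event_type == "Started":
            started_at = timestamp
        elif event_type == "Terminated" and started_at is not None:
            uptimes.append((timestamp - started_at).total_seconds())
            started_at = None
    return uptimes


for seconds in uptimes_before_oom(EVENTS):
    print(f"ran for {seconds:.0f}s before being OOM killed")
```

On the events above, every run lasts 128 seconds before the OOM kill. A gap that regular suggests memory use climbs at a steady rate after each start rather than spiking at random, though that is a guess from the timestamps alone.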