We received an outage notification (from our third-party prober) for a toy app in EWR. The app was down for only 5-10 minutes and has since recovered.
This app is singly homed, so brief outages are expected whenever it restarts. Indeed, the logs show what appears to be a relaunch (albeit with an unexpectedly long delay at “Unpacking image”):
$ fly logs
Waiting for logs...
2022-10-28T16:37:44.388 runner[19fcedce] ewr [info] Starting instance
2022-10-28T16:37:45.374 runner[19fcedce] ewr [info] Configuring virtual machine
2022-10-28T16:37:45.401 runner[19fcedce] ewr [info] Pulling container image
2022-10-28T16:39:05.365 runner[19fcedce] ewr [info] Unpacking image
2022-10-28T16:43:22.812 runner[19fcedce] ewr [info] Preparing kernel init
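To put numbers on the delay, here is a minimal Python sketch that measures the gap between consecutive log lines. It assumes every line follows the same shape as the ones pasted above (a timestamp, then bracketed fields, then the message); `parse_line` and `phase_durations` are hypothetical helper names for illustration, not part of any Fly tooling.

```python
from datetime import datetime

# The log lines pasted above, copied verbatim.
LOG_LINES = [
    "2022-10-28T16:37:44.388 runner[19fcedce] ewr [info] Starting instance",
    "2022-10-28T16:37:45.374 runner[19fcedce] ewr [info] Configuring virtual machine",
    "2022-10-28T16:37:45.401 runner[19fcedce] ewr [info] Pulling container image",
    "2022-10-28T16:39:05.365 runner[19fcedce] ewr [info] Unpacking image",
    "2022-10-28T16:43:22.812 runner[19fcedce] ewr [info] Preparing kernel init",
]


def parse_line(line):
    """Split one log line into (timestamp, message).

    Assumes the timestamp is the first space-separated token and the
    message is whatever follows the last "] " on the line.
    """
    stamp, _, rest = line.partition(" ")
    message = rest.rsplit("] ", 1)[-1]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f"), message


def phase_durations(lines):
    """Return (message, seconds until the next log line) for each line but the last."""
    parsed = [parse_line(line) for line in lines]
    return [
        (message, (next_ts - ts).total_seconds())
        for (ts, message), (next_ts, _) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for message, seconds in phase_durations(LOG_LINES):
        print(f"{seconds:8.3f}s  {message}")
```

By these timestamps, about 80 seconds passed between “Pulling container image” and “Unpacking image”, and about 257 seconds (roughly 4 minutes 17 seconds) between “Unpacking image” and “Preparing kernel init”, which accounts for most of the observed downtime.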
Everything is fine now, but what just happened? We’d like answering that question to be one of the first and most definitive steps in our runbooks.
Question: Where would we go to understand why this restart happened? For example, was it caused by a failing health check, a process crash, a Fly-internal reason (e.g., the host machine died), or something else?
In this case I suspect a “Fly-internal reason”, but I can’t be sure yet. The “Activity” tab, which is where we were hoping to find an explanation, shows nothing since our last deployment months ago. Thanks!