When an app instance is moved/restarted, how can we determine why?

We received an outage notification (from our third-party prober) for a toy app in EWR. The app was down for only 5-10 minutes and has since recovered.

This app is singly homed, so we expect to see brief outages whenever it is restarted. Indeed, we caught what appears to be a re-launch in the logs (albeit with an unexpectedly long delay at “unpacking image”):

$ fly logs
Waiting for logs...
2022-10-28T16:37:44.388 runner[19fcedce] ewr [info] Starting instance
2022-10-28T16:37:45.374 runner[19fcedce] ewr [info] Configuring virtual machine
2022-10-28T16:37:45.401 runner[19fcedce] ewr [info] Pulling container image
2022-10-28T16:39:05.365 runner[19fcedce] ewr [info] Unpacking image
2022-10-28T16:43:22.812 runner[19fcedce] ewr [info] Preparing kernel init 
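
To put numbers on that delay, here is a minimal sketch that computes the gap between consecutive log lines. It assumes each line has exactly the shape shown above (an ISO-like timestamp, then the source, region, and an `[info]` message); the helper names `parse_line` and `step_durations` are made up for illustration and are not part of flyctl. On these five lines it attributes roughly 80 seconds to pulling the image and about 4 minutes 17 seconds to unpacking it.

```python
from datetime import datetime

# Log lines copied from the `fly logs` output above.
LOG_LINES = """\
2022-10-28T16:37:44.388 runner[19fcedce] ewr [info] Starting instance
2022-10-28T16:37:45.374 runner[19fcedce] ewr [info] Configuring virtual machine
2022-10-28T16:37:45.401 runner[19fcedce] ewr [info] Pulling container image
2022-10-28T16:39:05.365 runner[19fcedce] ewr [info] Unpacking image
2022-10-28T16:43:22.812 runner[19fcedce] ewr [info] Preparing kernel init
""".splitlines()


def parse_line(line):
    """Split one log line into (timestamp, message).

    Assumes the '<timestamp> <source> <region> [info] <message>' layout above.
    """
    stamp, rest = line.split(" ", 1)
    message = rest.split("[info] ", 1)[1]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f"), message


def step_durations(lines):
    """Return (message, seconds until the next log line) for every step but the last."""
    events = [parse_line(line) for line in lines if line.strip()]
    return [
        (message, (next_ts - ts).total_seconds())
        for (ts, message), (next_ts, _) in zip(events, events[1:])
    ]


for message, seconds in step_durations(LOG_LINES):
    print(f"{message:<28} {seconds:9.3f} s")
```

The gap after a line is only a proxy for how long that step took, since the runner may do unlogged work in between.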

So all is well now, but what just happened? We’d like answering that question to be one of the first and most definitive steps in our runbooks.

Question: Where would we go to understand why this restart happened? For example, did it happen because of a failing health check, a process crash, a Fly-internal reason (e.g. the host machine died), or something else?

In this case I happen to suspect a “Fly-internal reason”, but I cannot yet be sure. The “Activity” tab, which is where we were hoping to see something, shows nothing since our last deployment months ago. Thanks!

Any wisdom from the crowd on this one? As it is, it seems like apps can be moved (or die?) without a logged reason… (still hoping this is my user error)

It’s pretty normal for apps to move. It could happen because they’re crashing, but it might happen for other reasons as well.

You can typically see if there were issues by running fly status --all. This will show you if previous VMs were in a failed state. You can then run fly vm status <id> and get more details about what happened to a specific VM.

Thanks for the response, Kurt!

I suppose the missing piece I am looking for is this: does anything give a history of these events? Something like (date, app, event, instance, reason).

To put a finer point on it:

  • App rescheduled because of a crash, OOM, etc.: Our problem, need to root-cause + fix.
  • App rescheduled because of some Fly.io reason: Nice to know.

We don’t know how to distinguish these two cases, which is the situation in the original post.
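
To make the request concrete, here is a rough sketch of the kind of record and triage rule I have in mind. Everything in it is hypothetical: the `RestartEvent` type, the reason labels, and the `triage` function are invented for illustration, and nothing here reflects what Fly.io actually records or exposes.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical reason labels; these are NOT values Fly.io is known to emit.
OUR_PROBLEM = {"crash", "oom", "health_check_failed"}
PLATFORM = {"host_failure", "host_maintenance"}


@dataclass(frozen=True)
class RestartEvent:
    """One row of the wished-for history: (date, app, event, instance, reason)."""
    date: datetime
    app: str
    event: str
    instance: str
    reason: str


def triage(event):
    """Map an event to the runbook action from the two bullets above."""
    if event.reason in OUR_PROBLEM:
        return "our problem: root-cause + fix"
    if event.reason in PLATFORM:
        return "platform reason: nice to know"
    return "unknown: cannot distinguish"


# The incident from the original post, where no reason was available to us.
# "toy-app" is a placeholder app name.
incident = RestartEvent(
    date=datetime(2022, 10, 28, 16, 37, 44),
    app="toy-app",
    event="rescheduled",
    instance="19fcedce",
    reason="",
)
print(triage(incident))
```

With a reason attached to each event, the first runbook step would reduce to a lookup; without one, every restart falls into the "unknown" branch, which is where we are today.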