When an app instance is moved/restarted, how can we determine why?

mikey · October 28, 2022, 4:56pm

We received an outage notification (from our third-party prober) on a toy app in EWR. It was just down for 5-10 minutes and has recovered.

This app is singly homed, so we expect brief outages whenever it is restarted. Indeed, the logs show what appears to be a relaunch (albeit with an unexpectedly long delay at “Unpacking image”):

$ fly logs
Waiting for logs...
2022-10-28T16:37:44.388 runner[19fcedce] ewr [info] Starting instance
2022-10-28T16:37:45.374 runner[19fcedce] ewr [info] Configuring virtual machine
2022-10-28T16:37:45.401 runner[19fcedce] ewr [info] Pulling container image
2022-10-28T16:39:05.365 runner[19fcedce] ewr [info] Unpacking image
2022-10-28T16:43:22.812 runner[19fcedce] ewr [info] Preparing kernel init
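
For a runbook, the time spent in each phase can be pulled out of log lines like these. A minimal sketch in Python, assuming the line format shown above (timestamp, runner ID, region, level, message) and assuming each phase lasts until the next line is logged; `phase_durations` is a hypothetical helper, not part of flyctl:

```python
from datetime import datetime

# Log lines copied from the `fly logs` output above.
LOG = """\
2022-10-28T16:37:44.388 runner[19fcedce] ewr [info] Starting instance
2022-10-28T16:37:45.374 runner[19fcedce] ewr [info] Configuring virtual machine
2022-10-28T16:37:45.401 runner[19fcedce] ewr [info] Pulling container image
2022-10-28T16:39:05.365 runner[19fcedce] ewr [info] Unpacking image
2022-10-28T16:43:22.812 runner[19fcedce] ewr [info] Preparing kernel init
"""


def phase_durations(log_text):
    """Return (message, seconds until the next log line) for each line but the last."""
    events = []
    for line in log_text.splitlines():
        # Expected fields: timestamp, runner, region, level, then the message.
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip lines such as "Waiting for logs..."
        try:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%f")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        events.append((ts, parts[4]))
    return [
        (msg, (next_ts - ts).total_seconds())
        for (ts, msg), (next_ts, _) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for msg, seconds in phase_durations(LOG):
        print(f"{seconds:9.3f}s  {msg}")
```

On the lines above, this attributes roughly 80 seconds to “Pulling container image” and roughly 257 seconds to “Unpacking image”, which accounts for most of the outage window.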

So all is better now, except: what just happened? We’d like answering that question to be one of the first and most definitive steps in our runbooks.

Question: Where would we go to understand why this restart happened? For example, did it happen because of a failing health check, the process crashing, a Fly-internal reason (e.g., the host machine died), or something else?

In this case I happen to suspect “fly-internal reason” but I cannot yet be sure. There is nothing in the “Activity” tab since our last deployment months ago, which is the place we were hoping to see something. Thanks!

mikey · November 9, 2022, 10:35pm

Any wisdom from the crowd on this one? As it is, it seems like apps can be moved (or die?) without a logged reason… (still hoping this is my user error)

kurt · November 9, 2022, 10:48pm

It’s pretty normal for apps to move. It could happen because they’re crashing, but it might happen for other reasons as well.

You can typically see if there were issues by running fly status --all. This will show you if previous VMs were in a failed state. You can then run fly vm status <id> and get more details about what happened to a specific VM.
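
If you want to script those two checks, here is a rough sketch in Python. It only shells out to the two commands named above and prints whatever they return; it makes no assumption about the output format. `status_commands` and `run_all` are hypothetical helper names, and the VM ID must come from the output of the first command:

```python
import shutil
import subprocess


def status_commands(vm_id=None):
    """Build the argv lists for the two commands mentioned above."""
    cmds = [["fly", "status", "--all"]]
    if vm_id is not None:
        # Details for one specific VM, e.g. one listed as failed.
        cmds.append(["fly", "vm", "status", vm_id])
    return cmds


def run_all(cmds):
    """Run each command in turn, printing it first; skip if flyctl is missing."""
    if shutil.which("fly") is None:
        print("fly CLI not found on PATH; nothing was run")
        return
    for cmd in cmds:
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Start without a VM ID, then re-run with the ID of a failed VM.
    run_all(status_commands())
```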

mikey · November 9, 2022, 10:59pm

Thanks for the response, Kurt!

I suppose the missing piece I am looking for is - does anything give the history of these events? Something like (date, app, event, instance, reason).

mikey · November 15, 2022, 4:27pm

To put a finer point on it:

App rescheduled because of a crash, OOM, etc.: Our problem; we need to root-cause and fix it.
App rescheduled because of some fly.io reason: Nice to know.

We don’t know how to distinguish these two cases, which is the situation in the original post.
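
To make the ask concrete, here is a sketch of the kind of record and triage we have in mind, using the (date, app, event, instance, reason) fields from above. Everything here is hypothetical: the reason labels are made up for illustration and are not values that Fly.io is known to report, and `toy-app` is a placeholder app name.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical reason labels, invented for this sketch.
OUR_PROBLEM = {"crash", "oom"}
PLATFORM = {"host_died"}


@dataclass
class RestartEvent:
    date: datetime
    app: str
    event: str
    instance: str
    reason: str


def triage(ev):
    """Map a restart reason onto the two cases we care about."""
    if ev.reason in OUR_PROBLEM:
        return "root-cause and fix"
    if ev.reason in PLATFORM:
        return "nice to know"
    # Today every restart lands here, because no reason is recorded.
    return "unknown: investigate"


if __name__ == "__main__":
    ev = RestartEvent(
        date=datetime(2022, 10, 28, 16, 37, 44),
        app="toy-app",        # placeholder name
        event="restart",
        instance="19fcedce",  # runner ID seen in the logs
        reason="",            # no reason was available to us
    )
    print(triage(ev))
```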
