We recently experienced super slow load times of ~20s for requests that usually take less than ~100ms. Our server is in the EWR region and hasn’t been redeployed in over a day. In checking the status of our server, we noticed that it was restarted ~15m ago. Would love some additional info on what might have happened/caused our server to restart, and how best to handle an event like this in the future.
That restart column indicates that either the process crashed, or the health checks failed and the VM was restarted. If enough of those happen, we’ll actually replace the VM entirely.
The best way to check this is with flyctl logs -i <instance id>. Our log feature is somewhat rudimentary, but usually those restarts have a stack trace or something.
Also, if you run flyctl status --all you’ll see VMs that are no longer running. If they’re less than a few days old the logs command might still show you the last of their output.
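Putting those two together, a quick diagnostic pass looks roughly like this (the instance ID is a placeholder; substitute one from the status output):

```sh
# List every VM, including ones that have already been replaced,
# and note the ID and restart count of the failed instance.
flyctl status --all

# Pull the last output from that instance; crashes usually end
# with a stack trace or an error just before the restart.
flyctl logs -i <instance-id>
```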
Thanks Kurt! It looks like we had a failed instance with a total of 6 restarts. Looking at the logs, our instance went from a health check status of “passing” to “critical” a few times. Is there any way to diagnose what caused this to happen?
There’s not much beyond the app logs, unfortunately. When a check goes critical, it means the process isn’t responding to network connections (or to HTTP checks, if you have those configured in fly.toml). If the app just hung and didn’t log anything, there’s no real way for us to see why.
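For reference, an HTTP check in fly.toml looks something like the sketch below. The port and /healthz path are placeholders for whatever your app actually exposes, and the exact keys and defaults are in the fly.toml docs, so treat this as a sketch rather than a drop-in config:

```toml
[[services]]
  internal_port = 8080
  protocol = "tcp"

  # Hypothetical health endpoint; point this at a cheap route that
  # exercises the app without hitting external dependencies.
  [[services.http_checks]]
    interval = 10000      # ms between checks
    timeout = 2000        # ms before a check counts as failed
    grace_period = "5s"   # time to allow after a VM boots
    method = "get"
    path = "/healthz"
    protocol = "http"
```

If that path stops answering, the check flips from “passing” to “critical” the same way the TCP check did in your logs.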
If you’re worried about that happening again, it’s worth running flyctl scale set min=2 to make sure there are always 2 VMs running. When one fails health checks we’ll send all the requests to the other until it recovers or is replaced.
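Concretely, that looks like the following (the second command is just to confirm the change took effect):

```sh
# Keep at least two VMs running so one can take traffic while the
# other is failing checks or being replaced.
flyctl scale set min=2

# Confirm the new instance count and watch the health checks settle.
flyctl status
```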