project exited silently

This morning we noticed that requests to one of our projects were failing. We checked the deployment and it showed the maintenance icon.

We checked the logs, but there was nothing about the app exiting, just the entries for the previous request.

So we just restarted it and it came back online.
Here are the logs: the last successful request, and then the restart.

2023-01-10T15:25:28.762 app[47e967c0] gru [info] {"latencyInNs":228000000,"level":"info","message":"POST /token 200 228ms","method":"POST","statusCode":200,"url":"/token"}

2023-01-10T16:42:18.004 runner[47e967c0] gru [info] Starting instance

2023-01-10T16:42:32.307 runner[47e967c0] gru [info] Configuring virtual machine

2023-01-10T16:42:41.891 runner[47e967c0] gru [info] Pulling container image

2023-01-10T16:46:31.809 runner[47e967c0] gru [info] Unpacking image

2023-01-10T16:46:58.481 runner[47e967c0] gru [info] Preparing kernel init

2023-01-10T16:48:14.311 runner[47e967c0] gru [info] Configuring firecracker

2023-01-10T16:48:16.056 runner[47e967c0] gru [info] Starting virtual machine

2023-01-10T16:48:16.281 app[47e967c0] gru [info] Starting init (commit: f447594)...

2023-01-10T16:48:16.355 app[47e967c0] gru [info] Preparing to run: ` pnpm run start` as root

2023-01-10T16:48:16.389 app[47e967c0] gru [info] 2023/01/10 16:48:16 listening on [fdaa:0:3bd8:a7b:1f63:47e9:67c0:2]:22 (DNS: [fdaa::3]:53)

In Grafana it looks like the project was simply off for that whole period.

My question is: why didn’t the deployment restart? If something failed, it should show those logs and restart. And what can I do so this doesn’t happen again?

Thanks for the help!

To persist logs, you need to set up fly-log-shipper.

What is restart_limit set to in the health check section (services.tcp_checks) of your app’s fly.toml? If health checks fail or the app OOMs, Fly’s control plane should ideally attempt to auto-restart the app up to restart_limit times (afaik).
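
For reference, here is a rough sketch of where that setting lives, assuming the services-style fly.toml layout. The port and all the check values below are illustrative assumptions, not recommendations, so compare them against your own file and the fly.toml reference before changing anything.

```toml
# Sketch of a services block with a TCP health check.
# internal_port and every value here are placeholders (assumptions).
[[services]]
  internal_port = 8080   # assumed port your app listens on
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "1s"  # time to wait after boot before checking
    interval = "15s"     # how often the check runs
    timeout = "2s"       # how long a single check may take
    restart_limit = 3    # the setting asked about above (illustrative)
```

Note that a health check only helps when the host itself is still responsive; it can't restart an app whose host has gone away.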

That said, in the past, when apps have gone down without warning and never come back up, it has been due to VM (and volume) migrations slipping through the cracks while faulty ("lemon") hosts were being decommissioned.

One solution is to run at least 2 instances, possibly in different regions.

Hi @user121,

The host server where your application was deployed hit a Linux-kernel bug that required a reboot to resolve. We’ve been investigating this kernel bug and also looking into future improvements to our apps platform to help deployments automatically migrate away from unresponsive servers more quickly.

In general, if your application needs to be highly available, we recommend running two or more instances as @ignoramous suggested.
