App is dead, restart does nothing, no logs

Hi, we’ve got a long-lived app (last deployed about 8 months ago) that appears to have recently gone dead. flyctl restart doesn’t appear to do anything and flyctl logs are completely silent. Can someone have a look?

sure! would you mind including some additional info to help narrow this down a bit?

  • what does fly status --all say?
  • have you tried running fly logs with LOG_LEVEL=debug?
  • do you have a time estimate for when this app stopped working? (timestamps, minutes, hours, etc.)

Sure, though I’d rather not post all of the requested info in public. Can we take this to DM or similar?

We believe the app was working as recently as last week, though we check it infrequently and can’t be certain.

completely understandable – feel free to redact any info from that output that you’re not comfortable sharing!

hey, just wanted to chime in here with a few follow-up tips:

There are a few things in flyctl you can use to inspect the behavior of individual VMs. This might turn up more information to help you debug!

For example, fly logs -i will show you output from a specific recent instance. This can come in handy in concert with fly status --all, whose display includes completed instances. You can also always use fly vm status <id>, which displays things like exit codes, health checks, etc.

Hopefully this is still helpful in its fully-redacted state.

Unfortunately, no instances (previous or current) are shown:

 flyctl status --all
Update available 0.0.328 -> v0.0.330.
Run "flyctl version update" to upgrade.
App
  Name     = appname          
  Owner    = orgname     
  Version  = 37                
  Status   = dead              
  Hostname = appname.fly.dev  

Instances
ID	PROCESS	VERSION	REGION	DESIRED	STATUS	HEALTH CHECKS	RESTARTS	CREATED 

Here’s the log debug output:

 LOG_LEVEL=debug flyctl logs
DEBUG Loaded flyctl config from /Users/username/.fly/config.yml
DEBUG determined hostname: "hostname"
DEBUG determined working directory: "/Users/username/git/appname"
DEBUG determined user home directory: "/Users/username"
DEBUG determined config directory: "/Users/username/.fly"
DEBUG ensured config directory exists.
DEBUG ensured config directory perms.
DEBUG cache loaded.
DEBUG config initialized.
DEBUG initialized task manager.
DEBUG skipped querying for new release
Update available 0.0.328 -> v0.0.330.
Run "flyctl version update" to upgrade.
DEBUG client initialized.
DEBUG app config loaded from /Users/username/git/appname/fly.toml
DEBUG --> POST https://api.fly.io/graphql {{"query":"query ($appName: String!) { app(name: $appName) { id name hostname deployed status version appUrl platformVersion currentRelease { evaluationId status inProgress version } config { definition } organization { id slug } services { description protocol internalPort ports { port handlers } } ipAddresses { nodes { id address type createdAt } } imageDetails { repository version } machines{ nodes { id name config state region createdAt app { name } ips { nodes { family kind ip maskSize } } host { id } } } postgresAppRole: role { name } } }","variables":{"appName":"appname"}}
}
DEBUG <-- 200 https://api.fly.io/graphql (357.84ms) {"data":{"app":{"id":"appname","name":"appname","hostname":"appname.fly.dev","deployed":true,"version":37,"appUrl":"https://xxx.xxx.xxx.xxx","platformVersion":"nomad","currentRelease":{"evaluationId":null,"status":"succeeded","inProgress":false,"version":37},"config":{"definition":{"kill_timeout":5,"kill_signal":"SIGINT","processes":[],"experimental":{"allowed_public_ports":[],"entrypoint":[],"cmd":[],"exec":[]},"services":[{"processes":[],"protocol":"tcp","internal_port":8081,"concurrency":{"soft_limit":20,"hard_limit":25,"type":"connections"},"ports":[{"port":443,"handlers":["tls","http"]}],"tcp_checks":[{"interval":"15s","timeout":"2s","grace_period":"1s","restart_limit":6}],"http_checks":[],"script_checks":[]}],"env":{}}},"organization":{"id":"xxxx","slug":"orgname"},"services":[{"description":"TCP 443 ⇢ 8081","protocol":"TCP","internalPort":8081,"ports":[{"port":443,"handlers":["TLS","HTTP"]}]}],"imageDetails":{"repository":"appname","version":null},"postgresAppRole":null,"ipAddresses":{"nodes":[{"id":"ip_xxx","address":"xxx.xxx.xxx.xxx","type":"v4","createdAt":"2021-08-04T12:09:41Z"},{"id":"ip_xxx","address":"x:x:x::x","type":"v6","createdAt":"2021-08-04T12:09:42Z"}]},"machines":{"nodes":[]},"status":"dead"}}}
DEBUG --> POST https://api.fly.io/graphql {{"query":"query ($appName: String!) { app(name: $appName) { id name hostname deployed status version appUrl platformVersion currentRelease { evaluationId status inProgress version } config { definition } organization { id slug } services { description protocol internalPort ports { port handlers } } ipAddresses { nodes { id address type createdAt } } imageDetails { repository version } machines{ nodes { id name config state region createdAt app { name } ips { nodes { family kind ip maskSize } } host { id } } } postgresAppRole: role { name } } }","variables":{"appName":"appname"}}
}
DEBUG <-- 200 https://api.fly.io/graphql (343.22ms) {"data":{"app":{"id":"appname","name":"appname","hostname":"appname.fly.dev","deployed":true,"version":37,"appUrl":"https://xxx.xxx.xxx.xxx","platformVersion":"nomad","currentRelease":{"evaluationId":null,"status":"succeeded","inProgress":false,"version":37},"config":{"definition":{"kill_timeout":5,"kill_signal":"SIGINT","processes":[],"experimental":{"allowed_public_ports":[],"entrypoint":[],"cmd":[],"exec":[]},"services":[{"processes":[],"protocol":"tcp","internal_port":8081,"concurrency":{"soft_limit":20,"hard_limit":25,"type":"connections"},"ports":[{"port":443,"handlers":["tls","http"]}],"tcp_checks":[{"interval":"15s","timeout":"2s","grace_period":"1s","restart_limit":6}],"http_checks":[],"script_checks":[]}],"env":{}}},"organization":{"id":"xxx","slug":"orgname"},"services":[{"description":"TCP 443 ⇢ 8081","protocol":"TCP","internalPort":8081,"ports":[{"port":443,"handlers":["TLS","HTTP"]}]}],"imageDetails":{"repository":"appname","version":null},"postgresAppRole":null,"ipAddresses":{"nodes":[{"id":"ip_xxx","address":"xxx.xxx.xxx.xxx","type":"v4","createdAt":"2021-08-04T12:09:41Z"},{"id":"ip_xxx","address":"x:x:x::x","type":"v6","createdAt":"2021-08-04T12:09:42Z"}]},"machines":{"nodes":[]},"status":"dead"}}}
DEBUG --> POST https://api.fly.io/graphql {{"query":"mutation($input: ValidateWireGuardPeersInput!) { validateWireGuardPeers(input: $input) { invalidPeerIps } }","variables":{"input":{"peerIps":["x:x:x::x"]}}}
}
DEBUG <-- 200 https://api.fly.io/graphql (109.59ms) {"data":{"validateWireGuardPeers":{"invalidPeerIps":[]}}}

Definitely still useful! The output of fly status --all tells us that no VMs have been active for the past few days, which is why the Instances table is empty.

So in this case you’d probably want to redeploy (or raise the scale count) to create new instances.
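If it helps to pull the key fields out of that debug output without reading the whole JSON blob, here is a minimal Python sketch. It assumes the response lines look like the `DEBUG <-- 200 ...` lines pasted above; the function names are made up for illustration and are not part of flyctl.

```python
import json


def response_body(debug_line):
    """Return the JSON body from a 'DEBUG <-- 200 <url> (<ms>) {...}' line."""
    start = debug_line.index("{")  # the body starts at the first brace
    return json.loads(debug_line[start:])


def summarize_app(body):
    """Pick out the app fields that matter for triage, or None if absent."""
    app = (body.get("data") or {}).get("app")
    if app is None:
        return None  # e.g. the validateWireGuardPeers response
    release = app.get("currentRelease") or {}
    return {
        "name": app.get("name"),
        "status": app.get("status"),
        "version": app.get("version"),
        "platform": app.get("platformVersion"),
        "release_status": release.get("status"),
        "release_in_progress": release.get("inProgress"),
    }


# Trimmed copy of one response line from the debug log in this thread.
SAMPLE_LINE = (
    'DEBUG <-- 200 https://api.fly.io/graphql (357.84ms) '
    '{"data":{"app":{"id":"appname","name":"appname","deployed":true,'
    '"version":37,"platformVersion":"nomad","currentRelease":'
    '{"evaluationId":null,"status":"succeeded","inProgress":false,'
    '"version":37},"status":"dead"}}}'
)

summary = summarize_app(response_body(SAMPLE_LINE))
for key, value in summary.items():
    print(f"{key}: {value}")
```

For the log in this thread, that surfaces the combination worth noticing: release 37 is reported as succeeded, yet the app status is dead. The sketch only reads what the API returned; it does not explain why the instances went away.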

OK, sure, we can do that, and I do want to thank you for your quick responses and for being helpful. But more important to us than getting this particular app running again is to try to get some insight into a) why it happened, b) what we can do to help prevent this from happening in the future, and c) why flyctl restart didn’t work.

Unfortunately, on our end, the lack of logs or any trace of a previous instance doesn’t provide us with any actionable information, so I was hoping someone from Fly.io could get to the bottom of it, or at least explain why this might have happened.

completely understandable!

So we do only keep app logs for two days. If you need to keep them for longer, you can ship them somewhere else (something like the fly-log-shipper might fit into your stack).
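As a stopgap while you decide on a proper shipper, here is a rough Python sketch that appends whatever it reads on stdin to one file per UTC day. This is not the fly-log-shipper and not anything Fly.io provides; the script name `archive_logs.py` and the function `archive_lines` are hypothetical, and it assumes fly logs keeps streaming lines to stdout while it runs.

```python
import sys
from datetime import datetime, timezone
from pathlib import Path


def archive_lines(lines, directory, now=None):
    """Append each line to <directory>/<UTC date>.log; return the line count.

    `now` pins the timestamp (useful for testing); by default the current
    UTC time is checked per line, so the file rolls over at midnight UTC.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = 0
    for line in lines:
        stamp = now or datetime.now(timezone.utc)
        path = directory / f"{stamp:%Y-%m-%d}.log"
        # Open per line so each entry reaches disk as soon as it arrives.
        with path.open("a", encoding="utf-8") as handle:
            handle.write(line if line.endswith("\n") else line + "\n")
        written += 1
    return written


# Hypothetical usage: fly logs | python archive_logs.py ./fly-logs
if __name__ == "__main__" and len(sys.argv) == 2:
    archive_lines(sys.stdin, sys.argv[1])
```

This only captures logs while the pipe is running on your machine, so it is no substitute for shipping logs from inside the platform. It does mean that the next time an app goes quiet, there is something older than two days to look at.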

We just now rolled out a change that should allow you to see failed VMs up to 7 days afterward.

This was previously only 2 days, so fly status --all might provide you with a little more insight now! You can then run fly vm status <id> on those instances and look for non-zero exit codes, restarts, etc.

On a related note, while it’s not at all a universal solution, running multiple instances of an app (i.e. raising fly scale count) can greatly improve reliability.

Thank you again for bringing this up!