Cause of instance restart unclear

hello,

I’ve noticed that instances of my app nikola-healthchecker restart every now and then. It’s not inherently an issue, but I wanted to find out why it’s happening to make sure I’m not missing a bigger problem. As far as I can tell, there’s no way to do this. Specifically, when it happens I’ve tried pulling up the logs quickly, but didn’t see anything in the time window.

Any ideas how to look into this? And to be clear, I’m assuming the restart is caused by some bug in my code; I don’t have any information suggesting there’s a Fly bug here.

Thank you

David

The restarts count in the status output increases when the app process exits and our supervisor starts it back up. You can see when these happened with flyctl status instance <id>, but the logs are hard to catch.

If you’re not already using something like Sentry to catch exceptions, I’d look at dropping it in; it will frequently surface crashes the logs miss. At some point we’ll have nicer log search so you can look at these yourself, but the logs may not actually have much information.

I just looked at the instance status for your app: it’s exiting with code -1, which is a weird code to exit with. I didn’t see any log messages when it restarted, either; it just stopped. This makes me think it’s an OOM: your VM is killing the process because it can’t reclaim memory from it.

This is a Node.js app, right? On a 512MB memory VM, you might want to set --max-old-space-size to something like 460MB, instead of the default 512MB.

Thanks Kurt! It’s actually a Tornado app. I had checked the graphs and it wasn’t obvious there was a memory issue.

I don’t presently have Sentry enabled for these instances, but I will add it. Does your advice on the parameter to set still apply?

Thanks!

Oh right, not Node. Memory failures on apps this size don’t really graph well. Gradual leaks are easy to see over time, but out-of-memory conditions can be almost instant. We have some tooling to detect OOMs, but it’s not always reliable.

I’m not sure what memory parameters you can tune with Python but it’s definitely something to look at.
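CPython doesn’t have a single heap-cap flag the way Node does, but you can at least watch the process’s own memory from inside the app. A minimal stdlib-only sketch (assumes a Unix VM, since `resource` isn’t available on Windows; note `ru_maxrss` is kilobytes on Linux but bytes on macOS):

```python
import resource

# Peak resident set size of this process so far.
# On Linux, ru_maxrss is reported in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MiB")
```

Logging this periodically (or on request) from a Tornado handler would make a near-instant OOM much easier to spot than the coarse graphs.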

Ah my OOM info is outdated, you would probably see an OOM error in the status output. Exit code -1 is most likely caused by something else within the app.


Okay, thank you. I’ll keep looking. Also, it looks like I do in fact have Sentry set up, but with a low sample rate, so I’ll need to wait a bit for this to get logged.

I dug a little more and it looks like logs for your app stop entirely about 3 seconds before the exit. Then it started showing boot messages 16 seconds later. It’s almost like the process just hung.

Thank you @kurt. I have a theory that this might be related to a bug where I burned through file descriptors. Do you happen to know what the system limit is?

Oh, we track FDs, and it doesn’t look like you hit the limit.

It looks like the limit in your VMs is ~46k.
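If you want to verify from inside the app, both the limit and the current usage are easy to read. A small stdlib-only sketch (the /proc path assumes a Linux VM):

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft}, hard={hard}")

# Count descriptors currently open (Linux-specific: /proc/self/fd).
open_fds = len(os.listdir("/proc/self/fd"))
print(f"currently open: {open_fds}")
```

Comparing `open_fds` against `soft` over time would confirm or rule out a descriptor leak.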

darn! Thank you.

So I’ve been able to get some Sentry data now. I see a few seconds of no output, and then the process gets killed with

KeyboardInterrupt

I assume that’s Fly? By the way, unless there’s a good reason for that particular interrupt, it might be nice if the interrupt were something that more obviously came from Fly.

That’s actually a SIGINT. If you’re getting that, it probably means the health check is failing and we’re trying to restart the app. Which would make sense if it’s just hung.

It is kind of funny that Python calls that a KeyboardInterrupt exception.
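Python’s mapping of SIGINT to KeyboardInterrupt is easy to reproduce, which explains why a supervisor-sent signal shows up in Sentry under that name. A quick stdlib-only sketch:

```python
import os
import signal
import time

# Python installs a default SIGINT handler that raises KeyboardInterrupt
# in the main thread, regardless of whether the signal came from a
# keyboard (Ctrl-C) or from another process, e.g. a supervisor.
caught = False
try:
    os.kill(os.getpid(), signal.SIGINT)
    time.sleep(0.1)  # give the pending signal a chance to be delivered
except KeyboardInterrupt:
    caught = True
    print("SIGINT surfaced as KeyboardInterrupt")
```

An app could install its own handler with `signal.signal(signal.SIGINT, ...)` to log a clearer "shutting down" message before exiting, though whether that helps here depends on the process not being fully hung.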