Cause of instance restart unclear

hello,

I’ve noticed that instances of my app nikola-healthchecker restart every now and then. It’s not inherently an issue, but I wanted to find out why it’s happening to make sure I’m not missing a bigger problem. As far as I can tell, there’s no way to do this. Specifically, when it happens I’ve tried pulling up the logs quickly, but didn’t see anything in the time window.

Any ideas how to look into this? And to be clear, I’m assuming the restart is caused by some bug in my code; I don’t have any information suggesting there’s a Fly bug here.

Thank you

David

The restarts count in the status output increases when the app process exits and our supervisor starts it back up. You can see when these happened with flyctl status instance <id>, but the logs are hard to catch.

If you’re not already using something like Sentry to catch exceptions, I’d look at dropping it in; it will frequently surface crashes the logs miss. At some point we’ll have nicer log search so you can look at these yourself, but the logs may not actually have much information.

I just looked at the instance status for your app: it’s exiting with code -1, which is a weird code to exit with. I didn’t see any log messages when it restarted, either; it just stopped. This makes me think it’s an OOM: your VM is killing the process because it can’t reclaim memory from it.

This is a Node.js app, right? On a 512MB memory VM, you might want to set --max-old-space-size to something like 460MB, instead of the default 512MB.

Thanks Kurt! It’s actually a Tornado app. I had checked the graphs and it wasn’t obvious there was a memory issue.

I don’t presently have Sentry enabled for these instances, but I will add it. Does your advice on the parameter to set still apply?

Thanks!

Oh right, not Node. Memory failures on apps this size don’t really graph well. Gradual leaks are easy to see over time, but out-of-memory conditions can be almost instant. We have some tooling to detect OOMs, but it’s not always reliable.

I’m not sure what memory parameters you can tune with Python but it’s definitely something to look at.
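CPython doesn’t have a single heap-cap flag the way Node does, but you can at least watch the process’s own memory from inside the app. A minimal stdlib-only sketch (assumes a Unix VM, since `resource` isn’t available on Windows; note `ru_maxrss` is kilobytes on Linux but bytes on macOS):

```python
import resource

# Peak resident set size of this process so far.
# On Linux, ru_maxrss is reported in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MiB")
```

Logging this periodically (or on request) from a Tornado handler would make a near-instant OOM much easier to spot than the coarse graphs.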

Ah my OOM info is outdated, you would probably see an OOM error in the status output. Exit code -1 is most likely caused by something else within the app.


Okay, thank you. I’ll keep looking. Also, it looks like I do in fact have Sentry set up, but with a low sample rate, so I’ll need to wait a bit for this to get logged.

I dug a little more and it looks like logs for your app stop entirely about 3 seconds before the exit. Then it started showing boot messages 16 seconds later. It’s almost like the process just hung.

Thank you @kurt. I have a theory that this might be related to a bug where I burned through file descriptors. Do you happen to know what the system limit is?

Oh, we track FDs, and it doesn’t look like you hit the limit.

It looks like the limit in your VMs is ~46k.
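If you want to verify from inside the app, both the limit and the current usage are easy to read. A small stdlib-only sketch (the /proc path assumes a Linux VM):

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft}, hard={hard}")

# Count descriptors currently open (Linux-specific: /proc/self/fd).
open_fds = len(os.listdir("/proc/self/fd"))
print(f"currently open: {open_fds}")
```

Comparing `open_fds` against `soft` over time would confirm or rule out a descriptor leak.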

darn! Thank you.

So I’ve been able to get some Sentry data now. I see a few seconds of no output, and then the process gets killed with

KeyboardInterrupt

I assume that’s Fly? By the way, unless there’s a good reason for that particular interrupt, it might be nice if the interrupt were something that more obviously came from Fly.

That’s actually a SIGINT. If you’re getting that, it probably means the health check is failing and we’re trying to restart the app. Which would make sense if it’s just hung.

It is kind of funny that Python calls that a KeyboardInterrupt exception.
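Python’s mapping of SIGINT to KeyboardInterrupt is easy to reproduce, which explains why a supervisor-sent signal shows up in Sentry under that name. A quick stdlib-only sketch:

```python
import os
import signal
import time

# Python installs a default SIGINT handler that raises KeyboardInterrupt
# in the main thread, regardless of whether the signal came from a
# keyboard (Ctrl-C) or from another process, e.g. a supervisor.
caught = False
try:
    os.kill(os.getpid(), signal.SIGINT)
    time.sleep(0.1)  # give the pending signal a chance to be delivered
except KeyboardInterrupt:
    caught = True
    print("SIGINT surfaced as KeyboardInterrupt")
```

An app could install its own handler with `signal.signal(signal.SIGINT, ...)` to log a clearer "shutting down" message before exiting, though whether that helps here depends on the process not being fully hung.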