Any thoughts on why my app randomly died?


I happened to notice one of my dev web apps (Node.js) stopped responding earlier today. I haven’t touched it for days; the last deploy was about a week ago.

It is behind Cloudflare’s orange-cloud proxy, so all requests to it via its domain currently return a 525. But as I’ve found in the past, that status code is misleading: it’s not really an “SSL” error, just what Cloudflare returns when it can’t get a response back from the origin. Sure enough:

ff22ea98	app    	431    	lhr   	run    	running	1 total, 1 critical	0       	2022-03-11T14:29:48Z

A critical failure. Its /healthcheck URL should just respond with a 200 and some JSON. It does not touch a database or do anything else; it simply returns from the app, confirming the Node.js event loop is running and routes are generally working.

And fly logs is not happy …

error.message="App connection timed out" 2022-03-20T22:32:07Z proxy[ff22ea98] lhr [error]error.code=2001 
error.message="App connection timed out" 2022-03-20T22:33:07Z proxy[ff22ea98] lhr [error]error.code=2001 
error.message="App connection timed out" 2022-03-20T22:33:10Z proxy[ff22ea98] lhr [error]error.code=2001 
error.message="App connection timed out" 2022-03-20T22:33:18Z proxy[ff22ea98] lhr [error]error.code=2001 
error.message="App connection timed out" 2022-03-20T22:33:27Z proxy[ff22ea98] lhr [error]error.code=2001 

It is just being pinged with some simple requests every minute or so which, e.g., send data to AWS CloudWatch and AWS SNS, as shown by the small HTTP transfer. It is under no load and doesn’t do anything CPU-intensive. I checked in Cloudflare and it did not receive a huge spike in requests or an attack that I can see; the past 24 hours look flat-ish there.

I currently have the app send its logs to AWS CloudWatch for debugging. It seems those requests stopped being sent at 2022-03-20T16:15:00.555Z.

Looking at the metrics in the Fly dashboard, that time matches when Firecracker load soared from 0 (the usual value, as it’s generally near-idle) to 2 and stayed there. RAM is flat, so no memory leak, so I guess that load is CPU? Hmm. That’s odd, as it’s not doing anything differently.

I can’t use fly ssh console to access it, as it won’t connect. Presumably because of the load.

So … I’m not sure what I can do at my end to debug what went wrong or why the current load spike is there. I don’t mind that it randomly died (it’s a dev app, one instance), but it would be good to know why, since I have the same Node.js setup on other apps. Has Node.js gone crazy? Has someone attacked it directly via its Fly IP (bypassing Cloudflare) and stuck a bitcoin miner or something evil on it? Is it an issue with its host machine? Weird.

I haven’t rebooted/restarted it so that you can see the current state but feel free to do that if you want to debug it. I’m guessing an app restart would fix it. But I’d like to know the cause first.

Any thoughts? Thanks!

Ok, well I did restart it in order to ssh in to investigate.

And sure enough, doing so dropped the load back to 0. So it’s now 0, if you’re wondering why it all looks well.

To check whether anyone naughty had somehow added anything (not sure how they would have), I installed htop and checked the processes … and all is well. The app continues to do basically nothing. One node process, as expected, plus some Fly innards.

No CPU load. Minimal memory. All is well. No bitcoinminer.evil or something evil there!

So … I’m no further along.

Thinking about it … I’m not entirely sure why the app wasn’t auto-restarted. If an app fails an HTTP healthcheck, should it be restarted by Fly? Is that something I need to configure? I’ve used healthchecks for getting an app up and running (seeing them pass, so the app then deploys), but now I wonder how/if they apply in this context, of an already-deployed app failing.

I’ve been looking at the App Configuration (fly.toml) docs and I don’t see any mention of health checks being used to trigger a restart. If it’s not supported, maybe that option could be added, something like restart_after_failures: 5? I do see there are health check handlers (like Slack) which would notify on a failure. However, failures inevitably happen at an unhelpful time of day, so an auto-restart option/param would be handy (if it’s not in fact already supported and I’m simply missing it, which is entirely possible!)

Load can actually indicate hung requests/connections to external services too, not just pure CPU usage.

This looks to me like the Node process got wedged. You’d need some kind of Node-specific metrics to see it, but it’s not uncommon for the Node event loop to get backed up and drag everything to a halt.
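A crude way to spot that without a full APM agent is to measure event-loop lag yourself: schedule a timer and see how late it actually fires (a hand-rolled sketch; in practice a metrics library does this properly):

```javascript
// Schedule a timer `intervalMs` out and measure how late it fires.
// On a healthy, idle loop the lag is near 0; a backed-up or wedged
// loop shows large (or ever-growing) lag.
function sampleEventLoopLag(intervalMs, onSample) {
  const start = process.hrtime.bigint();
  setTimeout(() => {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    onSample(Math.max(0, elapsedMs - intervalMs));
  }, intervalMs);
}

// Example: warn if the loop is badly behind.
sampleEventLoopLag(100, (lagMs) => {
  if (lagMs > 50) console.warn(`event loop lag: ${lagMs.toFixed(1)}ms`);
});
```

Exporting that number somewhere external matters, because once the loop is fully stuck the process can no longer report anything about itself.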

There is an option to restart on health check failures. Add restart_limit = 6 to the health check definition and it’ll restart the process after 6 consecutive failures.
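In fly.toml that sits on the health check definition, something like this (illustrative values; `restart_limit` is the relevant line, and the exact schema may differ from current docs):

```toml
[[services.http_checks]]
  interval = 10000        # ms between checks
  timeout = 2000          # ms before a check counts as failed
  method = "get"
  path = "/healthcheck"
  protocol = "http"
  restart_limit = 6       # restart after 6 consecutive failures
```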


Thanks @kurt !

Ah, I’ll add that restart_limit then. It would be handy to have that healthcheck option listed on the App Configuration (fly.toml) docs page.

As for Node, hmm, yes, that would make sense. If the event loop got stuck, the app would indeed not respond until it was restarted. I don’t have New Relic or a similar realtime agent/monitor on it to diagnose that. I do have external logs, no errors were sent to them, and all external requests have a timeout. So it seems odd it would spontaneously get stuck. But who knows. I guess there’s no way to know now, since the only way to ssh in and look was to restart it, and restarting it means there’s no stuck event loop left to inspect. Sigh. Ah well.
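For what it’s worth, those timeouts are just a Promise.race wrapper along these lines (an illustrative sketch, not the exact app code):

```javascript
// Reject if `promise` doesn't settle within `ms`, so a hung external
// call can't pin a request handler open indefinitely.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Of course, that only protects each individual call; it wouldn’t help if the event loop itself stopped turning.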

+1 to adding that bit to the docs. I’ve also been running into similar weird Node hangs that I haven’t been able to pinpoint… Also witnessed the load just going crazy and timeout logs. I’ll try the restart_limit option so my process will at least recover itself :grinning:.
I’ll also look into extending my metrics with prom-client (GitHub: siimon/prom-client), a Prometheus client for Node.js.
