I happened to notice one of my dev web apps (nodejs) stopped responding earlier today. I haven’t touched it for days. Last deploy a week ago, ish.
It is behind Cloudflare’s orange-cloud proxy, so all requests to it via its domain currently return a 525. But as I’ve found in the past, that status code is misleading: it’s not really an “SSL” error, it’s just what Cloudflare returns when it can’t get a response back from the origin. Sure enough:
```
Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS       RESTARTS CREATED
ff22ea98 app     431     lhr    run     running 1 total, 1 critical 0        2022-03-11T14:29:48Z
```
A critical failure. Its /healthcheck URL should just respond with a 200 and some JSON. It doesn’t touch a database or do anything else; it simply confirms the Node event loop is running and that routes are generally working.
fly logs is not happy …
```
2022-03-20T22:32:07Z proxy[ff22ea98] lhr [error]error.code=2001 error.message="App connection timed out"
2022-03-20T22:33:07Z proxy[ff22ea98] lhr [error]error.code=2001 error.message="App connection timed out"
2022-03-20T22:33:10Z proxy[ff22ea98] lhr [error]error.code=2001 error.message="App connection timed out"
2022-03-20T22:33:18Z proxy[ff22ea98] lhr [error]error.code=2001 error.message="App connection timed out"
2022-03-20T22:33:27Z proxy[ff22ea98] lhr [error]error.code=2001 ...
```
The app is only being pinged with some simple requests every minute or so (which e.g. send data to AWS CloudWatch and AWS SNS), as shown by the small HTTP transfer. It is under no load and does no CPU-heavy work. I checked Cloudflare and it did not receive a huge spike in requests, or an attack that I can see; the past 24 hours looks flat-ish there.
I currently have the app send its logs to AWS CloudWatch for debugging. It seems those stopped arriving at 2022-03-20T16:15:00.555Z.
Looking at the metrics in the Fly dashboard, that time matches when the Firecracker load soared from 0 (its usual value, as the app is generally near-idle) to 2 and stayed there. RAM is flat, so it’s not a memory leak; I guess that load is CPU? Hmm. That’s odd, as it’s not doing anything differently:
I can’t use fly ssh console to access it, as it won’t connect. Presumably because of the load.
So … I’m not sure what I can do at my end to debug what went wrong or why the load spike is there. I don’t mind that it randomly died (it’s a dev app with a single instance), but it would be good to know why, since I run the same Node.js setup on other apps. Has Node gone crazy? Has someone attacked it directly via its Fly IP (bypassing Cloudflare) and stuck a bitcoin miner or something evil on it? Is it an issue with its host machine? Weird.
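One thing I might add after restarting, so there’s more to go on if it happens again: a tiny event-loop lag monitor. This is a sketch only, assuming plain Node with no extra packages (none of this is in the app today). A timer should fire on schedule; if it fires much later than that, something is hogging the event loop, whether runaway app code or an injected miner, and that would show up in the logs before they go quiet.

```javascript
// Hypothetical event-loop lag monitor (not currently in the app).
// setInterval should fire every INTERVAL_MS; any extra delay beyond
// that means the event loop was blocked. An idle app drifts by a few
// ms; a pegged CPU shows up as lag of hundreds of ms or more.
const INTERVAL_MS = 100;
let last = Date.now();

const monitor = setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS; // delay beyond the schedule
  if (lag > 500) {
    console.error(`event loop lag: ~${lag}ms`);
  }
  last = now;
}, INTERVAL_MS);
```

Shipping that `console.error` line to CloudWatch alongside the existing logs would answer the “has Node gone crazy?” question directly next time.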
I haven’t rebooted/restarted it, so that you can see the current state, but feel free to do that if you want to debug it. I’m guessing an app restart would fix it, but I’d like to know the cause first.
Any thoughts? Thanks!