Edited to add: we replaced the app with a minimal server.js that does essentially nothing (see the 3rd post), and the exact same behaviour occurs: within a few minutes hard_limit is exhausted.
We have a simple app that fields a lot of requests but has worked fine for several years. Something has happened in the last few days that overwhelms it, and the logs end up filled with:
could not find a good candidate within 21 attempts at load balancing
and
Instance xyz reached hard limit of 500 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.
I have tried raising hard_limit from 25 (which worked fine for years) and removing it entirely, yet it is always exhausted after a few minutes.
Separately we have tested various code / config changes (increasing machines, tuning the simple express app) without any luck.
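For anyone who wants to compare what the proxy reports with what the app itself sees, here is a minimal sketch of counting in-flight requests inside a Node process. It is not our actual server.js: it assumes a plain Node `http` server, and `trackInFlight` and `startServer` are hypothetical names made up for illustration.

```javascript
const http = require("node:http");

// Wraps a request handler and tracks how many responses are currently open.
// Hypothetical helper for illustration; not part of Node, Express, or Fly.
function trackInFlight(handler) {
  const stats = { inFlight: 0, peak: 0 };
  const wrapped = (req, res) => {
    stats.inFlight += 1;
    stats.peak = Math.max(stats.peak, stats.inFlight);
    // "close" is emitted once per response, whether it completed normally
    // or the underlying connection was dropped early.
    res.once("close", () => {
      stats.inFlight -= 1;
    });
    handler(req, res);
  };
  return { wrapped, stats };
}

// Starts a do-nothing server and logs the counters every 5 seconds.
// The port and the logging interval are arbitrary assumptions.
function startServer(port) {
  const { wrapped, stats } = trackInFlight((req, res) => {
    res.end("ok");
  });
  const server = http.createServer(wrapped).listen(port);
  setInterval(() => {
    console.log(`inFlight=${stats.inFlight} peak=${stats.peak}`);
  }, 5000);
  return server;
}
```

Calling `startServer(8080)` and watching the log would show whether the process ever holds anywhere near 500 open requests. If the in-app peak stays low while the proxy still reports the hard limit being reached, that would suggest the requests are being counted somewhere before they reach the app.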
We’ve been assuming that the sudden change is caused by something external to our app and Fly (DDoS, a change in traffic behaviour) and that we need to optimize how the app works. But then I came across a post from a few days ago, "Fly not sending traffic to my apps anymore?", which implies that some changes that might impact routing were made and then rolled back. I’m curious if there’s any chance that those internal changes have somehow bitten us?
Thanks in advance!