Sudden increase in connections causes hard_limit to be exhausted (even with minimal test-case app doing no work)

abosworth · February 12, 2025, 11:00pm

Edited to add: we replaced the app with a totally minimal server.js that basically does nothing (see 3rd post) and the exact same behaviour happens: within a few minutes hard_limit is exhausted

We have a simple app that fields a lot of requests and has worked fine for several years. Something has changed in the last few days that overwhelms it, and the logs end up being filled with

could not find a good candidate within 21 attempts at load balancing

and

Instance xyz reached hard limit of 500 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.

I have tried raising hard_limit from 25 (which worked OK for years), or removing it entirely, and yet it is always exhausted after a few minutes.
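The log message above ties the concurrency limit to response time. One rough way to reason about that relationship is Little's law (average requests in flight = arrival rate × average latency). The sketch below is only an illustration: the function names are hypothetical, the 10 ms latency is an assumed value, and it treats the roughly 250 requests per second reported later in this thread as the load on a single machine, which may not match the real traffic split.

```javascript
// Little's law: average in-flight requests = arrival rate x average latency.
// Assumptions: 250 req/s hitting one machine (figure taken from later in the
// thread) and hard_limit = 500 (from the fly.toml in this thread).

// Average number of concurrent requests for a given rate and latency.
function inFlightRequests(requestsPerSecond, avgLatencyMs) {
  return (requestsPerSecond * avgLatencyMs) / 1000;
}

// Average latency (ms) at which in-flight requests reach the hard limit.
function latencyAtLimitMs(requestsPerSecond, hardLimit) {
  return (hardLimit / requestsPerSecond) * 1000;
}

const rate = 250;      // requests per second (assumed per machine)
const hardLimit = 500; // concurrent requests

// A healthy handler answering in ~10 ms (assumed) keeps very few requests open.
console.log(inFlightRequests(rate, 10)); // 2.5

// The limit is only reached once average latency climbs to about 2 seconds.
console.log(latencyAtLimitMs(rate, hardLimit)); // 2000
```

Under these assumptions, hitting any limit between 25 and 500 at a steady request rate suggests latency is growing without bound rather than traffic simply exceeding the limit, which is why raising hard_limit only delays the failure.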

Separately we have tested various code / config changes (increasing machines, tuning the simple express app) without any luck.

We’ve been assuming that the sudden change is because of something external to our app and Fly (DDoS, traffic behaviour change) and that we need to optimize how the app works. But then I came across this post from a few days ago, "Fly not sending traffic to my apps anymore?", which implies some changes were made and then rolled back that might impact routing. I’m curious if there’s any chance that the internal changes have somehow bitten us?

Thanks in advance!

mabis · February 13, 2025, 3:49pm

what do the access logs of your app say regarding incoming requests?

abosworth · February 13, 2025, 7:13pm

Hi Mabis, good question! I installed morgan (from npm), dropped it into Express, and removed basically all other functionality of the app. All the app does now is this on a few endpoints:

app.post('/[redacted]', async (req, res) => {
  res.sendStatus(200)
  res.end();
})

So it is responding with a 200 and immediately closing the connection. Our real app used to do this and then do some other stuff, which I have removed from the live app for debugging purposes.

Morgan logs show very typical requests that we would expect, nothing that looks malicious.

Despite this, our app immediately falls over within 2-5 minutes of restarting:

Two restart attempts shown in this graph of HTTP response codes, pink/purple are 2xxs and the blue is 5xx:

Same two restarts across response times and concurrency (top of concurrency graph is 500, our current hard_limit):

For completeness, here is our fly.toml

# fly.toml file generated for flu-stats-production on 2022-12-20T15:08:50-08:00

app = "REDACTED"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8080"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]
  [http_service.concurrency]
    type = "requests"
    soft_limit = 200
    hard_limit = 500

And our current test-case minimal server.js file

const express = require("express");
const cors = require('cors')
const morgan = require('morgan')

const app = express();
app.disable('x-powered-by');

app.options('*', cors({
  origin: true,
}));
app.use(cors())
app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use(express.text({
  type: 'text/plain',
}));
app.use(morgan('combined'))

const port = process.env.PORT || 3000;

app.get("/up", (req, res) => {
  res.status(200).send(`v${process.env.npm_package_version} ok ;)`);
  res.end();
});

app.post('/REDACTED', async (req, res) => {
  res.sendStatus(200)
  res.end();
})

app.post(['/REDACTED'], async (req, res) => {
  res.sendStatus(200);
  res.end();
});

app.listen(port, () => console.log(`REDACTED app version ${process.env.npm_package_version} listening at port ${port}!`));

I’m at a total loss about why this is happening!

abosworth · February 18, 2025, 8:05pm

Fly Support helped me out here: it was CPU exhaustion. Apparently 250 requests per second is enough to max out the shared-cpu-1x machines. I had scaled memory and machine count but not the CPU of each machine.
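For anyone hitting the same symptoms, one way to spot CPU exhaustion from inside a Node process is to sample `process.cpuUsage()` and the event loop delay histogram from `perf_hooks`. This is a minimal sketch, not what Fly Support used: `cpuFraction` and `startCpuWatch` are hypothetical helper names, and the 5-second interval is an arbitrary choice.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Fraction of one CPU core used over a sampling window.
// cpuDelta: { user, system } in microseconds; elapsedMs: wall-clock time.
function cpuFraction(cpuDelta, elapsedMs) {
  const busyMs = (cpuDelta.user + cpuDelta.system) / 1000;
  return busyMs / elapsedMs;
}

// Periodically logs CPU usage and event loop delay. Returns a stop function.
function startCpuWatch(intervalMs = 5000, log = console.log) {
  const loopDelay = monitorEventLoopDelay({ resolution: 20 });
  loopDelay.enable();

  let lastCpu = process.cpuUsage();
  let lastTime = process.hrtime.bigint();

  const timer = setInterval(() => {
    const now = process.hrtime.bigint();
    const cpu = process.cpuUsage();
    const elapsedMs = Number(now - lastTime) / 1e6;
    const delta = {
      user: cpu.user - lastCpu.user,
      system: cpu.system - lastCpu.system,
    };

    // Histogram values are reported in nanoseconds.
    const meanDelayMs = loopDelay.mean / 1e6;
    const p99DelayMs = loopDelay.percentile(99) / 1e6;

    log(
      `cpu=${(cpuFraction(delta, elapsedMs) * 100).toFixed(1)}% ` +
      `loopDelay mean=${meanDelayMs.toFixed(1)}ms p99=${p99DelayMs.toFixed(1)}ms`
    );

    lastCpu = cpu;
    lastTime = now;
    loopDelay.reset();
  }, intervalMs);

  timer.unref(); // do not keep the process alive just for monitoring

  return () => {
    clearInterval(timer);
    loopDelay.disable();
  };
}

// Usage (e.g. next to app.listen in server.js):
// const stopWatch = startCpuWatch();
```

A CPU figure that stays near 100% of one core, together with event loop delay that keeps growing, would point at the process being CPU-bound rather than at routing or traffic anomalies.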

system · February 25, 2025, 8:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
