I have an openresty container that more or less just serves 301 redirects.
Our monitoring service started to alert us that it’s occasionally hanging / not responding.
I’ve been able to reproduce this by load testing, say, ~250 requests sent to the IP at once. What seems to happen is:
- the 250x 301s return fine
- the next request hangs for up to a minute or so
- after about a minute, things seem to return to normal
I’ve got access_log and error_log going to stdout (I don’t see anything unexpected there), but when that next (hanging) request is sent, nginx doesn’t log anything.
I.e. - it appears the dropped request (immediately after a burst of 250) never actually reaches my container.
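For what it’s worth, the repro is roughly this (a sketch only - APP_HOST below is a placeholder for our app’s hostname/IP, and any tool that fires a burst of parallel requests would do):

```sh
# fire ~250 requests at the app at once
for i in $(seq 250); do curl -s -o /dev/null https://APP_HOST/ & done
wait

# the very next request then hangs for up to a minute
time curl -sv -o /dev/null https://APP_HOST/
```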
What I’m wondering is:
does fly.io have any type of burst protection at the Load Balancer level that would explain this behavior?
The app is called “crystalizer-ingress-production”. I’m not sure the exact date we started seeing the behavior, but I believe it was around March 20th, give or take a week.
When you do TCP-only services like this, we use a different mechanism to “backhaul” to your application from the edge server. It works pretty well, but we haven’t had customers hammer on TCP services like we have HTTP services, so there’s a chance you’re hitting a bug in our TCP backhaul.
The changes to the config file will put our HTTP handler into play, and also enable connection pooling to your VM instances. We’ve had other apps violently abuse our HTTP stack so we have a pretty good idea of how it works.
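For reference, the services section I have in mind is roughly this shape in fly.toml (a sketch only - the internal port and concurrency numbers here are examples, not your actual values):

```toml
[[services]]
  internal_port = 8080   # example; whatever port nginx listens on inside the VM
  protocol = "tcp"

  # example per-instance connection limits
  [services.concurrency]
    hard_limit = 200
    soft_limit = 100

  [[services.ports]]
    port = 80
    handlers = ["http"]

  # this is the part that terminates TLS at our edge instead of in nginx
  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
```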
Right now, we’re actually handling TLS termination at the nginx level (the redirect I mention above sends naked domains to www over HTTPS), so I don’t think the ['http'] handler will work for us, right?
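For context, the relevant server block is roughly this shape (hostnames and cert paths here are placeholders, not our actual config):

```nginx
# TLS terminates in nginx/openresty, and the naked domain 301s to www
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/nginx/certs/example.com.fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/example.com.key;

    return 301 https://www.example.com$request_uri;
}
```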
I know support can be hard to stay on top of, but I just want to be sure this one doesn’t get lost - our client has been asking for a solution for over a week now.
Is your client currently experiencing issues with burst limits? If so, I’d love to know which app it is so I can look it up. We’re running apps with significant usage and the old burst limits were not a problem.
Just tried another set of load tests on the app I mention above, crystalizer-ingress-integration, and I’m seeing the same behavior. One thundering herd of requests is served mostly successfully, but then the following request hangs for roughly one minute.
Please keep in mind: I am using TCP, not the HTTP handler (as we need to handle TLS termination at the nginx level). Kurt seemed to imply that could mean we’re stuck with a buggy TCP backhaul implementation.
Ultimately, I’m trying to understand why this is happening (screenshot from Checkly):
The red lines indicate checks that took over 30 seconds to respond (and were marked as failed). In this case, Checkly is simply checking that the 301 redirect (with TLS termination at the nginx level) is responding.
We test this service from 40 different locations around the world. Did something change on the 14th of April, or is that a red herring?
We deploy all the time. I’ll look back on the 14th to see what we deployed then.
I’m also going to dig deeper on this. I’ll get back to you later this afternoon.
Our metrics don’t show many connections to your app at all, so that’s odd. Our limits are far higher than what we’re seeing for crystalizer-ingress-integration.
Let me iron out some details: crystalizer-ingress-production is the one that end users hit; it often gets bursts of traffic from a couple of big editorial websites. crystalizer-ingress-integration is the one I’m using to debug this issue; it is not used by end users.
Also - it slipped my mind, but on April 14th I switched the Checkly monitor to crystalizer-ingress-integration and scaled the box up to the highest-tier VM (which should be unnecessary, given we’re just running nginx). So that could explain the lack of failed requests after that point.
Also, Kurt - your load test uses http, which (on port 80) does use the ['http'] handler. This will redirect to https but not change the path.
Our load test uses https, which cannot use the ['http'] handler, as we handle TLS at nginx for this case. So perhaps if you run a test against https://prism.shop (our testing URL, which maps to the fly IP for crystalizer-ingress-integration) and do something like 700 arrivals in 5 seconds, you’ll see the behavior I mean: a cURL following that herd of requests hangs for about a minute.
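Concretely, something along these lines shows it (using hey here, but any load tool works; the numbers are ballpark):

```sh
# ~700 requests fired at once (the thundering herd)
hey -n 700 -c 700 https://prism.shop/

# the request right after the herd hangs for about a minute
time curl -sv -o /dev/null https://prism.shop/
```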