Debugging Dropped Requests during Bursts

Hi folks,

I have an openresty container that more or less just serves 301 redirects.

Our monitoring service started to alert us that it’s occasionally hanging / not responding.

I’ve been able to reproduce this by load testing say ~250 requests sent to the IP at once. What seems to happen is:

  • the 250x 301s return fine
  • the next request hangs for up to 1 minute or so
  • after about 1 minute, it seems to return to normal

I’ve got access_log and error_log going to stdout (I don’t see anything that is unexpected), but when that next (hanging) request is sent, nginx doesn’t log anything.

Ie - it appears the dropped request (immediately after a burst of 250), doesn’t actually get to my container.

What I’m wondering is:

  • does fly.io have any type of burst protection at the Load Balancer level that would explain this behavior?

Thank you!

There is a burst protection. We’ve added it recently based on our current metrics. We’re still tweaking it.

It shouldn’t have waited 1 minute to let more connections through though!

Would you mind telling me what app this is? I can figure out if this is indeed our burst protection.

More precisely, burst protection is on the number of tcp connections per second. It’s currently set at 300 / second / edge server / app.

Normal HTTP clients tend to reuse connections and don’t usually trigger the rate limit.

Thanks Jerome,

The app is called “crystalizer-ingress-production”. I’m not sure the exact date we started seeing the behavior, but I believe it was around March 20th, give or take a week.

Thank you!

(However, currently I’ve been testing and tuning in “crystalizer-ingress-integration”)

Will you try testing two things for me?

  1. Set handlers = ['http'] on your service
  2. Add type = "requests" to your concurrency block?

When you do TCP only services like this, it uses a different mechanism to “backhaul” to your application from the edge server. It works pretty well, but we haven’t had customers hammer on TCP services like we have HTTP services, so there’s a chance you’re hitting a bug in our TCP backahul.

The changes to the config file will put our HTTP handler into play, and also enable connection pooling to your VM instances. We’ve had other apps violently abuse our HTTP stack so we have a pretty good idea of how it works.

Thanks Kurt!

Right now, we’re actually handling TLS termination at the nginx level (that redirect I mention above is for redirecting naked domains to www with HTTPS); so I don’t think the ['http'] handler will work for us, right?

app = "crystalizer-ingress-integration"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 2000
    soft_limit = 1800
    type = "requests"

  [[services.ports]]
    handlers = ["http"]
    port = "80"

  [[services.ports]]
    handlers = ["http"]
    port = "443"

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/__healthz__"
    protocol = "http"
    timeout = 2000
    [services.http_checks.headers]
      Host = "health.check"

When I deploy, and curl a domain like https://mynakeddomain.com, I see:

curl: (35) error:1400410B:SSL routines:CONNECT_CR_SRVR_HELLO:wrong version number

Hi folks,

I know support can be hard to stay on top of, but I just want to be sure this one doesn’t get lost - our client has been asking for a solution for over a week now :slight_smile:

Thank you for your hard work!

What kind of burst are we talking about?

I’ve deployed higher limits.

Is your client currently experiencing issues with burst limits? If so, I’d love to know which app it is so I can look it up. We’re running apps with significant usage and the old burst limits were not a problem.

Let me know how it looks now.

Hi Jerome,

Just tried another set of load tests on the app I mention above - crystalizer-ingress-integration; and I’m seeing the same behavior. One thundering herd of requests is served mostly successfully, but then the following request hangs for roughly one minute.

Please keep in mind: I am using TCP, not the HTTP handler (as we need to handle TLS termination at the nginx level). Kurt seemed to imply that could mean we’re stuck with a buggy TCP backhaul implementation.

Ultimately, I’m trying to understand why this is happening (screenshot from Checkly):

The red lines indicate checks that took over 30 seconds to respond (and were marked as failed). In this case, Checkly is simply checking the 301 redirect (with TLS termination at the nginx level) is responding.

We test this service from 40 different locations around the world. Did something change on the 14th of April, or is that a red herring?

We deploy all the time. I’ll look back on the 14th to see what we deployed then.

I’m also going to dig deeper on this. I’ll get back to you later this afternoon.

Our metrics don’t show many connections to your app, at all. So that’s odd. Our limits are far higher than what we’re seeing for crystallizer-ingress-integration.

Hi Jerome,

Let me iron out some details:
crystalizer-ingress-production is the one that end users hit. It often gets bursts of traffic from a couple big editorial websites
crystalizer-ingress-integration is the one i’m using in debugging this issue. It is not used by end users.

Also - it slipped my mind, but on April 14th, I switched the Checkly monitor to crystalizer-ingress-integration, and scaled out the box to the highest tier VM (which should be unnecessary, given we’re just running nginx). So that could explain the lack of failed requests after that point.

Sorry for the confusion!

Is there a chance your tests are following redirects? I just tried this script with k6 and got no errors:

import http from "k6/http";
import { check } from "k6";
export default function () {
    const opts = {
        redirects: 0
    }
    check(http.get("http://crystalizer-ingress-integration.fly.dev", opts), {
        "status is 301": (r) => r.status == 301,
    });
}
k6 run --vus 50 --duration 120s crystalizer.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: crystalizer.js
     output: -

  scenarios: (100.00%) 1 scenario, 50 max VUs, 2m30s max duration (incl. graceful stop):
           * default: 50 looping VUs for 2m0s (gracefulStop: 30s)

^C
running (0m58.9s), 00/50 VUs, 48482 complete and 50 interrupted iterations
default ✗ [=================>--------------------] 50 VUs  0m58.9s/2m0s

     ✓ status is 301

     checks.........................: 100.00% ✓ 48482 ✗ 0
     data_received..................: 23 MB   388 kB/s
     data_sent......................: 5.1 MB  86 kB/s
     http_req_blocked...............: avg=323.87µs min=1µs     med=5µs     max=284.49ms p(90)=9µs     p(95)=11µs
     http_req_connecting............: avg=314.89µs min=0s      med=0s      max=284.37ms p(90)=0s      p(95)=0s
     http_req_duration..............: avg=60.15ms  min=35.48ms med=53.72ms max=2.16s    p(90)=75.01ms p(95)=86.88ms
       { expected_response:true }...: avg=60.15ms  min=35.48ms med=53.72ms max=2.16s    p(90)=75.01ms p(95)=86.88ms
     http_req_failed................: 0.00%   ✓ 0     ✗ 48482
     http_req_receiving.............: avg=123.07µs min=21µs    med=105µs   max=10.02ms  p(90)=191µs   p(95)=230µs
     http_req_sending...............: avg=32.46µs  min=5µs     med=25µs    max=2.96ms   p(90)=47µs    p(95)=65µs
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=60ms     min=35.31ms med=53.57ms max=2.16s    p(90)=74.86ms p(95)=86.7ms
     http_reqs......................: 48482   822.95084/s
     iteration_duration.............: avg=60.71ms  min=35.69ms med=54ms    max=2.16s    p(90)=76ms    p(95)=88.48ms
     iterations.....................: 48482   822.95084/s
     vus............................: 50      min=50  max=50
     vus_max........................: 50      min=50  max=50

I stopped it halfway through.

We are not following redirects. We are using artillery - here’s our loadtest.yaml:

config:
  target: https://prism.shop
  phases:
    - duration: 5
      arrivalCount: 700
scenarios:
  - flow:
      - get:
          url: "/"
          followRedirect: false

Also Kurt - your load test uses http, which (on port 80), does use the ['http'] handler. This will redirect to https but not change the path.

Our load test uses https, which can not use the ['http'] handler, as we handle TLS at nginx for this case. So perhaps if you run a test against https://prism.shop (our testing URL which maps to the fly IP for crystalizer-ingress-integration, and do something like 700 arrivals in 5 seconds, you’ll see the behavior I mean (a cURL following that herd of requests will hang for about a minute).

Will you turn off the http handler for port 80 as well? It’d be helpful to test without tls.

I’ll give artillery a shot, once I figure out what its options mean. :slight_smile:

Just redployed with this toml:

app = "crystalizer-ingress-integration"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 2000
    soft_limit = 1800
    type = "requests"

  [[services.ports]]
    port = "80"

  [[services.ports]]
    port = "443"

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/__healthz__"
    protocol = "http"
    timeout = 2000
    [services.http_checks.headers]
      Host = "health.check"

Oh that’s helpful. Can you share your nginx config? This could be connection timeouts between our load balancer and nginx.

Hi Kurt,

If you email me at hugh@sanctuary.computer or send me your github user I’ll add you to the codebase!

mrkurt on github works!