Debugging Dropped Requests during Bursts

hhff · April 13, 2021, 2:17pm

Hi folks,

I have an openresty container that more or less just serves 301 redirects.

Our monitoring service started to alert us that it’s occasionally hanging / not responding.

I’ve been able to reproduce this by load testing say ~250 requests sent to the IP at once. What seems to happen is:

the 250x 301s return fine
the next request hangs for up to 1 minute or so
after about 1 minute, it seems to return to normal

I’ve got access_log and error_log going to stdout (I don’t see anything that is unexpected), but when that next (hanging) request is sent, nginx doesn’t log anything.

Ie - it appears the dropped request (immediately after a burst of 250), doesn’t actually get to my container.

What I’m wondering is:

does fly.io have any type of burst protection at the Load Balancer level that would explain this behavior?

Thank you!

jerome · April 13, 2021, 2:24pm

There is a burst protection. We’ve added it recently based on our current metrics. We’re still tweaking it.

It shouldn’t have waited 1 minute to let more connections through though!

Would you mind telling me what app this is? I can figure out if this is indeed our burst protection.

jerome · April 13, 2021, 2:28pm

More precisely, burst protection is on the number of tcp connections per second. It’s currently set at 300 / second / edge server / app.

Normal HTTP clients tend to reuse connections and don’t usually trigger the rate limit.

hhff · April 13, 2021, 2:43pm

Thanks Jerome,

The app is called “crystalizer-ingress-production”. I’m not sure the exact date we started seeing the behavior, but I believe it was around March 20th, give or take a week.

Thank you!

hhff · April 13, 2021, 2:44pm

(However, currently I’ve been testing and tuning in “crystalizer-ingress-integration”)

kurt · April 13, 2021, 2:56pm

Will you try testing two things for me?

Set handlers = ['http'] on your service
Add type = "requests" to your concurrency block?

When you do TCP only services like this, it uses a different mechanism to “backhaul” to your application from the edge server. It works pretty well, but we haven’t had customers hammer on TCP services like we have HTTP services, so there’s a chance you’re hitting a bug in our TCP backahul.

The changes to the config file will put our HTTP handler into play, and also enable connection pooling to your VM instances. We’ve had other apps violently abuse our HTTP stack so we have a pretty good idea of how it works.

hhff · April 13, 2021, 7:52pm

Thanks Kurt!

Right now, we’re actually handling TLS termination at the nginx level (that redirect I mention above is for redirecting naked domains to www with HTTPS); so I don’t think the ['http'] handler will work for us, right?

app = "crystalizer-ingress-integration"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 2000
    soft_limit = 1800
    type = "requests"

  [[services.ports]]
    handlers = ["http"]
    port = "80"

  [[services.ports]]
    handlers = ["http"]
    port = "443"

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/__healthz__"
    protocol = "http"
    timeout = 2000
    [services.http_checks.headers]
      Host = "health.check"

When I deploy, and curl a domain like https://mynakeddomain.com, I see:

curl: (35) error:1400410B:SSL routines:CONNECT_CR_SRVR_HELLO:wrong version number

hhff · April 16, 2021, 2:52pm

Hi folks,

I know support can be hard to stay on top of, but I just want to be sure this one doesn’t get lost - our client has been asking for a solution for over a week now

Thank you for your hard work!

jerome · April 16, 2021, 3:07pm

What kind of burst are we talking about?

I’ve deployed higher limits.

Is your client currently experiencing issues with burst limits? If so, I’d love to know which app it is so I can look it up. We’re running apps with significant usage and the old burst limits were not a problem.

Let me know how it looks now.

hhff · April 16, 2021, 3:28pm

Hi Jerome,

Just tried another set of load tests on the app I mention above - crystalizer-ingress-integration; and I’m seeing the same behavior. One thundering herd of requests is served mostly successfully, but then the following request hangs for roughly one minute.

Please keep in mind: I am using TCP, not the HTTP handler (as we need to handle TLS termination at the nginx level). Kurt seemed to imply that could mean we’re stuck with a buggy TCP backhaul implementation.

Ultimately, I’m trying to understand why this is happening (screenshot from Checkly):

The red lines indicate checks that took over 30 seconds to respond (and were marked as failed). In this case, Checkly is simply checking the 301 redirect (with TLS termination at the nginx level) is responding.

We test this service from 40 different locations around the world. Did something change on the 14th of April, or is that a red herring?

jerome · April 16, 2021, 3:35pm

We deploy all the time. I’ll look back on the 14th to see what we deployed then.

I’m also going to dig deeper on this. I’ll get back to you later this afternoon.

Our metrics don’t show many connections to your app, at all. So that’s odd. Our limits are far higher than what we’re seeing for crystallizer-ingress-integration.

hhff · April 16, 2021, 4:14pm

Hi Jerome,

Let me iron out some details:
crystalizer-ingress-production is the one that end users hit. It often gets bursts of traffic from a couple big editorial websites
crystalizer-ingress-integration is the one i’m using in debugging this issue. It is not used by end users.

Also - it slipped my mind, but on April 14th, I switched the Checkly monitor to crystalizer-ingress-integration, and scaled out the box to the highest tier VM (which should be unnecessary, given we’re just running nginx). So that could explain the lack of failed requests after that point.

Sorry for the confusion!

kurt · April 16, 2021, 4:15pm

Is there a chance your tests are following redirects? I just tried this script with k6 and got no errors:

import http from "k6/http";
import { check } from "k6";
export default function () {
    const opts = {
        redirects: 0
    }
    check(http.get("http://crystalizer-ingress-integration.fly.dev", opts), {
        "status is 301": (r) => r.status == 301,
    });
}

k6 run --vus 50 --duration 120s crystalizer.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: crystalizer.js
     output: -

  scenarios: (100.00%) 1 scenario, 50 max VUs, 2m30s max duration (incl. graceful stop):
           * default: 50 looping VUs for 2m0s (gracefulStop: 30s)

^C
running (0m58.9s), 00/50 VUs, 48482 complete and 50 interrupted iterations
default ✗ [=================>--------------------] 50 VUs  0m58.9s/2m0s

     ✓ status is 301

     checks.........................: 100.00% ✓ 48482 ✗ 0
     data_received..................: 23 MB   388 kB/s
     data_sent......................: 5.1 MB  86 kB/s
     http_req_blocked...............: avg=323.87µs min=1µs     med=5µs     max=284.49ms p(90)=9µs     p(95)=11µs
     http_req_connecting............: avg=314.89µs min=0s      med=0s      max=284.37ms p(90)=0s      p(95)=0s
     http_req_duration..............: avg=60.15ms  min=35.48ms med=53.72ms max=2.16s    p(90)=75.01ms p(95)=86.88ms
       { expected_response:true }...: avg=60.15ms  min=35.48ms med=53.72ms max=2.16s    p(90)=75.01ms p(95)=86.88ms
     http_req_failed................: 0.00%   ✓ 0     ✗ 48482
     http_req_receiving.............: avg=123.07µs min=21µs    med=105µs   max=10.02ms  p(90)=191µs   p(95)=230µs
     http_req_sending...............: avg=32.46µs  min=5µs     med=25µs    max=2.96ms   p(90)=47µs    p(95)=65µs
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=60ms     min=35.31ms med=53.57ms max=2.16s    p(90)=74.86ms p(95)=86.7ms
     http_reqs......................: 48482   822.95084/s
     iteration_duration.............: avg=60.71ms  min=35.69ms med=54ms    max=2.16s    p(90)=76ms    p(95)=88.48ms
     iterations.....................: 48482   822.95084/s
     vus............................: 50      min=50  max=50
     vus_max........................: 50      min=50  max=50

I stopped it halfway through.

hhff · April 16, 2021, 4:16pm

We are not following redirects. We are using artillery - here’s our loadtest.yaml:

config:
  target: https://prism.shop
  phases:
    - duration: 5
      arrivalCount: 700
scenarios:
  - flow:
      - get:
          url: "/"
          followRedirect: false

hhff · April 16, 2021, 4:20pm

Also Kurt - your load test uses http, which (on port 80), does use the ['http'] handler. This will redirect to https but not change the path.

Our load test uses https, which can not use the ['http'] handler, as we handle TLS at nginx for this case. So perhaps if you run a test against https://prism.shop (our testing URL which maps to the fly IP for crystalizer-ingress-integration, and do something like 700 arrivals in 5 seconds, you’ll see the behavior I mean (a cURL following that herd of requests will hang for about a minute).

kurt · April 16, 2021, 4:25pm

Will you turn off the http handler for port 80 as well? It’d be helpful to test without tls.

I’ll give artillery a shot, once I figure out what its options mean.

hhff · April 16, 2021, 4:29pm

Just redployed with this toml:

app = "crystalizer-ingress-integration"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 2000
    soft_limit = 1800
    type = "requests"

  [[services.ports]]
    port = "80"

  [[services.ports]]
    port = "443"

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/__healthz__"
    protocol = "http"
    timeout = 2000
    [services.http_checks.headers]
      Host = "health.check"

kurt · April 16, 2021, 4:35pm

Oh that’s helpful. Can you share your nginx config? This could be connection timeouts between our load balancer and nginx.

hhff · April 16, 2021, 4:37pm

Hi Kurt,

If you email me at hugh@sanctuary.computer or send me your github user I’ll add you to the codebase!

kurt · April 16, 2021, 4:38pm

mrkurt on github works!

Topic		Replies	Views
understanding spikes in P99	18	1699	July 21, 2021
Throttling and DDoS protection for fly app	8	1963	March 28, 2023
Infrastructure dev notes	17	1839	February 1, 2022
Fly.io down, or is it just me?	2	406	May 20, 2023
Dowtime for more that 15 minutes already	7	350	November 3, 2022

Debugging Dropped Requests during Bursts

Related topics