Debugging Dropped Requests during Bursts

Sent, thank you! :slight_smile:

@hugh will you add the settings from the “timeout” section of this article to the nginx config? Optimizing Nginx Configuration for High Loads - Fresh Blurbs by Irakli Nadareishvili

This is probably timeout related. We proxy the TCP connection to you, which adds a layer of complexity compared to connecting directly.

The http handler worked because it manages those connections intelligently. TCP is a little more brittle, and you have to tune nginx yourself.
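For reference, the kind of “timeout” settings that article suggests look roughly like this (values are illustrative, not a recommendation):

```nginx
http {
  # Fail slow clients faster so sockets free up during bursts
  client_body_timeout 10;
  client_header_timeout 10;
  send_timeout 10;
  keepalive_timeout 30;
  # Release memory held by connections that timed out
  reset_timedout_connection on;
}
```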

We’ve run the app in a couple of regions; in general it seems to error more when we’re farther away from it, and almost never when it’s nearby.

Ok I tested with some of those settings. It might be behaving better, it’s hard to tell! This is somewhat common for my artillery results:

All virtual users finished
Summary report @ 14:19:05(-0500) 2021-04-16
  Scenarios launched:  700
  Scenarios completed: 675
  Requests completed:  675
  Mean response/sec: 56
  Response time (msec):
    min: 71
    max: 11112.8
    median: 2218.9
    p95: 6615.9
    p99: 8163.1
  Scenario counts:
    0: 700 (100%)
  Codes:
    301: 675
  Errors:
    ECONNRESET: 25

The very high p95 and p99 times are suspicious.

I do not think this is a bug in our TCP proxy, exactly. But it’s probably related to TCP proxying + nginx. It’s very hard to tune nginx when it’s behind a proxy.

If it’s possible to let us handle TLS for you, and even HTTP, you’ll get much more reliable HTTP service. I’m unsure what you’re using the stream directive for, so maybe that’s not possible! Our HTTP handler does a lot to make HTTP requests work, including retrying when a connection drops, connection pooling, etc.

Thanks @kurt! Agreed re: p95 & p99.

Did you see the issue where, after you send these 700 requests, subsequent requests (cURL, etc.) hang and time out after ~1 minute (after which the service seems to recover)? My suspicion is that’s the same culprit Checkly have been reporting.

Did you see that our service provisions SSL in Lua and handles termination there? My understanding is that I can not use your TLS or HTTP handlers in such an architecture - correct me if I’m wrong.

(and, following from that, we use the stream directive to split raw HTTP & HTTPS traffic, so to treat them differently - I don’t believe there’s another way to do this?)

Also one thing to consider: you could just be maxing out some component of our edge node. When you use TCP directly, what happens is:

  1. Our proxy accepts a connection
  2. The proxy connects to the region hosting the actual app instance over wireguard
  3. Then it connects to your VM.

It effectively turns 1 TCP connection into three.

There is a lot in there that could hit a bottleneck. In a perfect world, we’d just route TCP connections directly to your VMs.

Did you see that our service provisions SSL in Lua and handles termination there? My understanding is that I can not use your TLS or HTTP handlers in such an architecture - correct me if I’m wrong.

I did! I’m curious what makes you need to manage certificates and terminate TLS yourself? Running a single VM with its own certificate management is almost the worst case for Fly. You’re probably better off doing that on something like DigitalOcean, since it bypasses a lot of complexity.

We do have a GraphQL endpoint you can use to “install” certificates that we then use at the edge.

The reason we do TLS ourselves is:

  • Fly forces you to use DNS validation for certs, but we’d like our end users to have the magic moment of “point your DNS at this IP, and it just works”.
  • Fly charges monthly per cert (the pricing is very reasonable!). But, this is an extra cost we’d need to pass on to our customers (and we’re hoping to bill purely usage based, rather than monthly subscriptions)

Fly is great for us precisely because a single VM in a single region isn’t an acceptable architecture for a global ingress serving traffic from all over the world, one that upgrades all connections to SSL and proxies through to the target (that’s what a subdomain does against this nginx config).

Our production environment is running in a bunch of regions worldwide (and storing certs in a centralized redis, not Fly’s local redis). As such, we’d like that ingress to be close to the end users - I was under the impression this is a perfect usecase for something like Fly.


But - reading between the lines I think you’re saying we should give up on Fly and move elsewhere, right? :confused:

Ahh! Well, first thing is, you can just point a hostname at Fly and we issue the cert. DNS verification is only required for wildcards. By default, we do require that there’s an AAAA record to the app’s IPv6 to issue the cert. We can turn that off if it helps.

Second, from our experience, no one should do usage-based pricing unless they have to. If we were offering any other product we wouldn’t be doing usage rates. :wink: We tried to keep our cert prices low enough that no one would want to do their own certificate management, while still letting us donate to Let’s Encrypt. If you expect your customers to have thousands of certs, we can work out a different plan with you (either now or when you grow), if that’s the only hangup.

Given that you do want these ingress servers close to everyone, we’re a better option than DigitalOcean. I misunderstood what you were doing, since the app I was looking at only had one VM. :wink: Bypassing our TLS stack and doing your own does skip a lot of what we’re actually good for, though! We do a lot of work to make TLS + HTTP really fast.

Is this also the case if I want to handle HTTP/2 connections in my app? Your docs advise to use e.g. the Go H2 stack and handle TLS ourselves if we want to use H2.

HTTP/2 is enabled by default for every port with the http handler.

If you want to handle H2 and TLS yourself, not using any handlers will forward TCP packets as-is to your app.
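In fly.toml that’s just a port with no handlers, something like this (a sketch; you can also simply omit the handlers line):

```toml
# Raw TCP pass-through: Fly forwards packets to the app untouched
[[services.ports]]
port = 443
handlers = []
```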

If you want us to handle TLS (and therefore manage your certificates), but want to handle H2 yourself, then you can just use the tls handler on the port. You need to use tls_options.alpn to tell our proxy to handle the h2 alpn, like so:

# ...
[[services.ports]]
port = 443
handlers = ["tls"]
tls_options = { alpn = ["h2", "http/1.1"] }

If I use the tls handler, how would I configure the Go HTTP server to properly handle h2? Doesn’t Go require TLS for h2?

You can use the h2c package (golang.org/x/net/http2/h2c) to create a non-TLS h2 server.

We’ve tested this use case before and it works well.

Edit: packets will still be encrypted between your users and us, since we handle the TLS session. They will also be encrypted between us and your app, because packets on our network only travel over WireGuard.


Is this the “optimal” setup you’d suggest for a go h2 server on fly? By optimal, I mean most efficient, stable, no weird timeouts etc…
Are there any tradeoffs to this if I terminate tls myself in terms of performance and stability?
It would be helpful if you could compare the different options.
Handling tls certs, updating them, etc. is not an issue for me. It’s more about getting the best in terms of stability and perf.

(Perhaps this would be better in a different thread)

Unless you need to do HTTP/2 push, you don’t need to handle HTTP/2 yourself.

It’s much less efficient than letting us handle HTTP/2 like we do for everybody else using the http handler :slight_smile:

Some caveats:

  • No automatic compression of HTTP bodies (adhering to the Accept-Encoding header, of course).
  • No Fly-Client-IP header; you’ll need to use the proxy_proto handler if you want us to send the HAProxy PROXY protocol header for each connection so you get the “real” IP.

If you terminate TLS yourself, it will likely be slower than if you let Fly do it. Instead of Fly handshaking close to your users, the handshake will sometimes happen an extra hop away (if you don’t have an app deployed to all our regions). Even if we’re forwarding TCP packets as-is, it’s not “free” in terms of performance.

We’re already handling tens of thousands of certificates and thousands of TLS handshakes per second. There are some optimizations we can do regarding session resumption, but the vast majority of handshakes are completed within ~25-50ms.

You should handle TLS if you want to support older protocols. We only support TLSv1.2 and TLSv1.3.

@hugh we’re still investigating the dropped connections issue. I can’t reproduce it, but Kurt can. What’s odd is I VPN’d to many different locations around the world (including Kurt’s) and I couldn’t reproduce.

I even ran artillery with an arrivalCount of 5000 (instead of the 700 from your configuration) and it was working fine for me. I got decent p95 response times as well as throughput; almost all requests completed within 5-10s. It only had trouble in certain regions (and only with such a high arrivalCount setting), or if I set the arrivalCount much higher (e.g. 10000).

That said, I could only achieve the higher arrivalCount by not using https. I don’t know if there’s a bottleneck there somehow. There’s no difference with how we handle connections for your app. Maybe there are NGINX tweaks specifically for TLS handling? However, we do suggest letting us handle TLS for you :slight_smile:
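If you do keep terminating TLS in NGINX/OpenResty, TLS session reuse is the usual tweak; something like this (illustrative values):

```nginx
# Reuse TLS sessions so repeat visitors skip the full handshake
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1h;
ssl_session_tickets on;
```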

Oh cool! I had no idea - in that case I’ll spike out a version of the ingress this week that offloads TLS to Fly and we’ll see if that rectifies the issue.

Huh - very strange! I see the same dropped request issues on http & https. Just tested both http & https again this morning and I’m still seeing the dropped requests.

In any case, as a next step I’ll try offloading TLS to Fly this week, and we’ll see how that goes :slight_smile:

Hi folks,

I took a pass at this this afternoon.

  1. I updated my DNS to include the AAAA record:

  2. I’ve deployed this nginx config (stripping back all SSL handling):

user www-data;
worker_processes auto;
pid /run/openresty.pid;

events {
  worker_connections 4096;
  multi_accept on;
}

http {
  sendfile on;
  keepalive_timeout 65;

  log_format main '[$time_local]($request) $host $status $body_bytes_sent';

  server {
    listen 8080;
    listen [::]:8080;

    location / {
      if ($http_x_forwarded_proto = "http") {
        return 301 https://$host$request_uri;
      }
      return 200 '你好';
    }
  }

  server {
    listen 8080;
    listen [::]:8080;
    server_name ~^[a-z\d-]*\.[a-z\d]+$;
    return 301 https://www.$host$request_uri;
  }
}

(The second server block is for redirecting a naked domain to www.)

  3. I deployed with this fly.toml:

app = "crystalizer-ingress-integration"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 2000
    soft_limit = 1800
    type = "requests"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  4. Issue the cert:

flyctl certs create prism.shop --config ./fly.integration.toml

  5. Re-run the load test

Results
Unfortunately, I am still getting the same behavior:

  • issue 700 requests to the http://prism.shop or https://prism.shop URL
  • it serves some or most of them
  • the load balancer appears to choke and time out for up to a minute before it starts serving more requests

Is there something I’m missing? I was under the impression that using fly’s TLS & HTTP handlers (and the biggest VM size) would mean this nginx config should be able to handle thousands of concurrent requests. :slight_smile:


I couldn’t reproduce myself, but maybe one of my colleagues can! We’ll keep digging for sure, this is odd.

Started phase 0, duration: 5s @ 21:49:06(-0400) 2021-04-23
Report @ 21:49:13(-0400) 2021-04-23
Elapsed time: 6 seconds
  Scenarios launched:  1500
  Scenarios completed: 1500
  Requests completed:  1500
  Mean response/sec: 242.33
  Response time (msec):
    min: 106.4
    max: 298.3
    median: 121.5
    p95: 205.4
    p99: 264
  Codes:
    301: 1500

All virtual users finished
Summary report @ 21:49:13(-0400) 2021-04-23
  Scenarios launched:  1500
  Scenarios completed: 1500
  Requests completed:  1500
  Mean response/sec: 240.77
  Response time (msec):
    min: 106.4
    max: 298.3
    median: 121.5
    p95: 205.4
    p99: 264
  Scenario counts:
    0: 1500 (100%)
  Codes:
    301: 1500

config:
  target: https://prism.shop
  phases:
    - duration: 5
      arrivalCount: 1500
scenarios:
  - flow:
      - get:
          url: "/"
          followRedirect: false

I’m pretty sure the load testing is failing on the client side, not at the load balancer. I can’t replicate the failures from a beefy VM, but they happen consistently on my macbook over wifi.

Got it. In that case it might be the ISP throttling my requests after a burst.

Back to the drawing board. I’ll do some more research!

I don’t think it’s your ISP! I think it’s just a system-level issue; that tool is sending 700 requests at once, which is quite a lot. That works out to >7k requests per second if each one takes 100ms to complete.