understanding spikes in P99

hello, I have an app that’s set up as a reverse proxy. Every 5 mins or so I see a spike in p99 latency matched by a spike in server concurrency. Meanwhile, there’s no apparent increase in http traffic.

I’m trying to figure out what’s going on here. My theory is that the non-fly server I’m connecting to is the one being slow, so open connections start piling up on Fly. I guess I’m looking for a way to determine the source of the pileup. If the concurrency in fly is actually what’s causing the spike in latency, then I’d obviously lower my concurrency limits. But I just don’t know!

The server is nikola-proxy btw. Thank you!

Spikes in concurrency do sometimes correlate to slower responses.

You own the code to this? It is handy to instrument the “origin” side of reverse proxies, so what might be helpful is to add a prometheus exporter to your app that tracks request times to the origin, then hook it up to Fly and Grafana: Hooking Up Fly Metrics · Fly

I looked at what we track for your app and there’s not anything obviously wrong. I did notice a couple DigitalOcean IPs connect enough to get throttled – we limit how many TLS handshakes a single IP can perform in a given amount of time and then they queue up. It probably wouldn’t influence the 99th percentile spikes though!

Thanks for taking a look! yes, I own the code that’s triggering this. I’ll take a look at instrumenting the origin side.

All critical requests will be coming from a handful of digitalocean servers. Maybe I’m doing something wrong by heaving each request do its own handshake? But I will say the fact that they’re being throttled sounds problematic. Can you tell me a bit more about the nature of the throttling that’s occurring? Also, is it per-app, per-region, or per-server that the throttling is happening?

FWIW, it is more than the P99 that’s being affected.

You are better off reusing connections and minimizing handshakes in general, it’ll make stuff faster and also keep from triggering weird DoS protections (like our handshake throttle). Is this a Python client? It’s probably simple to enable a connection pool.

TLS throttling doesn’t affect that HTTP response time metric, though. Those spikes happen separately. It’s normal for 95th and 99th percentile response times to spike at the same time when there’s a burst of slow requests, all those numbers mean are “5% of requests were slower than X”.

Throttling is maybe the wrong word. We allow 30 concurrent TLS handshakes per source IP block per app per edge node they connect to. You may not even notice it, the effect is to slower handshakes for specific IPs. They usually still work but it might take a couple of hundred extra milliseconds while they wait for an opening.

1 Like

@david Looking at little closer at our metrics for your app:

  • I see your app has the same response time patterns, meaning that’s likely the source of the issue.
  • Your app is often reaching it’s hard limit of 80 connections. Meaning we’ll retry our load balancing logic trying to get another instance. This adds even more latency.

What reverse proxy are you using? There might be configuration options to tweak to fix this. I think you’re using nginx, right? It shouldn’t have any trouble handling hundreds or thousands of connections.

Could it be a slow upstream? What is your nginx reverse-proxying to?

1 Like

Thank you guys. A few clarifications here:

@kurt:

  1. It is indeed python and I’ll look in to connection pooling. It’s not obvious that’s going to be a quick change for me in my code base.

  2. So you’re saying the response time metric only starts counting after the TLS handshake?

  3. Given that 99% of my traffic will be from 3-5 hosts for these services, I’d love to understand a bit more about the concurrent handshake limit. I’m not great on the terminology here (edge node, lost me), so I’m not 100% sure if scaling to more apps or just more regions could test this issue.

@jerome

  1. You’re saying the response time + concurrency spike are in sync, meaning it’s probably my app that’s the issue?
  2. FWIW when I ran into this issue, the hard and soft limits were 300+ and I recently had a load average above 2(!!!), so last night I tried increasing the lower bound number of instances and lowering the connection limits. I thought this would autoscale to reach some sort of equilibrium.
  3. Am indeed using nginx and I do think it’s likely an issue with a slow upstream, but I wanted to understand the graphs more. It’s proxying to an upstream server operated by an automaker API that is indeed a likely candidate for being unstable.

Another update here. I tried routing a subset of traffic to some other apps with the same setup. They have coincident spikes. To me this means it’s even more likely than before that it’s a slow upstream, but at least in my head the handshake limit could also be a factor.

The response time metrics is really “time to first response header since the start of the HTTP request”. This starts after TLS and ends before sending the body. So it’s more like Time To First Byte (TTFB).

Both the “edge” response time and “app” response time spike. Meaning the edge response time is caused by the app’s response time.

It should help if the load average is very high. Though I can’t figure out why nginx would use so much resources.

1 Like

okay so I’ve changed the soft limit to 100 and the hard limit to 400. AND I lowered the lower bound on autoscaling. I suspect we’ll see no immediate change here because this likely wasn’t the cause. but perhaps we’ll scale more smartly going forward.

On the topic of TLS handshakes, which I understand not to be related to the graphs I posted about, is there a metric I can look at to understand how much of a backup that is causing?

It wouldn’t surprise me if I’m going way way over a threshold here.

Thanks as always guys!

There isn’t currently, but I just looked for you and it doesn’t look too bad.

Past 24h isn’t too bad. For about a minute there (a few hours ago), ~200 TLS handshakes were queued. Those won’t be queued for a long time. Handshake latency remained pretty low (p95 definitely under 200ms).

Past 7 days is a different story.

Latency went way up as handshakes were being queued for a few seconds (sometimes).

Handshakes were primarily happening in SJC.

There are a few things we can do:

  • Higher limits for your app. We don’t have any mechanisms to do that right now, we’d have to build it. The limits on concurrent TLS handshakes per IP blocks are there to protect server resources and ensure fast handshakes for everybody.
  • More edge servers in SJC to spread the load. Limits apply per-server. We don’t usually do this unless server load goes high.

I’ll consult the team about this.

Thanks @jerome!

More edge servers in SJC to spread the load .

Can I influence this by setting a higher autoscale floor? Or by “edge server” are you talking about physical servers that are only in your control?

Also, I’d love it if I were able to track this metric somehow, as it’s invisible to me otherwise.

Thanks!

also FYI I’ve spread out the load over more apps until we figure out the TLS thing

Physical servers. That’s on our end :slight_smile:

I can get the metric published publicly for all our users. There’s no TLS metrics really.

That should help because limits are per-app. However, we need to solve that problem in a better way. We have ideas!

1 Like

The following metrics are now available:

fly_edge_tls_handshake_time_seconds // histogram
fly_edge_tls_handshake_errors // counter
fly_edge_tls_sni_limit_reached_count // counter
fly_edge_tls_ip_limit_reached_count // counter
fly_edge_tls_queue_time_seconds // histogram

Some interesting labels:

ip (limit reached metric)
servername (SNI handshook, if any)
cipher (TLS cipher used)
version (TLSv1.2 or TLSv1.3)
alpn (negotiated ALPN: http/1.1, h2)

(I have not updated our docs)

Edit: I have also increased the number of concurrent handshakes by a little bit.

1 Like

Thank you! Sorry if I’m being repetitive here…Are these metrics for fly globally? Is that what you mean by for “all our users”? Or is this metric one where each user has access to how many of their requests are queueing up, etc.

Oh, I just mean they can be queried via our prometheus-compatible metrics servers, available to each of our users: Metrics on Fly · Fly

I’m building a Grafana dashboard that include all metrics we expose to users so they can easily get started. Should be available next week or so.

@david I’ve created and shared a Grafana Dashboard (post) and added some instructions on how to import it into a Grafana instance.

The dashboard includes a TLS section that should give you the information you’re interested in.

1 Like

awesome! thank you.

@david I’ve made several significant changes on how we handle TLS handshake and how we limit them.

It means you should probably never reach the limits again unless your traffic rises to a much higher level. Can you try going back to a single app?