Intermittent TLS handshake failures (status-0/ECONNRESET), concentrated at LHR

viraj · June 29, 2026, 8:20am

Hi,

We’ve been chasing intermittent failures where clients get a status-0 “Network Error” (no HTTP response at all) when they hit our app. It looks like the failure is in the TLS handshake at your edge, before anything reaches our machines. It’s bursty and heavily skewed towards LHR, which is where most of our traffic comes in.

The thing that stands out is that it only affects new connections. Anyone already on an established HTTP/2 connection is completely fine, but a fresh page load or a new client connection has a decent chance of failing the handshake. That’s the same behaviour described in #28121 and #28120, and in #28121 one of you traced it to an edge having trouble fetching certificates and timing out (the 10s cert-fetch timeout lined up with the ECONNRESET timing). So my first question is really just: is this the same thing, happening again at LHR?

I don’t want to put our app name and IPs in a public thread, but here’s a recent request-id you can use to find the app and edge on your side:

fly-request-id: 01KW96VR02W9JWGTEMVFK87FMR-bom

(Happy to send the app name, hostname, IPs and org over DM whenever you need them.)

A few non-identifying details that might help: we’re on dedicated v4 + v6 anycast IPs (provisioned back in 2021), machines in lhr, cdg, ams and sin, HTTP/2 on, and a Fly-managed Let’s Encrypt cert that fly certs show reports as issued/verified/active, added about 4 years ago and expiring in roughly a month.

Here’s what we’re seeing in your own Prometheus.

One of our users hit a hard failure at 2026-06-27 11:25:16 UTC. TLS handshake errors at the LHR edge around that minute, in 10-minute buckets:

bucket (UTC)  errors
10:40         2
11:20-11:30   102
11:30-11:40   163
11:40         0

So basically nothing before or after, and a 265-error burst (102 + 163) right across the moment it happened.

It’s not a one-off either. increase(fly_edge_tls_handshake_errors[7d]) by region:

lhr   910   (1179 over 30d)
fra   397   (632)
nrt   299   (317)
ord   273   (390)
everything else under ~90 in 7d

At LHR those 910 handshake errors are against about 10,619 new TCP connects in the same window, so roughly 8.6% of new connections at LHR are failing the handshake. Daily LHR counts over the last couple of weeks show it’s a recurring spiky thing, not a single incident:

06-19   4
06-20  33
06-21 236
06-22  47
06-23 340
06-24  75
06-25  26
06-26  87
06-27  38   (the day a user reported it)
06-28 235
06-29  58
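
For anyone who wants to check the arithmetic, here is a small Python sketch that uses only the numbers quoted above. The variable names and the "3x the median" cut-off for calling a day a spike are my own illustrative choices, and nothing here queries Prometheus.

```python
from statistics import median

# 7d LHR figures copied from this post.
lhr_handshake_errors_7d = 910
lhr_tcp_connects_7d = 10_619

failure_ratio = lhr_handshake_errors_7d / lhr_tcp_connects_7d
print(f"LHR handshake failure ratio over 7d: {failure_ratio:.1%}")  # 8.6%

# Daily LHR handshake error counts, 06-19 through 06-29, as listed above.
daily_errors = {
    "06-19": 4, "06-20": 33, "06-21": 236, "06-22": 47,
    "06-23": 340, "06-24": 75, "06-25": 26, "06-26": 87,
    "06-27": 38, "06-28": 235, "06-29": 58,
}

# Compare each day against the median so a few bad days do not
# distort the baseline the way a mean would.
typical_day = median(daily_errors.values())

# Assumption: "spike" means more than 3x the median day. The factor is arbitrary.
spike_days = [day for day, count in daily_errors.items() if count > 3 * typical_day]

print(f"median day: {typical_day} errors, spike days: {spike_days}")
```

On these numbers the median day is 58 errors and three days (06-21, 06-23, 06-28) come out as spikes, which is what I mean by recurring and spiky.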

Handshake latency looks high too. p99 of fly_edge_tls_handshake_time_seconds over 24h sits around 0.8s at LHR and 0.9-1.0s in several other regions, with one region way out at ~9.9s. Even right now, when a curl to the app succeeds, the TLS portion alone is taking 0.1-0.5s.

We’re fairly confident it isn’t us:

The requests never reach our machines, so there’s nothing in our app logs. Health checks are all passing and fly_edge_error_count is single digits per region over 7d, against hundreds-to-thousands of handshake errors.
It’s not CORS or a 4xx — there’s no response at all, status 0.
It’s not one client or network. It reproduces across browsers and machines and tracks your edge metrics.

What I’m hoping you can help with:

Is this the same edge cert-fetch-timeout issue as #28121, recurring at LHR? The pattern and timing match.
Our cert reads as verified and active, but #27898 showed that can sit alongside a stale or missing SNI binding on an edge after a Vault desync. Would removing and re-adding the cert actually force an edge re-sync here, or is that pointless/risky given the cert is currently healthy?
Could we be hitting the per-edge TLS handshake limits from #22651 (150 concurrent per SNI, 100 per IP block, 60/s)? A single page load opens a few connections, so I want to rule it out. If that’s it, what’s the right way to handle it?
Are you able to line up the 11:20-11:40 UTC burst at LHR on 2026-06-27 with anything on your side? I’ll send the app/IP over DM so you can pin it down.
Is there a recommended way to alert on fly_edge_tls_handshake_errors so we catch this before users do?
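
On the alerting question, this is roughly the check I have in mind, sketched in Python rather than as a real alert rule. The names WindowCounts and should_alert and both thresholds (5% error ratio, at least 50 new connects) are placeholders I made up, not anything Fly recommends. The idea is to alert on the ratio of handshake errors to new TCP connects over a window, and to ignore windows with too little traffic to be meaningful.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Counter increases over one evaluation window for one region."""
    handshake_errors: float  # e.g. increase(fly_edge_tls_handshake_errors[window])
    tcp_connects: float      # e.g. increase(fly_edge_tcp_connects_count[window])


def should_alert(
    counts: WindowCounts,
    max_error_ratio: float = 0.05,  # hypothetical threshold
    min_connects: float = 50,       # hypothetical traffic floor
) -> bool:
    """Return True when the handshake error ratio exceeds the threshold."""
    # Skip quiet windows: two errors out of three connects is noise, not an incident.
    if counts.tcp_connects < min_connects:
        return False
    return counts.handshake_errors / counts.tcp_connects > max_error_ratio


# The 7d LHR figures from this post would trip it (about 8.6% > 5%).
print(should_alert(WindowCounts(handshake_errors=910, tcp_connects=10_619)))
```

If there is a better signal to key on than this ratio, that is exactly what I would like to know.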

On our end we’ve already made the client tolerate these failures, so a dropped request degrades gracefully instead of breaking the page. (The default 3 retries were already firing during the failures; the burst just outlasted them.) So this is really about understanding and fixing the edge side.
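
To illustrate the retry point, here is a minimal sketch of retrying with exponential backoff and full jitter. It is not our actual client code: the function name, the defaults (6 attempts, 0.5s base delay, 30s cap) and the injectable sleep/rng parameters are all assumptions for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 6,          # hypothetical default
    base_delay: float = 0.5,    # seconds, hypothetical default
    max_delay: float = 30.0,    # seconds, hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation(), retrying on connection errors with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:  # includes ConnectionResetError (ECONNRESET)
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential cap per attempt, then a random delay within it
            # ("full jitter") so many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

With these defaults the sleeps add up to at most 15.5s, so even a more patient retry policy cannot ride out a burst that lasts 20 minutes. That is why the client now degrades gracefully rather than relying on retries alone.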

The Prometheus queries I used, if useful:

sum by (region) (increase(fly_edge_tls_handshake_errors{app="<app>"}[7d]))
sum by (region) (increase(fly_edge_tcp_connects_count{app="<app>"}[7d]))
sum(increase(fly_edge_tls_handshake_errors{app="<app>",region="lhr"}[10m]))
histogram_quantile(0.99, sum by (le,region) (rate(fly_edge_tls_handshake_time_seconds_bucket{app="<app>"}[24h])))

Thanks!

flyio-support · June 29, 2026, 1:37pm

Hi Viraj,

We reviewed the TLS errors around June 27 and couldn’t find any corresponding spike across the lhr edges or evidence of a broader issue on our side at that time.

Your app uses dedicated IPs, so automated bots and scanners connecting to port 443 can generate TLS handshake errors in the network metrics even when legitimate traffic is unaffected.

A client-side timeout or slow connection remains a possible explanation. If real clients experienced the problem, it could also be related to their connection or the environment where the client is running. For example, we’ve previously seen TLS issues affecting clients running on particular AWS instances.

For now, we don’t see anything indicating that the app or its TLS configuration needs to be changed.