We’re running a small API app on Fly.io (shared IPv4, dedicated IPv6, custom domain with CNAME). Looking at the Fly Edge Grafana TLS dashboard, we consistently see handshake errors on edges ord, iad and dfw.
The “Handshake errors” panel shows a steady trickle of failures (max ~25 on ord, a few on dfw), while the “Handshake times” panel regularly shows requests hitting +Inf buckets.
Is this a normal baseline for apps on shared IPv4? Are these typically caused by bots/scanners, or do they indicate real client failures?
We recently had two users in the US unable to connect due to TLS failures (tls: first record does not look like a TLS handshake / tlsv1 alert protocol version), and we’re trying to understand whether the background error rate we see in Grafana is related or just noise.
Also, we’re curious whether allocating a dedicated IPv4 would fix this.
It’s hard to say what’s going on with handshake failures, since client-side errors also show up as handshake failures, and we see bots scanning our IP space almost constantly, especially shared IPs. Our edges also rate-limit per IP address, so it’s possible your IP happened to hit the rate limit when your users reported TLS errors. Is it possible to share the name of your app (or even just a machine ID from it)? I can’t promise I’ll be able to tell exactly what went wrong, but if there are obvious problems I might be able to spot them.
Using a dedicated IP should definitely help in this case as well.
Thank you for your reply!
App name: mobai-api, machine ID: 8ed724b7ee4568 (currently in fra). Previously it was machine 0805091b2762d8 in iad when the errors occurred (Mar 7-9).
We’re also seeing timeouts (not just TLS failures) from the iad edge. With 10 concurrent requests to a trivial /health endpoint (it returns static JSON), 4 out of 10 time out after 10s. The TLS handshake completes fine, but 0 bytes are received. The machine is in fra, on a dedicated IPv4.
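For reference, this is roughly how we ran the test, a minimal Python sketch rather than our exact tooling; the URL is a placeholder for our app’s health endpoint:

```python
import concurrent.futures
import time
import urllib.request

def probe(url, n=10, timeout=10):
    """Fire n concurrent GETs; return (status_or_error_name, elapsed) per request."""
    def one(_):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()  # drain the body so "0 bytes received" shows up as an error
                return resp.status, time.monotonic() - start
        except Exception as exc:
            # timeouts surface here as e.g. TimeoutError / URLError
            return type(exc).__name__, time.monotonic() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(one, range(n)))

# usage (placeholder hostname):
# for status, elapsed in probe("https://myapp.example/health"):
#     print(f"{status} in {elapsed:.2f}s")
```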
Were you actually hitting the iad edge? We had some routing issues just now that directed some traffic from the US to Tokyo. This has been resolved and I’m wondering if that’s related. Are you still seeing these timeouts? If you are, can you run curl -v -H 'flyio-debug: doit' against your app and share one successful response?
I no longer see the timeouts. Here’s a successful response with debug headers:
flyio-debug: {"n":"edge-cf-iad2-bf81","nr":"iad","ra":"149.88.18.223","rf":"Verbatim","sr":"fra","sdc":"fra2","sid":"8ed724b7ee4568","st":0,"nrtt":0,"bn":"worker-cf-fra2-7cbc","mhn":"edge-cf-fra2-ad2e","mrtt":87}
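In case it’s useful to anyone else reading: the header value is plain JSON, so it can be pulled apart with a few lines of Python. The field meanings in the comments are my reading of the sample (edge region, serving region, machine ID, edge-to-machine RTT), not official documentation:

```python
import json

# flyio-debug header value copied from the curl output above
header = ('{"n":"edge-cf-iad2-bf81","nr":"iad","ra":"149.88.18.223",'
          '"rf":"Verbatim","sr":"fra","sdc":"fra2","sid":"8ed724b7ee4568",'
          '"st":0,"nrtt":0,"bn":"worker-cf-fra2-7cbc",'
          '"mhn":"edge-cf-fra2-ad2e","mrtt":87}')

info = json.loads(header)
# "nr" looks like the edge region that terminated TLS, "sr" the region of the
# machine that served the request, "sid" the machine ID, "mrtt" the RTT in ms.
print(f"edge={info['nr']} backend={info['sr']} "
      f"machine={info['sid']} rtt={info['mrtt']}ms")
```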
Everything looks good now.
Could the earlier TLS handshake failures (Mar 7-9) our users experienced also have been caused by similar routing issues?
The routing issue today was quite brief and was fixed pretty quickly. We did have a slightly longer incident of this kind on Mar 5 (which you can find on our statuspage), so I’m really not sure what was going on for your users between Mar 7-9. I wonder if it’s possible to have them run a curl like this to identify exactly which regions they’re hitting? As mentioned before, due to the sheer amount of noise we have on our edges (scanning bots etc.), and the fact that the metric counts the entire TLS handshake (so any client-side delay or error shows up there too), it’s not really easy to tell from TLS handshake metrics alone whether something weird is happening.
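If it’s easier to hand users a script than a curl one-liner, the same check can be sketched in Python. This is an assumption-laden sketch: the hostname is a placeholder, the 'flyio-debug: doit' request header comes from the earlier reply, and the "nr"/"sr" fields are interpreted per the sample response above:

```python
import json
import urllib.request

def where_am_i_routed(url):
    """Send 'flyio-debug: doit' and return the (edge, backend) regions
    reported in the flyio-debug response header, or None if absent."""
    req = urllib.request.Request(url, headers={"flyio-debug": "doit"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        raw = resp.headers.get("flyio-debug")
    if raw is None:
        return None
    info = json.loads(raw)
    return info.get("nr"), info.get("sr")

# usage (placeholder hostname):
# print(where_am_i_routed("https://myapp.example/health"))
```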
Thanks for looking into this. I’ll ask both users to run this curl next time they see the issue.
I’ve also since allocated a dedicated IPv4 (we were on shared before), so hopefully that helps.
I’ll report back with their debug output if the issue recurs. Thank you!