It’s looking like it has something to do with the handshake?
I’ve confirmed that all devices on the user’s network have the same slow interactions with these fly apps, and had them do some curls and traceroutes to figure out where things are going awry.
With the following CURL command against my app:
curl -v -w "\n\nTime Total: %{time_total}s\nTime Namelookup: %{time_namelookup}s\nTime Connect: %{time_connect}s\nTime Pretransfer: %{time_pretransfer}s\nTime Starttransfer: %{time_starttransfer}s\n" -o /dev/null -s "https://blork.bscotch.net"
On my machine I get:
Time Total: 0.090120s
Time Namelookup: 0.003008s
Time Connect: 0.028445s
Time Pretransfer: 0.059484s
Time Starttransfer: 0.090084s
But on their machines they see things like:
Time Total: 9.002770s
Time Namelookup: 0.007537s
Time Connect: 0.047108s
Time Pretransfer: 8.045415s
Time Starttransfer: 8.449210s
Running the same against debug.fly.dev
isn’t as bad as that, but still 2-5x slower than running it against any other sites (and only when run from this user’s devices).
Notably, the first run is always the slowest and subsequent runs are still slow but significantly better.
I had them do a tracert to one of my Fly apps, which yielded this (IPs replaced with X.Y.Z
):
1 1 ms 4 ms 1 ms router.asus.com [X.Y.Z]
2 1 ms 1 ms 1 ms dsldevice.attlocal.net [X.Y.Z]
3 2 ms 2 ms 2 ms X-Y-Z.lightspeed.stlsmo.sbcglobal.net [X.Y.Z]
4 2 ms 2 ms 2 ms X.Y.Z
5 * * * Request timed out.
6 * * * Request timed out.
7 * * * Request timed out.
8 * * * Request timed out.
9 77 ms 13 ms 13 ms X.Y.Z
10 77 ms 13 ms 13 ms be-2111-pe11.350ecermak.il.ibone.comcast.net [X.Y.Z]
11 * * * Request timed out.
12 * * * Request timed out.
13 * 43 ms 42 ms X.Y.Z
They also get nearly identical results for tracerouting debug.fly.dev
.
My trace for the same app is, if anything, worse (but I’m not having any performance problems despite that).
I added durations to my HTTP logs, and found that the slow requests take the same amount of time on-server as everyone else’s, which implies that the slowness occurs before it hits my actual app (or once it leaves).
Digging through the Fly logs, I noticed that the Handshake times include a bunch of really slow entries:
I had the user reload the app in a fresh browser tab, which was again slow as expected, and saw a new Handshake entry appear in this table in the 200ms-1s bucket. I had them repeat it a few times and each time saw another entry in the slow buckets.
Notably, the app uses a websocket connection for realtime updates (reading data) but uses plain requests to send changes. When a user first loads the page, the websocket connection gets established and then a regular request goes out to fetch the initial data. For the affected user, both of these steps are really slow on a fresh load. I assume that both require separate handshakes, which would be consistent with the handshake process being slow for some reason. Further, if they refresh right after the first load, things are still a little slow but much faster than the first time. But if they do so in a new tab, things are very slow again.
The server is configured to stay up even when there’s no traffic, so this isn’t a cold start problem. The app doesn’t have many users (it’s internal for a small team) but they’re all using it all day and I was using the app myself while having the impacted user try things, so even if there could be a cold start problem it wouldn’t be applicable here.
Does it seem like I’m on the right track here with the handshake being the problem? Are there config changes I could make to the apps (their TLS settings?) that might fix the issue? This just started late last week, so maybe a change at Fly could explain this?