Application slow for a single user... how?

I’ve got two apps running on Fly, and one of my users has really slow interactions with both of those apps, while all other users are fine. This started about a week ago, before that they weren’t having any trouble. It’s an internal app and that user is on my team, so we’ve been debugging together.

They have a faster internet connection than I do and a similar machine setup (latest Windows 11, fast hardware, no antivirus or anything else weird running). They’re also geographically close to me (a few miles away).

They’re getting slow interactions in Chrome, Edge, and a non-browser client application.

They only seem to be having this trouble with these fly.io apps. Our other internal services are older AWS applications, which run fine, and there isn’t any obvious trouble with any other website or service. So it seems to be fly.io-specific?

I just cannot for the life of me guess why we’re having such disparate experiences. Everything seems to point to some kind of partial incompatibility with their device, local network, or ISP rather than with a particular client. Anyone have any insight into what could be causing this, or how we’d go about diagnosing it? Has anything changed in the last week with how Fly machines/apps talk to clients?

I figured I’d try experimenting with the http_service.http_options and http_service.tls_options configurations, but I can’t find info about what the defaults are for those and am not that familiar with how things can go wrong there.
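
In the meantime, one thing I can at least do is check what’s actually being negotiated right now, independent of whatever the defaults are documented as. A rough sketch, assuming openssl is available (e.g. it comes with Git for Windows):

# Shows the negotiated TLS version, cipher, and ALPN protocol for the app's endpoint.
openssl s_client -connect blork.bscotch.net:443 -servername blork.bscotch.net -alpn h2,http/1.1 < /dev/null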

Hi,

Strange. Network issues are always hard to debug because of the number of variables. For Fly to debug this, they would probably need a traceroute to the host.

One thing you could check is whether they are only having issues with those two Fly apps, or with all Fly apps. If it’s just those two, it would point at the app/code being the issue and/or those specific regions. If it’s all apps, it would suggest a network/routing issue. There is a https://debug.fly.dev/ page they could perhaps load. It reveals their IP etc., but you could both load it and compare how fast it is to load. Also check what region the page is being served from. It should be the closest one, but it can end up being one far away. That could perhaps explain why it’s slow for them, if you’re getting a response from a closer region than they are.
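
A quick way to compare could be something like this (just a sketch): dump the response headers and the total time. In my experience the fly-request-id response header ends with the region code that served the request, so this shows both the timing and the region in one go.

# Print response headers (look for fly-request-id) plus the total time.
curl -sS -o /dev/null -D - -w "total: %{time_total}s\n" https://debug.fly.dev/

If you both run it, you can compare the times and the regions directly.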

Another idea could be to test their connection with a tool like https://speed.cloudflare.com/ That one is more comprehensive, as it tests latency, which server (Cloudflare, of course, in this case) is being connected to, the ISP, etc., as well as the speed. If they see any weirdness there, that’s worth noting, since Fly’s servers are probably (total guess!) located somewhere near Cloudflare’s.

Thanks for the tips! I’ll give all of that a try.

Hi @adam-coster,

Even if you and the other user are geographically close, you might be using different ISPs, and their ISP may have a wonky route to Fly.io servers, which could explain the behavior you’re seeing.

You can try doing a traceroute or mtr to get a report of where the packets are going. If you and your user do it and then compare the routes, it might point to strangeness in how traffic is being routed.

If you’re able to share your user’s traceroute, we can check on our side whether anything needs adjusting - though in many cases it’s actually the ISP using some weird routing configuration and sending packets clear across the continent or something.

mtr -bzo "LSD NABWV" -rw debug.fly.dev should do it.
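
If mtr isn’t available on their machine (it isn’t built into Windows), a plain tracert to the same host is a reasonable fallback, though it won’t show per-hop packet loss the way mtr does. The -d flag just skips reverse-DNS lookups so it finishes faster:

tracert -d debug.fly.dev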

Regards,

  • Daniel

It’s looking like it has something to do with the handshake?

I’ve confirmed that all devices on the user’s network have the same slow interactions with these fly apps, and had them do some curls and traceroutes to figure out where things are going awry.

With the following curl command against my app:

curl -v -w "\n\nTime Total: %{time_total}s\nTime Namelookup: %{time_namelookup}s\nTime Connect: %{time_connect}s\nTime Pretransfer: %{time_pretransfer}s\nTime Starttransfer: %{time_starttransfer}s\n" -o /dev/null -s "https://blork.bscotch.net"

On my machine I get:

Time Total: 0.090120s
Time Namelookup: 0.003008s
Time Connect: 0.028445s
Time Pretransfer: 0.059484s
Time Starttransfer: 0.090084s

But on their machines they see things like:

Time Total: 9.002770s
Time Namelookup: 0.007537s
Time Connect: 0.047108s
Time Pretransfer: 8.045415s
Time Starttransfer: 8.449210s

Running the same command against debug.fly.dev isn’t as bad as that, but it’s still 2-5x slower than running it against other sites (and only when run from this user’s devices).

Notably, the first run is always the slowest and subsequent runs are still slow but significantly better.
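
For reference, repeating the measurement looks something like this (a sketch, assuming a bash-like shell such as Git Bash). Each iteration is a separate curl process, so every run pays for a fresh TCP and TLS handshake:

for i in 1 2 3 4 5; do curl -s -o /dev/null -w "run $i: %{time_total}s\n" "https://blork.bscotch.net"; done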

I had them do a tracert to one of my Fly apps, which yielded this (IPs replaced with X.Y.Z):

  1     1 ms     4 ms     1 ms  router.asus.com [X.Y.Z]
  2     1 ms     1 ms     1 ms  dsldevice.attlocal.net [X.Y.Z]
  3     2 ms     2 ms     2 ms  X-Y-Z.lightspeed.stlsmo.sbcglobal.net [X.Y.Z]
  4     2 ms     2 ms     2 ms  X.Y.Z
  5     *        *        *     Request timed out.
  6     *        *        *     Request timed out.
  7     *        *        *     Request timed out.
  8     *        *        *     Request timed out.
  9    77 ms    13 ms    13 ms  X.Y.Z
 10    77 ms    13 ms    13 ms  be-2111-pe11.350ecermak.il.ibone.comcast.net [X.Y.Z]
 11     *        *        *     Request timed out.
 12     *        *        *     Request timed out.
 13     *       43 ms    42 ms  X.Y.Z

They also get nearly identical results for tracerouting debug.fly.dev.

My trace for the same app is, if anything, worse (but I’m not having any performance problems despite that).

I added durations to my HTTP logs and found that the slow requests take the same amount of time on-server as everyone else’s, which implies that the slowness occurs before the request reaches my actual app (or after the response leaves it).

Digging through the Fly logs, I noticed that the Handshake times include a bunch of really slow entries.

I had the user reload the app in a fresh browser tab, which was again slow as expected, and saw a new Handshake entry appear in the 200ms-1s bucket. I had them repeat it a few times, and each time saw another entry land in one of the slow buckets.

Notably, the app uses a websocket connection for realtime updates (reading data) but uses plain requests to send changes. When a user first loads the page, the websocket connection gets established and then a regular request goes out to fetch the initial data. For the affected user, both of these steps are really slow on a fresh load. I assume that both require separate handshakes, which would be consistent with the handshake process being slow for some reason. Further, if they refresh right after the first load, things are still a little slow but much faster than the first time. But if they do so in a new tab, things are very slow again.
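
One way to isolate the handshake cost (a sketch): hit the app twice in a single curl invocation. curl reuses the already-established connection for the second transfer, so only the first one pays for the TCP and TLS handshakes, and the second transfer’s handshake time should come back near zero. The -w line is printed once per transfer, so you get two lines to compare:

curl -sS -o /dev/null -o /dev/null -w "handshake: %{time_appconnect}s  total: %{time_total}s\n" "https://blork.bscotch.net" "https://blork.bscotch.net"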

The server is configured to stay up even when there’s no traffic, so this isn’t a cold start problem. The app doesn’t have many users (it’s internal for a small team) but they’re all using it all day and I was using the app myself while having the impacted user try things, so even if there could be a cold start problem it wouldn’t be applicable here.


Does it seem like I’m on the right track here with the handshake being the problem? Are there config changes I could make to the apps (their TLS settings?) that might fix the issue? This just started late last week, so maybe a change at Fly could explain this?

Thanks for the great level of detail!

One thing I didn’t see in your curl diagnostics is the SSL handshake timing. Can you try this?

curl -I -H "flyio-debug: doit" -w "dns_resolution: %{time_namelookup}, tcp_established: %{time_connect}, ssl_handshake_done: %{time_appconnect}, TTFB: %{time_starttransfer}\n" https://blork.bscotch.net

If you can also share the request-id response header’s value that’d be helpful for me to trace things on our side.

I think it’s unlikely we made a config or routing change that affects only one user in one specific region and nobody else; what we’ve seen is ISPs or carriers making those kinds of changes in weird ways that affect users in very specific regions. E.g. we’ve seen “users on T-Mobile in Seattle have trouble accessing Fly.io sites”, and it came down to routing changes made by T-Mobile, not something on our side.

  • Daniel

I had the impacted user install and enable Cloudflare’s “Warp” VPN to see if that would resolve the issue, and it looks like it does! I think that implies the problem is somewhere along the journey, in the ISP or something, and on neither our side nor Fly’s.

For completeness, with the VPN turned OFF the affected user gets:

For https://debug.fly.dev:

dns_resolution: 0.083132, tcp_established: 0.120403, ssl_handshake_done: 0.399408, TTFB: 0.534414

For the Fly app at https://blork.bscotch.net:

dns_resolution: 0.004833, tcp_established: 1.054027, ssl_handshake_done: 1.151950, TTFB: 1.194600

With the VPN turned ON the user gets:

For https://debug.fly.dev:

dns_resolution: 0.021978, tcp_established: 0.025030, ssl_handshake_done: 0.057870, TTFB: 0.123220

For the Fly app at https://blork.bscotch.net:

dns_resolution: 0.004741, tcp_established: 0.007424, ssl_handshake_done: 0.034780, TTFB: 0.065747


So there’s a huge improvement for both targets when using the VPN, but especially for my app. Both my app and Fly’s dev site work much better with the VPN turned on, but my own Fly app goes from being extremely bad to basically the same as (or better than) Fly’s dev site.

Anyway, am I correct in interpreting this as an ISP or other issue between the impacted user and Fly, rather than an issue on either Fly’s or the user’s end?

Hi @adam-coster -

I agree with the assessment that it’s likely the path packets are taking from your user, through their ISP (likely some fault here) to the closest Fly.io edge.

Just to rule out a single wonky edge on our side, I wonder if you could ask your user to do:

curl -I -H "flyio-debug: doit" https://blork.bscotch.net

(using no VPN, so we get the full info from the slow-handshake attempts).

The output will look like this

fly-region: qro
remote-addr: 172.16.128.186:48070
date: Thu, 01 Aug 2024 16:49:53 GMT
content-length: 621
content-type: text/plain; charset=utf-8
server: Fly/9fe23f3e1 (2024-07-31)
via: 2 fly.io
fly-request-id: 01J47DJ5HTYCF0YHFP60E4DMRK-qro
flyio-debug: {"n":"edge-cf-qro1-eb90","nr":"qro","ra":"long-ipv6-address","rf":"Verbatim","sr":"qro","sdc":"qro1","sid":"9e784ee93f9783","st":0,"nrtt":0,"bn":"worker-cf-qro1-a775","fbn":null}

What I’d like to have is the fly-request-id and flyio-debug headers/values from the response, to see which edge is processing these requests. The flyio-debug header will contain your user’s public IP address in the “ra” element; it’d be useful to have, but if you prefer to remove it for privacy reasons that’s also fine :) I care most about the request-id and the “n” element in the flyio-debug thingy (that’ll be the edge handling these requests).
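
If it’s easier, something like this should print just the two headers I’m after (a sketch; assumes a grep-capable shell such as Git Bash):

curl -sI -H "flyio-debug: doit" https://blork.bscotch.net | grep -iE "fly-request-id|flyio-debug"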

To be clear, it’s unlikely to be an edge because I think otherwise we’d be seeing a lot of reports from users hitting the same edge as your user, which has not been the case.

Thanks!

Their output didn’t quite match what you showed there (no debug object, in particular), but here’s the result:

x-powered-by: Express
vary: Origin
access-control-allow-credentials: true
x-robots-tag: noindex, nofollow
blork-server-version: 0.15.0
blork-site-version: 0.16.0
content-length: 7322
content-type: text/html
etag: "y473c2"
x-sveltekit-page: true
date: Thu, 01 Aug 2024 17:15:08 GMT
connection: keep-alive
keep-alive: timeout=5
server: Fly/9fe23f3e1 (2024-07-31)
via: 1.1 fly.io
fly-request-id: 01J47F0D5M6BZ4H9MJMR0PY7ZZ-ord

With the VPN turned back on everything is the same, except the new ID looks like it’s from Chicago:

fly-request-id: 01J47EYDB337G2FMXNRN20YYJT-chi

We’re running curl from Git Bash on Windows, which might have something to do with the difference in output?

Hi Adam,

You’d get that output (i.e. no flyio-debug item) if you’re not passing the "flyio-debug: doit" header. Can you double-check? Pass this to curl:

-H "flyio-debug: doit"

This works for me on Linux; there’s a slight possibility it behaves differently on Windows, but I don’t think so (sorry, I haven’t touched Windows in decades!).

Both chi and ord are in the Chicago region; it has two identifiers, the obvious chi and the non-obvious ord (the airport code for Chicago O’Hare!). Just knowing the requests go through Chicago is useful, but if you do manage to get the extra output with curl, that’d be mega useful.

  • Daniel

Confirmed I had the command right; apparently the curl that comes with Git Bash for Windows doesn’t quite work the same.

PowerShell has its own thing though, so I had them run the following command:

$response = Invoke-WebRequest -Uri "https://blork.bscotch.net" -Headers @{ "flyio-debug" = "doit" }
$response.Headers | Format-List

This yielded the following (relevant) headers with the VPN off:

fly-request-id: 01J47KQ3TMR7DNM7F009Z0BXWT-ord
flyio-debug: {"n":"edge-cf-ord1-7883","nr":"ord","ra":"99.145.176.20","rf":"Verbatim","sr":"ord","sdc":"ord1","sid":"811370f94d5998","st":0,"nrtt":2,"bn":"worker-cf-ord1-d837","fbn":null}

And with the VPN on:

fly-request-id: 01J47KNT8YS4VG25X256XXYJNH-chi
flyio-debug: {"n":"edge-nac-chi1-eba3","nr":"chi","ra":"2a09:bac1:76c0:ac78::5e:20","rf":"Verbatim","sr":"ord","sdc":"ord1", "sid":"811370f94d5998","st":0,"nrtt":2,"bn":"worker-cf-ord1-d837","fbn":null}

And they confirmed that with the VPN off requests are still very slow.


I should have recognized ORD since I went to college in Chicago… that was some years ago though!

Hi @adam-coster 👋

Infra team here – we have just made some adjustments in our network. Could you check whether things have changed (for better or for worse) for you / this user? Can you redo the debug requests above and show us what flyio-debug looks like now?

Hi!

The user confirms that the app is no longer slow for them, with or without VPN.

Without VPN, they now get the following:

Key   : fly-request-id
Value : 01J47SCNXH2JY8XQRSS68XFYB1-iad

Key   : flyio-debug
Value : {"n":"edge-cf-iad2-bf81","nr":"iad","ra":"99.145.176.20","rf":"Verbatim","sr":"ord","sdc":"ord1","sid":"811370f94d5998","st":0,"nrtt":20,"bn":"worker-cf-ord1-d837","fbn":null}

So it looks like it’s now hitting a different edge server?

@adam-coster Yes, we suspect that there is a routing issue between the customer and the Chicago edge servers. Any chance the user can run a traceroute or mtr to the IP of your app? That would also help us diagnose what might be wrong here.

Are you looking for something different than the traceroute I posted higher up the thread? Just want to make sure I collect whatever’s useful.

The user is on a Windows machine and not particularly familiar with CLI tools (and definitely not WSL), so we’re stuck with whatever I can easily walk them through. Pasting stuff into PowerShell or Git Bash is doable, but not much more than that.

The traceroute data earlier in the thread is from Windows’ tracert. I’m not sure how much that differs from the Linuxy traceroute.
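
If it helps, the most I can realistically walk them through is pasting a one-liner and sending me the resulting file, something like the sketch below (tracert is built into Windows, and -d skips reverse-DNS lookups):

tracert -d blork.bscotch.net > trace.txt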

@adam-coster Sorry! I forgot about the original traceroute, but I was also wondering whether changing the destination edge also swapped out some of the bad routes in the middle (if there were any). In an ideal world, I would love to have an mtr from both before and after, because that would show losses in addition to the hops, but I understand this is nowhere near straightforward for non-technical users.

We do have a hypothesis about why the original Chicago edge didn’t work well for the user - we will also be digging deeper into this on our side.

Gotcha. Not sure if this is helpful, but it’s probably the last info I can provide:

I had the user run tracert against the app to compare to the earlier data. Everything looked the same except the last few entries:

When things were slow:

 10    77 ms    13 ms    13 ms  be-2111-pe11.350ecermak.il.ibone.comcast.net [96.110.33.194]
 11     *        *        *     Request timed out.
 12     *        *        *     Request timed out.
 13     *       43 ms    42 ms  66.241.124.49

Now that things are fine:

 10    96 ms    43 ms    30 ms  be-3212-pe12.111eighthave.ny.ibone.comcast.net [96.110.34.38]
 11    92 ms    30 ms    28 ms  96-87-9-42-static.hfc.comcastbusiness.net [96.87.9.42]
 12    28 ms    28 ms    27 ms  66.241.124.49
