Slow response times?

I’m seeing intermittent, really slow response times for our application deployed in EWR, and I’m also having a hard time connecting to https://fly.io, but the status page says all systems are operational. Is this an error with Fly just on my end? Any suggestions for how to debug?

Will you try this with curl?

curl -v -o /dev/null -sS https://<url>

I’m curious if it’s the TLS handshake or something else.
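
If it helps split that out, here’s a rough sketch using curl’s standard write-out timings (nothing Fly-specific, same placeholder URL):

# breaks the request down into DNS lookup, TCP connect, TLS handshake, first byte, and total
curl -o /dev/null -sS https://<url> \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n'

If the tls number is where most of the time goes, that points at the handshake rather than the app.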

Getting 200’s just fine that way. It’s really hard to pin this down because it’s not like the app is inaccessible or consistently slow, but if you look at https://c1255139-2947-4a05-98dc-6ca56ddda3d5.site.hbuptime.com/, do you see those response time humps from earlier today, and ongoing right now?

Do you know which region it’s actually hitting, by chance? If you visit debug.fly.dev you’ll see a FLY_REGION header. Your app is in ewr, but connections could be getting routed to another region.
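
If you’d rather check from the command line, something like this should surface it too (I’m grepping for “region” in both the response headers and the body, since the exact header name and casing may vary):

# dump response headers and body, then look for the region Fly reports
curl -s -D - https://debug.fly.dev | grep -i region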

That page says LGA

and EWR

It seems like the app itself might be responding slowly:

As far as we can tell, there’s no performance issue caused by the load balancing. When you had problems connecting to fly.io, what was happening?

It just timed out to a blank HTML page eventually. It’s happening again for me right now :face_with_monocle: (https://fly.io, that is)

I typed both URLs into my browser, and it took about 15 seconds for both app.ressemble.com and fly.io to load, and they eventually loaded at the same time…

Is it possible there’s some routing error here that’s local to my geographic area?

That screenshot you posted has response times in the 40ms range, yeah? That would be fine. We’re talking 10+ seconds to load a page here.

That screenshot is actually 40 seconds, not ms. Definitely very slow responses.

Will you run a traceroute to fly.io and paste the output?

Yeah 40 seconds is definitely not acceptable :frowning:

traceroute -I fly.io
 1  192.168.1.1 (192.168.1.1)  2.136 ms  1.670 ms *
 2  10.240.162.101 (10.240.162.101)  8.520 ms  6.838 ms  9.678 ms
 3  67.59.235.58 (67.59.235.58)  13.085 ms  12.833 ms  12.485 ms
 4  ool-4353dd18.dyn.optonline.net (67.83.221.24)  9.627 ms  21.917 ms  15.457 ms
 5  451be060.cst.lightpath.net (65.19.99.96)  14.415 ms  14.794 ms  17.509 ms
 6  64.15.2.94 (64.15.2.94)  14.811 ms  16.067 ms  14.577 ms
 7  * * *
 8  zayo.ntt.ter1.ewr1.us.zip.zayo.com (64.125.15.85)  13.353 ms  14.623 ms  21.722 ms
 9  ae-1.r20.nwrknj03.us.bb.gin.ntt.net (129.250.6.52)  27.864 ms  16.011 ms  19.263 ms
10  ae-0.a01.nycmny17.us.bb.gin.ntt.net (129.250.3.153)  13.421 ms  12.675 ms  17.250 ms
11  * * *
12  * * *
13  213.188.199.153 (213.188.199.153)  13.433 ms  18.703 ms  12.175 ms

(The traceroute over UDP still hasn’t finished; it’s just an endless list of * * *. I assume it’ll run to 64 hops and then stop.)

That traceroute looks fine. It’s pretty weird that you’re getting slow responses from fly.io, though. If you get another blank page will you pop open the web inspector and see what the network tab says about it?

The slow response times in that graph are actually at the app level. I don’t think they’re related to us (it would be hard for us to slow that particular metric down), but the other things you’re seeing are suspicious.

Well, the app’s database is also on Fly in the same region, so my guess was that the app wasn’t responding until it got a response back from the database (which could cause that graph, no?)

Response times are looking back to normal now. Very odd, and hard for me to grok why a Phoenix app would just start being slow (intermittently!) for a period of time one day and then stop.

It is unlikely that the delay is between the app and the database. Not impossible, but in-region networking is as simple as it gets. The complexity is all in the path from you → our proxy → your app.

We still haven’t found anything to indicate what the problem is, though. Still looking!

Thank you! Any idea if there’s other monitoring I could enable on my end to see if it’s for sure the app itself? (probably out of scope but worth asking :slight_smile: )

The simplest thing to do is expose your own Prometheus metrics to our scraper, then hook up a Grafana dashboard. That will let you see what Phoenix itself is doing (our metrics only show what our proxy observes).
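
If it helps, the wiring for that is roughly a [metrics] section in fly.toml that points our scraper at whatever port and path your Phoenix metrics endpoint listens on; the values below are just placeholders for your setup:

  [metrics]
    port = 9091
    path = "/metrics"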

We heard from multiple customers today that response times were extraordinarily slow as well. Our Node.js monitoring doesn’t record any anomalous response times, but the Fly.io metrics do show high latency.

We tracked down a bottleneck that might have been the cause of some of these. It’s hard to tell app by app, but we did find latency spikes between our edge proxies and the worker servers that run app VMs. These should have improved substantially over the last day.

If you’re getting weird latency spikes, check the concurrency setting in your app config to make sure it has type = "requests" defined, like this:

  [services.concurrency]
    hard_limit = 500
    soft_limit = 250
    type = "requests"

If there’s no type defined, or it’s set to "connections", your requests might get stuck waiting for a whole new TCP connection to be established. With "requests", an existing connection pool is reused, which should help mitigate this problem.
