Slow response times?

I’m seeing intermittently very slow response times for our application deployed in EWR, and I’m also having a hard time connecting to it, but the status page says all systems are operational. Is this a Fly issue, or an error just on my end? Any suggestions for how to debug?

Will you try this with curl?

curl -v -o /dev/null -sS https://<url>

I’m curious if it’s the TLS handshake or something else.

Getting 200s just fine that way. It’s really hard to pin this down because the app isn’t inaccessible or consistently slow, but if you look at: see those response-time humps earlier today, and ongoing right now?

Do you know which region it’s actually hitting, by chance? If you visit it you’ll see a FLY_REGION header. Your app is in ewr, but connections could be landing in another location.
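You can also check from the terminal by dumping just the response headers. A sketch, assuming the region header appears in the response (example.com is a stand-in for your app’s URL and won’t actually send one, hence the fallback):

```shell
# Dump only the response headers and filter for the region header.
# example.com stands in for the app's URL and won't return a
# fly-region header; the '|| echo' keeps the pipeline from failing
# when nothing matches.
curl -sSI https://example.com | tr -d '\r' | grep -i '^fly-region' \
  || echo 'no fly-region header found'
```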

That page says LGA

and EWR

It seems like the app itself might be responding slowly:

As far as we can tell, there’s no performance issue caused by the load balancing. When you had problems connecting, what was happening?

It just eventually timed out to a blank HTML page. It’s happening again for me right now :face_with_monocle: ( that is)

I typed both URLs into my browser; each took about 15 seconds to load, and they eventually loaded at the same time…

Is it possible there’s some routing error here that’s local to my geographic area?

That screenshot you posted has response times in the 40ms range, yeah? That would be fine. We’re talking 10+ seconds to load a page here.

That screenshot is actually 40 seconds, not ms. Definitely very slow responses.

Will you run a traceroute and paste the output?

Yeah 40 seconds is definitely not acceptable :frowning:

traceroute -I
 1 (  2.136 ms  1.670 ms *
 2 (  8.520 ms  6.838 ms  9.678 ms
 3 (  13.085 ms  12.833 ms  12.485 ms
 4 (  9.627 ms  21.917 ms  15.457 ms
 5 (  14.415 ms  14.794 ms  17.509 ms
 6 (  14.811 ms  16.067 ms  14.577 ms
 7  * * *
 8 (  13.353 ms  14.623 ms  21.722 ms
 9 (  27.864 ms  16.011 ms  19.263 ms
10 (  13.421 ms  12.675 ms  17.250 ms
11  * * *
12  * * *
13 (  13.433 ms  18.703 ms  12.175 ms

(Traceroute over UDP still hasn’t finished; it’s just an endless list of * * * — I assume it’ll go to 64 hops and then stop.)

That traceroute looks fine. It’s pretty weird that you’re getting slow responses, though. If you get another blank page, will you pop open the web inspector and see what the Network tab says about it?

The slow response times in that graph are actually at the app level. I don’t think they’re related to us (it would be hard for us to slow that particular metric down), but the other things you’re seeing are suspicious.

Well, the app’s database is also on Fly in the same region, so my guess was that the app wasn’t responding until it got a response back from the database (which could cause that graph, no?)

Response times are back to normal now. Very odd, and hard for me to grok why a Phoenix app would just start being slow (intermittently!) for a period of time one day and then stop.

It’s unlikely that the delay is between the app and the database. Not impossible, but in-region networking is as simple as it gets. The complexity is all in the path from you → our proxy → your app.

We still haven’t found anything to indicate what the problem is, though. Still looking!

Thank you! Any idea if there’s other monitoring I could enable on my end to see if it’s for sure the app itself? (probably out of scope but worth asking :slight_smile: )

The simplest thing to do is expose your own Prometheus metrics to our scraper, then hook up a Grafana dashboard. That will let you see what Phoenix itself is doing (our metrics only reflect what our proxy observes).
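As a sketch, assuming the Phoenix app already serves Prometheus metrics on port 9091 at /metrics (for example via the PromEx library — port and path here are assumptions, not your actual values), the fly.toml side would look something like:

```toml
# fly.toml — tell Fly's scraper where the app exposes Prometheus metrics.
# The port and path are assumptions; match them to whatever your metrics
# endpoint actually listens on.
[metrics]
  port = 9091
  path = "/metrics"
```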

We heard from multiple customers today that response times were extraordinarily slow as well. Our Node.js monitoring doesn’t record any anomalous response times, but the Fly metrics do show high latency.

We tracked down a bottleneck that might have been the cause of some of these reports. It’s hard to tell app by app, but we did find latency spikes between our edge proxies and the worker servers that run app VMs. These should have improved substantially over the last day.

If you’re getting weird latency spikes, check the concurrency section of your app config to make sure it has type = "requests" defined, like this:

    [services.concurrency]
      hard_limit = 500
      soft_limit = 250
      type = "requests"

If there’s no type defined, or it’s set to "connections", your requests might get stuck waiting for a brand-new TCP connection. "requests" reuses an existing connection pool, which should help mitigate this problem.