Is Fly.io slower today?

I’m in the process of finally migrating from our old infrastructure on DigitalOcean to our new setup on Fly.io, but I’ve noticed that a basic health endpoint that used to average ~50ms is now taking up to ~300ms. (Both measurements are against our setup on Fly.io.)

Is the influx of customers from Heroku affecting things?

Shouldn’t be! Can you give us more details about the health endpoint and where you’re checking it from?

Yeah, you can take a look at it here: https://fly-api.usepastel.com/v1/ping
I’m checking it from Toronto, Canada.

I’ve also got another service I’m seeing a similar effect with here: https://api.geo-proxy.usepastel.com/ping
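For reference, I’m measuring these as plain timed GETs against those endpoints, roughly like the sketch below (illustrative only, not our actual monitoring; it just assumes Node 18+ where fetch and performance are globals):

// time a handful of GETs against the health endpoint
const url = "https://fly-api.usepastel.com/v1/ping";

for (let i = 0; i < 5; i++) {
  const start = performance.now();
  const res = await fetch(url);
  await res.text(); // drain the body so we time the full response
  // note: the first request also pays DNS + TLS setup; later ones may reuse the connection
  console.log(`${res.status} in ${(performance.now() - start).toFixed(0)} ms`);
}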

Can you check and see if you’re still getting routed to yyz? https://debug.fly.dev will show you.

I’m getting ~40ms responses from our Toronto test hosts.

Looks like I’m getting routed to ewr.

=== Headers ===
Host: debug.fly.dev
Via: 2 fly.io
Sec-Ch-Ua-Platform: "macOS"
X-Request-Start: t=1661805169075764
Sec-Fetch-Mode: navigate
Fly-Client-Ip: 2607:fea8:4e20:9200:bce9:9ed7:b592:a4eb
X-Forwarded-Proto: https
Fly-Request-Id: 01GBNMX2DKSRAM2Y8DEV2YFCND-lga
Fly-Forwarded-Port: 443
Fly-Region: lga
Sec-Ch-Ua: "Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"
Sec-Fetch-Site: cross-site
Sec-Fetch-User: ?1
Referer: https://community.fly.io/
Accept-Language: en-US,en;q=0.9,de;q=0.8
X-Forwarded-For: 2607:fea8:4e20:9200:bce9:9ed7:b592:a4eb, 2a09:8280:1:763f:8bdd:34d1:c624:78cd
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Fly-Forwarded-Proto: https
Fly-Forwarded-Ssl: on
Sec-Ch-Ua-Mobile: ?0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.101 Safari/537.36
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
X-Forwarded-Ssl: on
X-Forwarded-Port: 443

=== ENV ===
FLY_ALLOC_ID=a83d93b7-5110-8fae-8c3b-54b87b2b087c
FLY_APP_NAME=debug
FLY_PUBLIC_IP=2604:1380:45d1:2801:0:a83d:93b7:1
FLY_REGION=ewr
FLY_VM_MEMORY_MB=128
GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D
HOME=/root
LANG=C.UTF-8
PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHON_GET_PIP_SHA256=01249aa3e58ffb3e1686b7141b4e9aac4d398ef4ac3012ed9dff8dd9f685ffe0
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d781367b97acf0ece7e9e304bf281e99b618bf10/public/get-pip.py
PYTHON_PIP_VERSION=21.2.4
PYTHON_SETUPTOOLS_VERSION=57.5.0
PYTHON_VERSION=3.10.0
TERM=linux
WS=this
is
a
test
cgroup_enable=memory

2022-08-29 20:32:49.079877833 +0000 UTC m=+202117.168678613

By the way, if it’s helpful, I’m seeing a similar response time from https://debug.fly.dev/ of ~200ms.

Some more info. My sentry.io tracing seems to indicate that the added latency isn’t coming from the app itself (the slowest transaction it’s seen is under 100ms).

Additionally, I spun up a basic Express server (no DB, no users) in the Toronto region and saw response times of 250-300ms on a “hello world” endpoint.
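For concreteness, the test server was nothing fancier than a hello-world handler roughly along these lines (a sketch, not the exact code):

import express from "express";

const app = express();

// single endpoint, no DB and no middleware, so the handler itself is effectively 0ms
app.get("/", (_req, res) => {
  res.send("hello world");
});

// Fly expects the app to listen on 0.0.0.0 on its internal port
const port = Number(process.env.PORT ?? 8080);
app.listen(port, "0.0.0.0", () => console.log(`listening on ${port}`));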

Can you also try https://debug.ipv4.fly.dev/? And your <yourapp>.ipv4.fly.dev?

If you are connecting through EWR, you’re basically making two large-ish round trips. 300ms is a little high for that journey, but not shockingly high.

If those ipv4 versions are also slow, please run a traceroute <yourapp>.ipv4.fly.dev and share it with us. We should figure out why you’re getting routed to the wrong city.

Wild, the IPv4 versions are all in the 40-50ms response time range, while the regular ones are in the ~250ms range. Beyond fixing my app, I’m also really curious as to what’s happening here on a technical level.
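In case it’s useful, here’s roughly how I’m comparing the two: a small Node sketch that forces the address family per request (the host and the single-sample timing are only illustrative):

import https from "node:https";

// time one HTTPS GET, forcing IPv4 (family: 4) or IPv6 (family: 6)
function timeRequest(host: string, family: 4 | 6): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = performance.now();
    const req = https.get({ host, path: "/", family }, (res) => {
      res.resume(); // discard the body
      res.on("end", () => resolve(performance.now() - start));
    });
    req.on("error", reject);
  });
}

const host = "debug.fly.dev"; // published with both A and AAAA records
console.log(`IPv4: ${(await timeRequest(host, 4)).toFixed(0)} ms`);
console.log(`IPv6: ${(await timeRequest(host, 6)).toFixed(0)} ms`);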

Probably just means we’re routing IPv6 to ewr for you, but IPv4 is still going to Toronto. debug.ipv4.fly.dev probably shows a different region in your browser, too.

Will you run traceroute6 fly-api.usepastel.com and share the output? There’s nothing sensitive in it, but it may help us fix the routing.

Funny enough, debug.ipv4.fly.dev actually shows EWR too, so maybe something more funky is happening.

=== Headers ===
Host: debug.ipv4.fly.dev
Fly-Request-Id: 01GBP5V8BHDQKAQW9M1XJCQCR3-lga
Via: 2 fly.io
Sec-Ch-Ua-Platform: "macOS"
Sec-Fetch-User: ?1
Fly-Forwarded-Proto: https
X-Forwarded-Port: 443
Fly-Forwarded-Ssl: on
Sec-Ch-Ua: "Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Fly-Client-Ip: 99.245.23.18
Accept-Language: en-US,en;q=0.9,de;q=0.8
X-Forwarded-Proto: https
X-Forwarded-Ssl: on
Sec-Ch-Ua-Mobile: ?0
X-Request-Start: t=1661822935410000
Sec-Fetch-Site: cross-site
Accept-Encoding: gzip, deflate, br
Fly-Forwarded-Port: 443
Fly-Region: lga
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.101 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
X-Forwarded-For: 99.245.23.18, 77.83.140.164

=== ENV ===
FLY_ALLOC_ID=b4f180b9-c6b4-6a36-fd20-e178bfc8e1d3
FLY_APP_NAME=debug
FLY_PUBLIC_IP=2604:1380:45d1:5301:0:b4f1:80b9:1
FLY_REGION=ewr
FLY_VM_MEMORY_MB=128
GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D
HOME=/root
LANG=C.UTF-8
PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHON_GET_PIP_SHA256=01249aa3e58ffb3e1686b7141b4e9aac4d398ef4ac3012ed9dff8dd9f685ffe0
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/d781367b97acf0ece7e9e304bf281e99b618bf10/public/get-pip.py
PYTHON_PIP_VERSION=21.2.4
PYTHON_SETUPTOOLS_VERSION=57.5.0
PYTHON_VERSION=3.10.0
TERM=linux
WS=this
is
a
test
cgroup_enable=memory

2022-08-30 01:28:55.413020162 +0000 UTC m=+227496.713843230

Here’s the output from traceroute6 fly-api.usepastel.com:

traceroute6 to fly-api.usepastel.com (2a09:8280:1::5770) from 2607:fea8:4e20:9200:bce9:9ed7:b592:a4eb, 64 hops max, 12 byte packets
 1  2607:fea8:4e20:9200:e2db:d1ff:fe4d:d60c  4.766 ms  4.636 ms  4.452 ms

Just one hop, which seems weird…

I’m wondering: if it’s IPv6 routing causing the latency issue, would removing the AAAA records for my domains solve this?
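(For my own sanity, this is how I’m checking which addresses the domain currently publishes on each family; it’s just Node’s resolver, and the hostname is the same one as above:)

import { resolve4, resolve6 } from "node:dns/promises";

const host = "fly-api.usepastel.com";

// what public DNS currently answers for each address family
console.log("A:   ", await resolve4(host).catch(() => []));
console.log("AAAA:", await resolve6(host).catch(() => []));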

Also if it’s helpful, here’s traceroute fly-api.usepastel.com:

traceroute to fly-api.usepastel.com (37.16.29.204), 64 hops max, 52 byte packets
 1  10.0.0.1 (10.0.0.1)  4.930 ms  4.412 ms  4.494 ms
 2  99.245.22.1 (99.245.22.1)  24.736 ms  15.384 ms  13.067 ms
 3  8081-dgw02.wlfdle.rmgt.net.rogers.com (67.231.222.237)  14.610 ms  16.373 ms  17.377 ms
 4  3132-cgw01.bloor.rmgt.net.rogers.com (209.148.233.185)  21.639 ms
    3032-cgw01.bloor.rmgt.net.rogers.com (209.148.232.41)  16.830 ms
    69.63.249.82 (69.63.249.82)  20.195 ms
 5  209.148.235.210 (209.148.235.210)  85.649 ms  25.118 ms  15.983 ms
 6  ix-ae-13-0.tcore1.tnk-toronto.as6453.net (64.86.33.5)  18.630 ms  18.815 ms  26.182 ms
 7  ae-6.a00.toroon02.ca.bb.gin.ntt.net (129.250.9.170)  25.922 ms  21.858 ms  20.761 ms
 8  ae-8.r21.nwrknj03.us.bb.gin.ntt.net (129.250.2.141)  43.148 ms  37.678 ms  32.937 ms
 9  ae-1.a01.nycmny17.us.bb.gin.ntt.net (129.250.4.175)  40.542 ms  34.209 ms  37.050 ms

Try letting that IPv6 traceroute run for a few minutes?

You can use IPv4 only with your domains, but we won’t generate certificate renewals if you do. You’ll need to set up DNS verification for your certificates.

We only automatically generate certs for domains pointed at IPv6 addresses. IPv6 addresses are unique for all time, so this prevents certificate hijacking.

Here ya go:

traceroute6 to fly-api.usepastel.com (2a09:8280:1::5770) from 2607:fea8:4e20:9200:6810:da3:78d6:3e02, 64 hops max, 12 byte packets
 1  2607:fea8:4e20:9200:e2db:d1ff:fe4d:d60c  4.849 ms  3.929 ms  5.317 ms
 2  * * *
 3  2607:f798:10:10b9:0:672:3122:2237  22.796 ms  15.598 ms  15.565 ms
 4  2607:f798:10:10e0:0:690:6324:9082  17.652 ms
    2607:f798:10:31f:0:2091:4823:3185  18.444 ms
    2607:f798:10:ea45:0:721:3913:6086  13.716 ms
 5  2607:f798:10:359:0:2091:4823:5210  18.275 ms  18.078 ms  20.329 ms
 6  xe-11-0-1.edge2.washington1.level3.net  19.126 ms  37.411 ms  18.369 ms
 7  ntt-level3-toronto1.level3.net  70.798 ms  67.149 ms  76.347 ms
 8  ae-8.r21.nwrknj03.us.bb.gin.ntt.net  73.541 ms  99.623 ms  79.841 ms
 9  ae-1.a01.nycmny17.us.bb.gin.ntt.net  81.012 ms  79.088 ms  83.888 ms
10  2001:418:0:5000::1e13  88.381 ms  121.179 ms  89.290 ms
11  2607:f740:70:101::6  84.390 ms  92.023 ms  84.476 ms

Re: certs, aw man that sounds like a no-go then.

To clarify re: certificates, I was just reading through the docs here (Custom Domains and SSL Certificates · Fly Docs). If I’ve already set up DNS verification, and then I remove the AAAA record, will certificate renewals still happen automatically?

If I’ve already set up DNS verification, and then I remove the AAAA record, will certificate renewals still happen automatically?

Yup, that’ll work! The only thing that would change would be the challenge type. One thing to keep in mind with DNS-01 at the moment, however, is that you’ll want to create certs a few days apart for potentially overlapping _acme-challenge records (like a wildcard and an apex domain).

This will make sure that our DNS has plenty of time to use the correct TXT record for each challenge.
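If you want to see which challenge record is currently visible before kicking off a renewal, you can query the TXT name that Let’s Encrypt looks up for DNS-01. A minimal sketch (the domain is a placeholder):

import { resolveTxt } from "node:dns/promises";

// DNS-01 challenges are published as TXT records at _acme-challenge.<domain>
const domain = "example.com"; // substitute your apex or wildcard base domain
const records = await resolveTxt(`_acme-challenge.${domain}`).catch(() => []);
console.log(records.flat());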

OK I’ll give that a shot while waiting for the IPv6 slowness issue to get sorted. Any ideas yet on what could be causing that one?

Potentially a related issue, potentially not. For context, I’m looking into sources of latency for my app and taking a look at the database I have hosted on DigitalOcean in their TOR1 region. The latency going from the Fly.io YYZ region → DigitalOcean TOR1 region seems to be ~15ms, but going the other way, from DO TOR1 → Fly.io YYZ, it’s ~0.5ms.

Any idea why that would be?

The source here is SSH-ing into the corresponding service on each platform and then pinging its equivalent on the other platform.

DO pinged from fly.io

--- api.usepastel.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 14.990/15.243/15.611 ms

fly.io pinged from DO

--- fly-api.usepastel.com ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5003ms
rtt min/avg/max/mdev = 0.446/0.664/1.452/0.356 ms

Hey folks, I just discovered that one of my fly.io VMs in the YYZ region has ~1ms latency to my Postgres DB (and other services like Redis) hosted in the DigitalOcean TOR1 region, but every other VM has ~15ms latency to the same services.

Some questions:

  • What’s going on here and how do I ensure that all my VMs (or at least the ones for my backend service) have ~1ms latency rather than ~15ms?
  • Are there separate YYZ regions/zones?

Would love some info here, as this really affects my app’s overall latency once the cache & DB requests stack up.
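(For a sense of scale: if a single request does, say, 20 sequential cache/DB round trips, that’s ~300ms of pure network wait at 15ms per hop versus ~20ms at 1ms. The 20 is a made-up number, but that’s the stacking effect I mean.)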

VM with ~1ms latency

❯ fly ssh console -a pastel-frontend -s
Update available 0.0.382 -> v0.0.385.
Run "fly version update" to upgrade.
? Select instance: yyz (fdaa:0:756b:a7b:aa2:6019:cf16:2)
Connecting to [fdaa:0:756b:a7b:aa2:6019:cf16:2]... complete
/ # ping api.usepastel.com
PING api.usepastel.com (174.138.112.93): 56 data bytes
64 bytes from 174.138.112.93: seq=0 ttl=58 time=1.482 ms
64 bytes from 174.138.112.93: seq=1 ttl=58 time=0.782 ms
64 bytes from 174.138.112.93: seq=2 ttl=58 time=0.716 ms
64 bytes from 174.138.112.93: seq=3 ttl=58 time=0.716 ms
64 bytes from 174.138.112.93: seq=4 ttl=58 time=0.721 ms
64 bytes from 174.138.112.93: seq=5 ttl=58 time=0.696 ms
64 bytes from 174.138.112.93: seq=6 ttl=58 time=0.637 ms
^C
--- api.usepastel.com ping statistics ---
7 packets transmitted, 7 packets received, 0% packet loss
round-trip min/avg/max = 0.637/0.821/1.482 ms

VM with ~15ms latency

❯ fly ssh console -a pastel-frontend -s
Update available 0.0.382 -> v0.0.385.
Run "fly version update" to upgrade.
? Select instance: yyz (fdaa:0:756b:a7b:88dc:5b8f:28a:2)
Connecting to [fdaa:0:756b:a7b:88dc:5b8f:28a:2]... complete
/ # ping api.usepastel.com
PING api.usepastel.com (174.138.112.93): 56 data bytes
64 bytes from 174.138.112.93: seq=0 ttl=52 time=15.486 ms
64 bytes from 174.138.112.93: seq=1 ttl=52 time=14.930 ms
64 bytes from 174.138.112.93: seq=2 ttl=52 time=14.926 ms
64 bytes from 174.138.112.93: seq=3 ttl=52 time=15.012 ms
^C
--- api.usepastel.com ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 14.926/15.088/15.486 ms