Kinda the same issue as a thread I made a while back, but this time I think the cause is different.
Around November 19th, I added a new Postgres replica and app server in ams; before that I was only in iad. Looking at the connect span in Sentry, there is an immediate jump in connection times at exactly that point.
A similar request in iad used port 5432 as expected and had a much more reasonable connect time of about 50ms.
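For reference, the app connects through the flycast address that fly postgres attach set up (I think). As I understand it, port 5432 on that address always routes to the primary while 5433 routes to the nearest readable instance, so the two flavors of connection string look roughly like this (password and database name are placeholders):
postgres://postgres:<password>@splashcat-db.flycast:5432/<dbname>   (primary)
postgres://postgres:<password>@splashcat-db.flycast:5433/<dbname>   (nearest readable instance)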
I’m not sure what to do really. I’ve tried destroying and recreating both the app server and Postgres replica in ams thinking that maybe one of the hosts was having issues, but that didn’t help at all. The machine IDs are e2860e0a563d28 for the app server (might change with deployments because of bluegreen? idk) and e82d921f743508 for the Postgres replica.
Well I tried updating Postgres with fly image update in hopes that maybe it’d help or something. That was a mistake. I now don’t have a primary and I have no idea what I’m doing.
Ouch… It looks like there still is at least an 800ms difference for European users…
$ fly console --region ams
# time curl -i 'https://splashcat.fly.dev/battles/7011/' | fgrep -i fly-
fly-region: ams
fly-request-id: 01HGR0J32E5Q53KW2YZS4SR22C-ams
real 0m1.292s
# time curl -i -H 'fly-prefer-region: iad' 'https://splashcat.fly.dev/battles/7013/' | fgrep -i fly-
fly-region: iad
fly-request-id: 01HGR0NQCT4H01PV8YZBBVDV59-ams
real 0m0.445s
Compared to…
$ fly console --region iad
# time curl -i 'https://splashcat.fly.dev/battles/7017/' | fgrep -i fly-
fly-region: iad
fly-request-id: 01HGR0VDY5AQE0P4G2YN3X129S-iad
real 0m0.301s
And (somewhat less reliably)…
$ fly curl 'https://splashcat.fly.dev/battles/7019/'
REGION STATUS DNS CONNECT TLS TTFB TOTAL
ams 200 3.9ms 4.2ms 430.6ms 1812.2ms 1813.6ms
dfw 200 7.6ms 7.8ms 180.1ms 793.2ms 921.5ms
ewr 200 4.3ms 4.6ms 92.9ms 411.5ms 431.2ms
fra 200 5.4ms 5.5ms 405.2ms 1860.9ms 1884.6ms
iad 200 2.8ms 3 ms 63 ms 557.1ms 559.2ms
lax 200 6.7ms 6.9ms 271.4ms 919.2ms 1159.7ms
lhr 200 80.3ms 80.8ms 546.9ms 1855 ms 1874.6ms
mia 200 2.4ms 2.5ms 197.5ms 891.6ms 1076.1ms
nrt 200 4.6ms 4.8ms 724.5ms 1075.6ms 1689.6ms
ord 200 5.3ms 5.4ms 222.9ms 773.5ms 869.1ms
sjc 200 3.4ms 3.6ms 328.3ms 921.3ms 1162.9ms
If you haven’t tried it already, it might make sense to SSH into the replica and see how a direct connection to its unix-domain socket works out:
$ fly ssh console -a splashcat-db --region ams
# echo "$FLY_REGION" # double-check
# su postgres
# time psql -p 5433 -c "select state, count(*) from pg_stat_activity group by state"
If even that takes hundreds of milliseconds, then it may be that your tweaks from July didn’t end up carrying over to the new node…
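While you’re in there, it might also be worth comparing the new replica’s settings against the primary’s, in case the July tuning really didn’t make it over. The parameter names below are only examples — substitute whatever you actually changed:
# psql -p 5433 -c "select name, setting, source from pg_settings where name in ('shared_buffers', 'work_mem', 'effective_cache_size')"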
Didn’t know this was a thing (my Postgres knowledge is very limited). Running it, I get a much more reasonable time, so I think it has to be something with either my app or the network between the two hosts in ams.
% fly ssh console -a splashcat-db --region ams
Connecting to fdaa:2:4f68:a7b:13b:a04e:cabe:2... complete
root@e82d921f743508:/# su postgres
postgres@e82d921f743508:/$ time psql -p 5433 -c "select state, count(*) from pg_stat_activity group by state"
state | count
--------+-------
| 4
active | 1
idle | 6
(3 rows)
real 0m0.046s
user 0m0.025s
sys 0m0.009s
postgres@e82d921f743508:/$ echo $FLY_REGION
ams
Tried running a similar command from an app server in ams and it has an 800ms delay.
root@e2860e0a563d28:/code# export PGPASSWORD=(meow)
root@e2860e0a563d28:/code# time psql -h "splashcat-db.flycast" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
state | count
--------+-------
| 4
active | 1
idle | 6
(3 rows)
real 0m0.800s
user 0m0.035s
sys 0m0.009s
root@e2860e0a563d28:/code# echo $FLY_REGION
ams
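My guess is that most of that 0.8s is connection setup rather than the query itself. From what I’ve read, pg_isready only does the startup handshake, so something like this should show it, assuming the tool is in the image (it ships with the same client package as psql):
# time pg_isready -h splashcat-db.flycast -p 5433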
Also, out of curiosity, I ran the same command through the .internal host instead, so flycast doesn’t get involved, and that seems to help.
root@e2860e0a563d28:/code# time psql -h "splashcat-db.internal" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
state | count
--------+-------
| 4
active | 1
idle | 8
(3 rows)
real 0m0.264s
user 0m0.027s
sys 0m0.011s
I don’t know if I can eliminate flycast entirely though, since as far as I know it was created for Postgres?
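If I did end up wanting to route the app around flycast, I think it would just be a matter of swapping the attach-created DATABASE_URL for the .internal name, something like this (untested sketch — the app name and database name are guesses, and I assume I’d lose whatever primary/replica routing flycast does):
$ fly secrets set -a splashcat DATABASE_URL="postgres://postgres:<password>@splashcat-db.internal:5433/<dbname>"
For now I’d rather figure out why flycast specifically is slow in ams, though.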
Also ran the same few commands in iad. I tried both ports 5432 and 5433 for this but only included 5433 below as the results were practically the same.
root@6e8245ddc49987:/code# time psql -h "splashcat-db.flycast" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
state | count
--------+-------
| 4
active | 1
idle | 4
(3 rows)
real 0m0.065s
user 0m0.036s
sys 0m0.005s
root@6e8245ddc49987:/code# time psql -h "splashcat-db.internal" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
state | count
--------+-------
| 4
active | 1
idle | 4
(3 rows)
real 0m0.040s
user 0m0.022s
sys 0m0.013s
root@6e8245ddc49987:/code# echo $FLY_REGION
iad
Running select inet_server_addr() over port 5432 gives an IP that matches the primary’s, which is expected. I decided to also run it with port 5433 out of curiosity and got an IPv4 address for some reason??
root@e2860e0a563d28:/code# psql -h "splashcat-db.flycast" -p 5433 -c "select inet_server_addr()" -U postgres
Password for user postgres:
inet_server_addr
------------------
172.19.160.66
(1 row)
Oops, you are right, I really did want that 5433 there… I don’t yet know why it gives an IPv4 address in this situation, but I do see that same effect.
Overall, the idea is to try to confirm that it’s connecting to the Amsterdam replica, as opposed to getting sent all the way across the Atlantic. So, an IPv4 address might turn out to be just as useful.
If you SSH into the various Postgres machines and run ip -4 addr show eth0 inside, then perhaps you can distinguish the two continents that way?
(On Debian, ip is in the iproute2 package, and the man page for this sub-command is man ip-address.)
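If you’d rather not open a shell on every machine, I believe fly ssh console can take a machine ID and a one-off command, so something like this per machine should do it (flags from memory, so double-check against fly ssh console --help):
$ fly machine list -a splashcat-db
$ fly ssh console -a splashcat-db --machine <machine-id> -C "ip -4 addr show eth0"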
Seems like it’s being sent to the correct machine.
root@e82d921f743508:/# ip -4 addr show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420 qdisc pfifo_fast state UP group default qlen 1000
inet 172.19.160.66/29 brd 172.19.160.71 scope global eth0
valid_lft forever preferred_lft forever
inet 172.19.160.67/29 brd 172.19.160.71 scope global secondary eth0
valid_lft forever preferred_lft forever
root@e82d921f743508:/# echo $FLY_REGION
ams
I did run the same command on every machine in the cluster, just to make sure the IPv4 address wasn’t something that’s always the same. I was worried it might be something weird inside the container, since Fly.io networking is usually IPv6.
I feel like at this point it has to be something with Fly.io’s networking in ams, especially with flycast. But even using the .internal address above in ams still had higher latency compared to using .internal in iad. Never mind, I just reran it and the response time is the same as in iad. So it’s just something with flycast in ams.
root@e2860e0a563d28:/code# time psql -h "splashcat-db.internal" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
state | count
--------+-------
| 4
active | 1
idle | 8
(3 rows)
real 0m0.048s
user 0m0.031s
sys 0m0.005s
root@e2860e0a563d28:/code# echo $FLY_REGION
ams
I’m guessing the slow run in the quoted post was either just random luck or it got routed to iad for some reason. Rerunning it several times in a row with ams.splashcat-db.internal is pretty consistently around 0.05s.
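The rerunning was basically just a quick loop along these lines (reusing the PGPASSWORD exported earlier), in case anyone wants to reproduce it:
# for i in $(seq 1 10); do time psql -h "ams.splashcat-db.internal" -p 5433 -U postgres -c "select 1" >/dev/null; done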
Sorry for not saying thanks sooner >.< I struggle with knowing when to say it, and I worry it’d be weird to say thanks in every single post. But really, thanks so much for the help with this!!