Fly Postgres Connections to Replicas Slow

Kinda the same issue as a thread I made a while back, but this time I think the cause is different.

Around November 19th, I added a new Postgres replica and app server in ams; before this I was only in iad. Looking at the connect span in Sentry, there is an immediate jump in connection times at that point.


When looking into it further, I found that the connection span took a very long time on instances in ams.

This is just one sample, but I’ve looked at dozens of them and seen similar results, with connect taking at least several hundred milliseconds. Also note that the port used is 5433, which is expected. The logic for determining which port to use is in splashcat/splashcat/settings.py at commit 67763575255fa41fd6f53a56205f4b6ceec343ad in the splashcat-ink/splashcat repo on GitHub.
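
For context, the usual Fly.io pattern, and roughly what that settings.py does (this sketch is not the exact code from the repo), is to pick the primary port 5432 when the app runs in the primary region and the replica port 5433 everywhere else:

import os

# Sketch of the port-selection idea, assuming the standard Fly.io setup:
# FLY_REGION is set automatically on every Machine, while PRIMARY_REGION
# is assumed to come from the app's own config, as in Fly's multi-region
# Postgres guides.
FLY_REGION = os.environ.get("FLY_REGION")
PRIMARY_REGION = os.environ.get("PRIMARY_REGION")

if FLY_REGION and PRIMARY_REGION and FLY_REGION != PRIMARY_REGION:
    DATABASE_PORT = 5433  # local read replica
else:
    DATABASE_PORT = 5432  # writable primary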

A similar request in iad used port 5432 as expected and had a much more reasonable connect time of about 50ms.

I’m not really sure what to do. I’ve tried destroying and recreating both the app server and the Postgres replica in ams, thinking that maybe one of the hosts was having issues, but that didn’t help at all. The machine IDs are e2860e0a563d28 for the app server (this might change with deployments because of the bluegreen strategy?) and e82d921f743508 for the Postgres replica.

Well, I tried updating Postgres with fly image update in hopes that it’d help. That was a mistake. I now don’t have a primary, and I have no idea what I’m doing.

Well, I think I fixed Postgres exploding (but I don’t think I fixed the slow connecting). Maybe. That was a fun experience.

Is the connection from the application server to the database in the private network?

1 Like

Yeah, it’s using a Flycast address, which is what flyctl gave me when I made the cluster.

Ouch… It looks like there still is at least an 800ms difference for European users…

$ fly console --region ams
# time curl -i 'https://splashcat.fly.dev/battles/7011/' | fgrep -i fly-
fly-region: ams
fly-request-id: 01HGR0J32E5Q53KW2YZS4SR22C-ams

real    0m1.292s

# time curl -i -H 'fly-prefer-region: iad' 'https://splashcat.fly.dev/battles/7013/' | fgrep -i fly-
fly-region: iad
fly-request-id: 01HGR0NQCT4H01PV8YZBBVDV59-ams

real    0m0.445s

Compared to…

$ fly console --region iad
# time curl -i 'https://splashcat.fly.dev/battles/7017/' | fgrep -i fly-
fly-region: iad
fly-request-id: 01HGR0VDY5AQE0P4G2YN3X129S-iad

real    0m0.301s

And (somewhat less reliably)…

$ fly curl 'https://splashcat.fly.dev/battles/7019/'
REGION  STATUS  DNS     CONNECT  TLS      TTFB      TOTAL
ams     200     3.9ms   4.2ms    430.6ms  1812.2ms  1813.6ms
dfw     200     7.6ms   7.8ms    180.1ms  793.2ms   921.5ms
ewr     200     4.3ms   4.6ms    92.9ms   411.5ms   431.2ms
fra     200     5.4ms   5.5ms    405.2ms  1860.9ms  1884.6ms
iad     200     2.8ms   3ms      63ms     557.1ms   559.2ms
lax     200     6.7ms   6.9ms    271.4ms  919.2ms   1159.7ms
lhr     200     80.3ms  80.8ms   546.9ms  1855ms    1874.6ms
mia     200     2.4ms   2.5ms    197.5ms  891.6ms   1076.1ms
nrt     200     4.6ms   4.8ms    724.5ms  1075.6ms  1689.6ms
ord     200     5.3ms   5.4ms    222.9ms  773.5ms   869.1ms
sjc     200     3.4ms   3.6ms    328.3ms  921.3ms   1162.9ms

If you haven’t tried it already, it might make sense to SSH in to the replica and see how a direct connection to its unix-domain socket might work out:

$ fly ssh console -a splashcat-db --region ams
# echo "$FLY_REGION"  # double-check
# su postgres
# time psql -p 5433 -c "select state, count(*) from pg_stat_activity group by state"

If even that takes hundreds of milliseconds, then it may be that your tweaks from July didn’t end up carrying over to the new node…

1 Like

Didn’t know this was a thing (my Postgres knowledge is very limited). When I run it I get a much more reasonable time, so I think it has to be something with either my app or the network between the two hosts in ams.

% fly ssh console -a splashcat-db --region ams
Connecting to fdaa:2:4f68:a7b:13b:a04e:cabe:2... complete
root@e82d921f743508:/# su postgres
postgres@e82d921f743508:/$ time psql -p 5433 -c "select state, count(*) from pg_stat_activity group by state"
 state  | count 
--------+-------
        |     4
 active |     1
 idle   |     6
(3 rows)


real	0m0.046s
user	0m0.025s
sys	0m0.009s
postgres@e82d921f743508:/$ echo $FLY_REGION
ams

I tried running a similar command from an app server in ams, and it has an 800ms delay.

root@e2860e0a563d28:/code# export PGPASSWORD=(meow)
root@e2860e0a563d28:/code# time psql -h "splashcat-db.flycast" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
 state  | count 
--------+-------
        |     4
 active |     1
 idle   |     6
(3 rows)


real	0m0.800s
user	0m0.035s
sys	0m0.009s
root@e2860e0a563d28:/code# echo $FLY_REGION
ams

Also, out of curiosity, I ran this same command through the .internal host instead, so Flycast doesn’t get involved, and that seems to help.

root@e2860e0a563d28:/code# time psql -h "splashcat-db.internal" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
 state  | count 
--------+-------
        |     4
 active |     1
 idle   |     8
(3 rows)


real	0m0.264s
user	0m0.027s
sys	0m0.011s

I don’t know if I can eliminate Flycast though, as to my knowledge it was created for Postgres?
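
If you wanted to rule Flycast out temporarily, one option would be to point the app at the region-qualified .internal name instead. A hedged sketch with hypothetical variable names (the repo’s settings may be organized differently), and with the caveat that this gives up Flycast’s health-aware routing:

import os

# Hypothetical diagnostic configuration, not the repo's actual code:
# bypass Flycast by resolving the local region's replica directly.
# If no replica exists in this region, the name stops resolving, so
# treat this as a test rather than a production fix.
FLY_REGION = os.environ.get("FLY_REGION")

if FLY_REGION:
    DATABASE_HOST = f"{FLY_REGION}.splashcat-db.internal"
else:
    DATABASE_HOST = "splashcat-db.flycast"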

Also ran the same few commands in iad. I tried both ports 5432 and 5433 for this, but only included 5433 below, as the results were practically the same.

root@6e8245ddc49987:/code# time psql -h "splashcat-db.flycast" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
 state  | count 
--------+-------
        |     4
 active |     1
 idle   |     4
(3 rows)


real	0m0.065s
user	0m0.036s
sys	0m0.005s
root@6e8245ddc49987:/code# time psql -h "splashcat-db.internal" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
 state  | count 
--------+-------
        |     4
 active |     1
 idle   |     4
(3 rows)


real	0m0.040s
user	0m0.022s
sys	0m0.013s
root@6e8245ddc49987:/code# echo $FLY_REGION
iad

These were really good things to try, :black_cat:… I’m wondering what they could actually be resolving to now, though…

How do the following compare, when run on an app server in ams?

# dig +short AAAA splashcat-db.internal  # should have multiple results
# dig +short AAAA top1.nearest.of.splashcat-db.internal
# dig +short AAAA iad.splashcat-db.internal
# dig +short AAAA ams.splashcat-db.internal

(On Debian, dig is in the dnsutils package.)

1 Like

root@e2860e0a563d28:/code# dig +short AAAA splashcat-db.internal
fdaa:2:4f68:a7b:13b:a04e:cabe:2
fdaa:2:4f68:a7b:1db:726c:5eae:2
fdaa:2:4f68:a7b:1dc:6a63:49ed:2
fdaa:2:4f68:a7b:22c:581d:9e34:2
root@e2860e0a563d28:/code# dig +short AAAA top1.nearest.of.splashcat-db.internal
fdaa:2:4f68:a7b:13b:a04e:cabe:2
root@e2860e0a563d28:/code# dig +short AAAA iad.splashcat-db.internal
fdaa:2:4f68:a7b:1db:726c:5eae:2
fdaa:2:4f68:a7b:1dc:6a63:49ed:2
fdaa:2:4f68:a7b:22c:581d:9e34:2
root@e2860e0a563d28:/code# dig +short AAAA ams.splashcat-db.internal
fdaa:2:4f68:a7b:13b:a04e:cabe:2

The IP fdaa:2:4f68:a7b:1db:726c:5eae:2 in iad is running barman.

1 Like

This part is correct, at least. (It noticed that Amsterdam is closest to Amsterdam, for example.)

Flycast really should be helping you, though, instead of adding 600ms…

How about the following, from an ams web-app machine again?

psql -h "splashcat-db.flycast" -p 5432 -c "select inet_server_addr()" -U postgres

(I get an fdaa:* that’s on the .internal list, in the analogous scenario over here.)

1 Like

root@e2860e0a563d28:/code# echo $FLY_REGION
ams
root@e2860e0a563d28:/code# psql -h "splashcat-db.flycast" -p 5432 -c "select inet_server_addr()" -U postgres
Password for user postgres: 
        inet_server_addr         
---------------------------------
 fdaa:2:4f68:a7b:1dc:6a63:49ed:2
(1 row)

The IP here does match the primary’s IP, which is expected. I decided to also run it with port 5433 out of curiosity and got an IPv4 address for some reason??

root@e2860e0a563d28:/code# psql -h "splashcat-db.flycast" -p 5433 -c "select inet_server_addr()" -U postgres
Password for user postgres: 
 inet_server_addr 
------------------
 172.19.160.66
(1 row)

1 Like

Oops, you are right, I really did want that 5433 there… I don’t yet know why it gives an IPv4 address in this situation, but I do see that same effect.

Overall, the idea is to try to confirm that it’s connecting to the Amsterdam replica, as opposed to getting sent all the way across the Atlantic. So, an IPv4 address might turn out to be just as useful.

If you SSH in to the various Postgres machines and run ip -4 addr show eth0 inside, then perhaps you can distinguish the two continents that way?

(On Debian, ip is in the iproute2 package, and the man page for this sub-command is man ip-address.)

1 Like

Seems like it’s being sent to the correct machine.

root@e82d921f743508:/# ip -4 addr show eth0 
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420 qdisc pfifo_fast state UP group default qlen 1000
    inet 172.19.160.66/29 brd 172.19.160.71 scope global eth0
       valid_lft forever preferred_lft forever
    inet 172.19.160.67/29 brd 172.19.160.71 scope global secondary eth0
       valid_lft forever preferred_lft forever
root@e82d921f743508:/# echo $FLY_REGION
ams

I did run the same command on every machine in the cluster, just to make sure the IPv4 address wasn’t something that’s always the same. I was worried it might be something weird inside the container, since Fly.io networking is usually IPv6.

I feel like at this point it has to be something with Fly.io’s networking in ams, especially with Flycast. But even using the .internal address above in ams still had higher latency compared to using .internal in iad. Never mind: I just reran this, and the response time is the same as in iad. So it’s just something with Flycast in ams.

root@e2860e0a563d28:/code# time psql -h "splashcat-db.internal" -p 5433 -c "select state, count(*) from pg_stat_activity group by state" -U postgres
 state  | count 
--------+-------
        |     4
 active |     1
 idle   |     8
(3 rows)


real	0m0.048s
user	0m0.031s
sys	0m0.005s
root@e2860e0a563d28:/code# echo $FLY_REGION
ams

I’m guessing the slow try in the quoted post was either just random luck, or it got routed to iad for some reason. Rerunning it several times in a row with ams.splashcat-db.internal is pretty consistently around 0.05s.
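
In case it helps with reproducing this, a small script along these lines (hypothetical, not from the repo) can time the connect to both names a few times from an app Machine, so a single lucky or unlucky run doesn’t skew the comparison. It measures only DNS resolution plus the TCP handshake, not the full Postgres authentication:

import socket
import statistics
import time

def connect_times(host, port=5433, runs=5):
    # Each sample covers DNS resolution plus the TCP connect; Flycast
    # proxying shows up here, while Postgres auth/query overhead does not.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        sock = socket.create_connection((host, port), timeout=5)
        samples.append(time.perf_counter() - start)
        sock.close()
    return samples

for host in ("splashcat-db.flycast", "ams.splashcat-db.internal"):
    times = connect_times(host)
    print(f"{host}: median {statistics.median(times) * 1000:.1f}ms over {len(times)} runs")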

1 Like

Can anyone from Fly.io provide some insight on this? I’m pretty sure this is an issue with Flycast, and it’s seemingly outside my control.

1 Like

Once again bumping this hoping to get some insight from someone at Fly.

This may have escaped notice way down here…

:leaves: :leaves: :leaves:

I think a top-level re-post would be fair, emphasizing the Flycast aspect.

(Odds are you’re not the only one affected.)

1 Like

Made a thread :slightly_smiling_face: 800ms intra-region Flycast latency (Amsterdam)

Thanks for all the help with this!!

dumb rambling that i wrote for some reason

Sorry for not really saying thanks sooner >.< I feel like I struggle a lot with knowing when to say it exactly and like, I worry a lot that it’d be weird if I said thanks in like every single post. yeah i don’t know why i wrote this but i did >.< but like really thanks so much for the help with this!!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.