Hi,
we are seeing timeouts or slow connections from Fly Machines in some regions to our self-hosted PostgreSQL instance (e.g. SIN → Hetzner FRA). Traffic goes through the Fly WireGuard VPN.
The query we issue is quite big: 30 MB data returned and 5-10 secs execution time.
“Good” regions complete the query in 7 s.
“Broken” regions take 30 s or far more.
Do you recommend any special connection options (pg driver / network stack) to mitigate this problem?
Any help is highly appreciated.
For regions as far apart as SIN → FRA, with the database on another provider, a slower or less stable connection is, to be frank, somewhat expected, since all traffic ultimately goes through the public Internet. To determine whether the Internet path is at fault here, you can run mtr from inside your Fly Machines in SIN against the public IP (not the WireGuard VPN IP) of the Hetzner FRA instance.
If you are connecting to the Fly 6PN network using Fly’s WireGuard gateways, keep in mind that they are geared towards one-off access to your production environment from e.g. dev laptops, not high-traffic production use. Every connection through the WireGuard interface has an extra hop in the middle that might limit your throughput. We’d recommend a peer-to-peer VPN such as Tailscale for this purpose instead, as it should be able to establish a direct connection between Hetzner and your Fly Machines rather than going through an extra gateway.
Since your PostgreSQL instance acts as the server (the side that sends more data) of the TCP connections here, there are some tricks you can use to optimize throughput on lossy / long-distance connections, such as selecting a better TCP congestion control algorithm. At Fly, we configure all our servers to use BBR by default.
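To make the congestion control suggestion concrete, here is a minimal sketch you could run on the Hetzner host (the sending side) to see which algorithm is active. It assumes a Linux host with the standard procfs paths; the helper name needs_bbr is made up for illustration, and the printed sysctl / modprobe hints are the usual commands, not something verified against your setup.

```python
from pathlib import Path

# Standard Linux procfs locations for the TCP congestion control settings.
AVAILABLE = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")
CURRENT = Path("/proc/sys/net/ipv4/tcp_congestion_control")


def needs_bbr(current: str, available: str) -> str:
    """Return a short recommendation based on the two procfs values.

    `current` is the active algorithm, `available` is the space-separated
    list of algorithms the kernel currently offers.
    """
    if current.strip() == "bbr":
        return "already using bbr"
    if "bbr" in available.split():
        return "bbr available: sysctl -w net.ipv4.tcp_congestion_control=bbr"
    return "bbr not loaded: try modprobe tcp_bbr first"


if __name__ == "__main__":
    if CURRENT.exists() and AVAILABLE.exists():
        print(needs_bbr(CURRENT.read_text(), AVAILABLE.read_text()))
    else:
        print("not a Linux host, or procfs is unavailable")
```

The setting only affects connections opened after the change, so existing PostgreSQL connections (e.g. in a pool) would need to be re-established before you compare timings.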
From an architectural point of view, I wonder how much of your app depends on the database instance in FRA? If it is only a subset of the endpoints, maybe it would be better if you used fly-replay to transparently redirect only those endpoints to a Fly region closer to Hetzner FRA, instead of directly connecting to the database from everywhere. PostgreSQL generally works best when network latency is low. Conversely, if your app hard-depends on the database for most requests, running more instances in more regions without scaling out the database itself may not actually be that helpful, even with highly optimized TCP connections to the faraway PostgreSQL instance.
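As a rough illustration of the fly-replay idea, here is a minimal WSGI sketch that answers database-heavy requests with a fly-replay response header when the Machine is not running in the region next to the database. The target region "fra", the "/reports" path prefix, and the helper name replay_target are assumptions for the example; FLY_REGION is the environment variable Fly sets on each Machine, and the status code sent alongside the header is just a placeholder here.

```python
import os

# Assumption: the Fly region closest to Hetzner FRA is "fra".
DB_REGION = "fra"
# Hypothetical path prefixes for the endpoints that run the heavy query.
DB_HEAVY_PREFIXES = ("/reports",)


def replay_target(path, current_region):
    """Return the region to replay to, or None to handle the request locally."""
    if not current_region:
        # Not running on Fly (e.g. local development): never replay.
        return None
    if current_region != DB_REGION and path.startswith(DB_HEAVY_PREFIXES):
        return DB_REGION
    return None


def app(environ, start_response):
    """Minimal WSGI app: replay DB-heavy requests, serve the rest locally."""
    region = os.environ.get("FLY_REGION", "")
    target = replay_target(environ.get("PATH_INFO", "/"), region)
    if target is not None:
        # The Fly proxy reads this header and replays the request in that region.
        start_response("200 OK", [("fly-replay", f"region={target}")])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"handled locally\n"]
```

With something like this, only the endpoints that need the 30 MB query pay the cross-region hop once at the HTTP level, while the query itself runs on a Machine that sits close to the database.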
Regarding traffic from the Fly Machines: can we assume that the IPv6 prefixes for one organization are known and constant? I.e. is there a way to whitelist traffic from any Fly Machine of our org in the firewall?
Unfortunately, no. If you need fixed IPs, the only option currently is per-machine egress IP addresses, which come with some limitations because they are bound to machines (for example, blue-green deployments won’t work nicely). We’re currently working on egress IPs that are scoped to apps instead of machines, but I don’t have an ETA on when that will be generally available.