Client Side Postgres High Availability Connection Tips

To better use the Postgres HA system Fly offers, there are a few changes you can make on the application server configuration side:

The libpq Postgres driver supports a way to specify multiple hosts on the connection query string. The default DATABASE_URL specifies a URL of the form postgresql://db-app.internal:5432/dbname — this will point the application to any Postgres instance in any region, which might land at either a primary or replica. If the connection lands on a replica instance, the HA Proxy running in there will proxy the connection to the primary. Changing the port to 5433 will instead serve the query as a read-only replica wherever it lands.

The default approach works well if the Postgres installation is only running in a single region, where it doesn’t really matter much which instance the application connects to. For multi-region deployments, it would make more sense to change the host to top1.nearest.of.chaos-postgres.internal, which is a new endpoint that allows connecting to an instance in the nearest region, instead of an indeterminate region.

Both the DNS based approach have a disadvantage, though — if your application client caches the DNS call to db-app.internal, connections will be inactive a long time in the event that the connected instance fails. Even if the failover happens quickly, your application will still continue to remember the old IP address and try to connect to it for a while.

For clients running on the libpq driver (and some others that have added support) it’s possible to specify multiple hosts — to directly tell your client the primary and the replica addresses, and ask it to choose a suitable read-only or read/write connection when running. By specifying postgresql://host1:123,host2:456/dbname?target_session_attrs=any, your client will always know alternative hosts that it can connect to in the event of a problem with any one of them. The target_session_attrs property allows you to specify that either read-only or read/write connections are acceptable (any) or that the connection must be writable (read-write). This allows you to configure a replica-ok or primary-only application connection respectively.

When multiple hosts IP addresses are configured directly, the DNS caching problem goes away — libpq can now switch to another host immediately when one goes down. And based on the target_session_attrs, it will cycle through the hosts until it finds one that matches the target criteria.

Because Fly instance IP addresses are always tied to a specific volume, you can pre-configure the full list of possible hosts in your DATABASE_URL beforehand / statically. If an instance goes down and is reset, it will come back online with the same IPv6 addresses because it’s still attached to the same volume.

You can also set this up dynamically during application startup by querying the db-app.internal DNS AAAA name using your language’s equivalent of dig. This provides the list of instance addresses that are running your database, and you can set the target_session_attrs according to whether you’re currently configuring the primary or a read replica.

Looking at a few ideas for different scenarios:

  • You have a single region / multi-instance deployment: configure your DATABASE_URL with all the direct IP addresses pointing at 5432 to always get a primary, with target_session_attrs=read-write. For read-replica-ok connections you can do the same with port 5433 and target_session_attrs=any.
  • You have a multi-region/multi-instance deployment, with the same number of instances in every region (say 2): Call dig top2.nearest.of.chaos-postgres.internal at application startup to get the two closest host IP addresses, and configure them as a multi-host string the same way as above. Either way, setting port 5432 will always proxy connections to the primary, and choosing 5433 will answer queries right there as a read-replica.
    • If you want the ability to survive the unlikely loss of an entire region at the cost of some queries going to the next-farthest region, top4.nearest.of.chaos-postgres.internal should work.