Postgres High Availability single vs multi region (leader/replica)

Hello, whenever you create a highly available Postgres cluster with flyctl, it creates the two initial volumes in the same region.
Is this the recommended setup from fly? Would it be better to have one node in each region? What would be the disadvantages?

For example, on 18 Feb 2022 the following incident took place:

We are performing emergency hardware maintenance in LAX. We will be taking disk arrays offline to migrate them to a new, more reliable datacenter.

We expect this migration to take 2-4 hours. The impact on applications will vary:

  1. If you are running redundant Postgres, your database will remain online with a single node running for the duration of the maintenance.
  2. If you are running individual volumes with no redundancy, your app may be unavailable during the migration.
  3. If you are not using volumes in LAX, your application will remain online.

Were users with both volumes in LAX unaffected? If so, how? Do you automatically place volumes for the same app in distinct physical arrays? How was this migration done?

Thank you

Good questions! The short answer is: having both volumes in the same region is better because it allows automatic failover. Automatic failover between regions is risky, and not something Postgres is designed for.

Do you automatically place volumes for the same app in distinct physical arrays? How was this migration done?

Yes, this is what we do! We did emergency maintenance on one array, and users’ volumes on other arrays were unaffected.

When you add a volume to a Fly.io app, we default to requiring a unique “zone”. If you want a bunch of cache space and don’t care about redundancy, you can disable this by setting the --require-unique-zone option to false on fly volumes create.

Regional outages are rare. We initially designed Fly.io to handle regional outages for stateless workloads. A few years later, we’ve realized that the cost and UX complexity of automatic region failover are not worth it for full stack app developers. You can build environments that fail over between regions, though; it’s just not the default way our database and app launch processes work.
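
To make the trade-off above concrete, here is a minimal Python sketch of which nodes stay up under a zone outage versus a region outage. It is only a toy model: the Node class, the survivors function, the node names and the zone labels "a" and "b" are all made up for illustration and say nothing about how Fly.io actually schedules volumes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    zone: str  # hypothetical label for the hardware zone/array holding the volume


def survivors(nodes, failed_region: Optional[str] = None,
              failed_zone: Optional[Tuple[str, str]] = None):
    """Return the names of nodes still up after a region or zone outage."""
    up = []
    for n in nodes:
        if failed_region is not None and n.region == failed_region:
            continue  # the whole region is down
        if failed_zone is not None and (n.region, n.zone) == failed_zone:
            continue  # only this zone (e.g. one disk array) is down
        up.append(n.name)
    return up


# Same region, distinct zones (the unique-zone default described above).
unique_zones = [Node("pg-1", "lax", "a"), Node("pg-2", "lax", "b")]
# Same region, same zone (unique-zone requirement disabled).
shared_zone = [Node("pg-1", "lax", "a"), Node("pg-2", "lax", "a")]

print(survivors(unique_zones, failed_zone=("lax", "a")))  # ['pg-2']
print(survivors(shared_zone, failed_zone=("lax", "a")))   # []
print(survivors(unique_zones, failed_region="lax"))       # []
```

In this model, distinct zones in one region ride out a single-array outage like the LAX maintenance, while a full region outage still takes both nodes down. That is the rarer case the default setup does not try to cover automatically.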

@kurt is this still the best strategy with v2 apps?

After reading through high-availability-and-global-replication I am still a little confused as to what HA means. From what I can tell, the “default” HA setup adds a second machine in the same region. What if that region is down, or there are issues with the proxy that prevent anyone from connecting to the region?

Is there any documentation on the best production topology, with explanations of how failovers work? It seems a few things could go wrong:

Volume goes down
Machine goes down
Region goes down

I am basically looking to better understand how to run a web app (Rails) and a PG app, both on V2, set up in the best way possible to handle any potential outages.

Thanks in advance!