Does Postgres regional failover happen automatically?

shugel · April 24, 2023, 12:34pm

I have a pretty standard Fly setup: HA Postgres in one region, with a read replica in an adjacent region, with app instances to match (2 primary + 1 backup). I have no other volumes.

If my primary region goes down, my secondary region remains available, but I assume it will be in read-only mode: that is, my single read replica isn’t elected as a leader (is this assumption correct?).

Is there a way to configure my cluster to automatically fail over to my backup region, e.g. by running a second read replica in the backup region, so there are at least theoretically enough members to create a HA cluster?

danwetherald · April 25, 2023, 5:45am

I have been trying to better understand this topic as well over here:

https://community.fly.io/t/better-understanding-best-practices-for-ha-for-both-web-apps-and-pg-apps

shaun · April 25, 2023, 1:04pm

If my primary region goes down, my secondary region remains available, but I assume it will be in read-only mode: that is, my single read replica isn’t elected as a leader (is this assumption correct?).

Yep, that is correct. Keep in mind though, if you’re running the flex implementation, you need 3 nodes running within your primary region for HA.
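
To illustrate why a third node matters, here is a minimal sketch. It assumes the cluster requires a strict majority of its members to be reachable before it keeps or promotes a primary; the thread does not spell out the exact rule, and the function names are hypothetical, not part of flyctl or postgres-flex.

```python
def majority_quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority (assumed rule)."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - majority_quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} node(s): quorum={majority_quorum(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under that assumption, a two-node cluster cannot lose either member and still have a majority, while a three-node cluster can lose one.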

Is there a way to configure my cluster to automatically fail over to my backup region, e.g. by running a second read replica in the backup region, so there are at least theoretically enough members to create a HA cluster?

We do not currently support automatic regional failovers. In the event you need to perform a regional failover on a Nomad setup, you can reference: High Availability & Global Replication · Fly Docs

shugel · April 25, 2023, 2:00pm

I migrated to the V2 version of Postgres last week, so I’m not on Nomad anymore, but I’m not sure whether I’m on flex. I’m running image version flyio/postgres:14.6 (v0.0.38), so maybe it is? How can I confirm this?

In any case, the docs explicitly mention 2+ nodes for HA, not 3. It’s not a problem for me to scale up, but that should be mentioned – Postgres V1 created two nodes and called it HA.

We do not currently support automatic regional failovers. In the event you need to perform a regional failover on a Nomad setup, you can reference: High Availability & Global Replication · Fly Docs

Thanks for clarifying. Are there plans to support it? As I mentioned, I’m no longer on Nomad…

shaun · April 25, 2023, 2:20pm

@shugel Yeah, I’m sorry this is confusing.

Stolon setups run the image: flyio/postgres:*
Flex setups run the image: flyio/postgres-flex:*
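
As a quick way to apply the naming rule above, here is a small sketch that classifies an image reference by its repository name. The helper `postgres_flavor` is hypothetical (it is not a flyctl feature), and it assumes the reference has no registry host with a port in front of the repository.

```python
def postgres_flavor(image: str) -> str:
    """Classify a Fly Postgres image reference as 'flex', 'stolon', or 'unknown'."""
    # Keep only the repository part, dropping the tag and anything after it,
    # e.g. "flyio/postgres:14.6 (v0.0.38)" -> "flyio/postgres".
    repo = image.strip().split(":", 1)[0]
    if repo == "flyio/postgres-flex":
        return "flex"
    if repo == "flyio/postgres":
        return "stolon"
    return "unknown"


if __name__ == "__main__":
    # The image string reported earlier in this thread.
    print(postgres_flavor("flyio/postgres:14.6 (v0.0.38)"))
```

By this rule, the `flyio/postgres:14.6 (v0.0.38)` image mentioned earlier in the thread is a Stolon setup, not Flex.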

Are there plans to support it?

We don’t have any immediate plans. It’s actually a pretty hard problem, and the rarity of a regional split/outage makes it tough to prioritize. At the very least, we do want to simplify the process for users who wish to manually fail over into a new region.

danwetherald · April 25, 2023, 3:14pm

@shaun Can you share some documentation on the difference between these two images and the correct way to deploy each image? Also, is there somewhere we can learn more about how these work during failures (both hardware and regional failures)?

shaun · April 25, 2023, 3:21pm

The Postgres Flex implementation is the current default setup, assuming you’re running a recent flyctl version.

This post should provide some context:

Also, is there somewhere we can learn more about how these work during failures (both hardware and regional failures)?

How this implementation responds during hardware and regional failures is going to depend heavily on your setup.

How we handle fencing can be found here, which may or may not answer some of your questions: postgres-flex/docs/fencing.md at master · fly-apps/postgres-flex · GitHub

If you have any specific scenarios in mind, let me know.

danwetherald · April 25, 2023, 3:26pm

Thanks @shaun - I am actually reading through that topic as we speak.

As for describing some of the scenarios, I have had a topic open here: https://community.fly.io/t/better-understanding-best-practices-for-ha-for-both-web-apps-and-pg-apps that explains some of the situations that I am trying to better understand as well as how to best set everything up for HA.

shaun · April 25, 2023, 3:55pm

So I think some of your questions may be a little too broad, making them challenging to answer in a timely manner. Answering how HA works is tough when the context spans Apps, Proxy, Databases, etc. I would recommend trying to come up with some specific failure cases that you’re worried about, or working to define the specific requirements that you’re working with, so folks can help make some recommendations.

danwetherald · April 25, 2023, 4:02pm

Hmm, I had a few other pretty specific questions in there, and I also outlined 3 examples of failures that we would like to feel good about handling. I also have some very specific questions aimed at particular scenarios.

@shaun Should we move this conversation to the thread I referenced?