I have a pretty standard Fly setup: HA Postgres in one region, with a read replica in an adjacent region, with app instances to match (2 primary + 1 backup). I have no other volumes.
If my primary region goes down, my secondary region remains available, but I assume it will be in read-only mode: that is, my single read replica isn’t elected as a leader (is this assumption correct?).
Is there a way to configure my cluster to automatically fail over to my backup region, e.g. by running a second read replica in the backup region, so there are at least theoretically enough members to create a HA cluster?
If my primary region goes down, my secondary region remains available, but I assume it will be in read-only mode: that is, my single read replica isn’t elected as a leader (is this assumption correct?).
Yep, that is correct. Keep in mind though, if you’re running the flex implementation, you need 3 nodes running within your primary region for HA.
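To illustrate why three nodes matter, here is a minimal sketch of the arithmetic. It assumes a failover decision needs a strict majority of cluster members. That is a common rule in HA setups, but it is my assumption and not something confirmed in this thread for the flex implementation. The function names are hypothetical.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - majority(members)


for n in (2, 3):
    print(f"{n} nodes: majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
# 2 nodes: majority=2, tolerated failures=0
# 3 nodes: majority=2, tolerated failures=1
```

Under that assumption, a two-node cluster cannot lose a single member and still agree on a leader, while a three-node cluster can lose one.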
Is there a way to configure my cluster to automatically fail over to my backup region, e.g. by running a second read replica in the backup region, so there are at least theoretically enough members to create a HA cluster?
We do not currently support automatic regional failovers. In the event you need to perform a regional failover on a Nomad setup, you can reference: High Availability & Global Replication · Fly Docs
I migrated to the V2 version of Postgres last week, so I’m not on Nomad anymore, but I’m not sure whether I’m on flex. I’m running image version flyio/postgres:14.6 (v0.0.38), so maybe it is? How can I confirm this?
In any case, the docs explicitly mention 2+ nodes for HA, not 3. It’s not a problem for me to scale up, but that should be mentioned, since Postgres V1 created two nodes and called it HA.
We do not currently support automatic regional failovers. In the event you need to perform a regional failover on a Nomad setup, you can reference: High Availability & Global Replication · Fly Docs
Thanks for clarifying. Are there plans to support it? As I mentioned, I’m no longer on Nomad…
Stolon setups run the image: flyio/postgres:*
Flex setups run the image: flyio/postgres-flex:*
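If you want to check this in a script, here is a small sketch that applies the two naming rules above to an image reference. The function name is hypothetical, and it only compares the repository part of the string against the two names given here; it does not query Fly in any way.

```python
def classify_postgres_image(image: str) -> str:
    """Return 'flex', 'stolon', or 'unknown' for a Fly Postgres image reference."""
    name = image.strip().split("@", 1)[0]  # drop a digest suffix, if any
    # Drop the tag: the first ":" after the last "/" separates repository and tag.
    colon = name.find(":", name.rfind("/") + 1)
    if colon != -1:
        name = name[:colon]
    if name == "flyio/postgres-flex":
        return "flex"
    if name == "flyio/postgres":
        return "stolon"
    return "unknown"


print(classify_postgres_image("flyio/postgres:14.6"))  # stolon
```

By this rule, the flyio/postgres:14.6 image mentioned earlier in the thread is a Stolon setup, not flex.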
Are there plans to support it?
We don’t have any immediate plans. It’s actually a pretty hard problem, and the rarity of a regional split/outage makes it tough to prioritize. At the very least, we do want to simplify the process for users who wish to manually fail over into a new region.
@shaun Can you share some documentation on the difference of these two images and the correct way to deploy each image? Also, is there somewhere we can learn more on how these work during failures (both hardware and regional failures).
So I think some of your questions may be a little too broad, making them challenging to answer in a timely manner. Answering how HA works is tough when the context spans Apps, Proxy, Databases, etc. I would recommend coming up with some specific failure cases that you’re worried about, or defining the specific requirements you’re working with, so folks can help make some recommendations.
Hmm, I had a few other pretty specific questions in there. I also outlined 3 examples of failures that we would like to feel good about handling, and I have some very specific questions aimed at specific scenarios.
@shaun Should we move this conversation to the thread I referenced?