Better understanding best practices for HA for both web apps and PG apps

After reading through high-availability-and-global-replication I am still a little confused about what HA means. From what I can tell, the “default” HA is done by adding a second machine in the same region - but what if that region is down, or there are issues with the proxy preventing anyone from connecting to it?

Is there any documentation on the best production topology, with explanations of how failovers work? It seems a few things could go wrong:

Volume goes down
Machine goes down
Region goes down

I am basically looking to better understand how to have a web app (Rails) and a PG app, both running on V2, set up in the best way possible to handle any potential outages.

Thanks in advance!

Your application is basically two Fly apps, one for Rails and another for Postgres.

Choose two regions you care about, and pick one as the primary region.

  1. Launch a highly available Postgres cluster in the primary region and add a replica in the secondary region, following High Availability & Global Replication · Fly Docs
  2. Start at least one machine for your Rails app in each region.

To be clear, considering your comment at Increasing Apps V2 availability - #6 by danwetherald, you don’t need standbys and you won’t get them when deploying your app either.

Just be sure to scale up your Rails app to run in multiple regions.
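A minimal sketch of those two steps plus the scaling, assuming hypothetical app names (“myapp” for Rails, “myapp-db” for Postgres) and ORD/IAD as the two regions; the flags shown can vary by flyctl version, so double-check with `fly <command> --help` before relying on them:

```
# Hypothetical names and flags - verify against your flyctl version.

# 1. HA Postgres cluster in the primary region, plus a replica in the secondary region
fly postgres create --name myapp-db --region ord --initial-cluster-size 3
fly machines clone <pg-machine-id> --app myapp-db --region iad

# 2. Attach the database to the Rails app (this sets DATABASE_URL on it)
fly postgres attach myapp-db --app myapp

# Run at least one Rails machine in each region
fly scale count 2 --region ord --app myapp
fly scale count 1 --region iad --app myapp
```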

I am a little confused by this strategy. It will load balance connections across the primary and secondary regions, but the web servers in the secondary region will see slower response times, since they will be connecting across regions (back to the primary region) to reach the PG server. (This is assuming there is no read replica configuration.)

Can you also explain the strategy of having multiple machines running in the same region? More specifically, how do apps with mounted volumes (like PG servers) work compared to web servers without mounts, and when is the second machine used? Are requests load balanced across the two machines? What happens when one fails, how do these fail in general, and what causes only one of the two to fail?

I still do not have a clear understanding of how things fail, how they recover, and how to best set the app up to handle the recoveries. It feels like there might not be enough abstraction here when it comes to High Availability. There really doesn’t seem to be a clear path to running a perfect HA app with optimal performance. I appreciate the level of customization available for deployments, but at the same time I would also appreciate the ability to simply say I want an HA app that runs in ORD while also being okay with it running in IAD if shit hits the fan :slight_smile:

@dangra - Something I just thought of, is this problem solved with the private network DNS using top2.nearest.of? Meaning that if we have a backup DB running in IAD and ORD is where our primary DB is running, the ORD web apps should connect to the ORD DB 100% of the time when the ORD DB is healthy, correct?

If that is correct, where I am still a little confused / concerned is: what about the secondary web servers running in IAD? If we have web servers running in both ORD and IAD, the traffic is probably going to be load balanced across those two regions, meaning that half of the requests (the ones routed to IAD) will be a lot slower, because they all have to connect to the healthy primary DB in ORD.

We will of course be using the fly-ruby gem to manage the use of read replicas and write replays, but we would love to better understand exactly how everything works and what the best practices are for spinning up an app that is resilient to regional outages, hardware outages, and app failures.

@dangra did you happen to have any more thoughts on my comments above? Still not 100% clear on the best way to have an HA Rails app deployed on Fly using machines.

Thanks again!

Within the context of Postgres, a PG member can fail for a variety of reasons: host failure, faulty disk, OOM, misconfiguration, etc. As long as you’re running a 3-member cluster and the cluster was healthy at the time of the failure, it should issue a failover and recover without issues.

Here is a specific example of a PG deployment:
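In broad strokes (the app name, machine IDs, and flags below are placeholders - adjust them for your own setup), a deployment along those lines could be built like this:

```
# Placeholder names/IDs; flags may differ by flyctl version.

# Three pg_flex members in ORD for in-region HA
fly postgres create --name myapp-db --region ord --initial-cluster-size 3

# Read replicas in other regions (e.g. IAD and LHR), added by cloning an existing member
fly machines clone <ord-machine-id> --app myapp-db --region iad
fly machines clone <ord-machine-id> --app myapp-db --region lhr
```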

I think the most confusing part of HA on Fly is regional HA - my biggest concern at the moment when hosting on Fly is a region going down (does this happen?). I basically want to see the status page show issues in ORD and not have to be worried at all or need to manually fix any issues on a Saturday at 4am.

Basically, with this setup, what are the “odds,” for lack of a better term, that ORD will fail to the point where none of the ORD machines can respond to both reads and writes without manual intervention by us to migrate away from ORD to IAD?

I should also add, from my understanding, the only time IAD and LHR are used in this setup is if we have read replication configured in our Rails middleware and we are reading - is that correct?

This is automatically taken care of if we use the default attach DATABASE_URL, correct? Which, from my understanding, will result in a top2.nearest.of DNS record?

I think the most confusing part of HA on Fly is regional HA - my biggest concern at the moment when hosting on Fly is a region going down (does this happen?). I basically want to see the status page show issues in ORD and not have to be worried at all or need to manually fix any issues on a Saturday at 4am.

Basically, with this setup, what are the “odds,” for lack of a better term, that ORD will fail to the point where none of the ORD machines can respond to both reads and writes without manual intervention by us to migrate away from ORD to IAD?

It’s not likely to happen, but there’s always a possibility that it can happen.

This is automatically taken care of if we use the default attach DATABASE_URL, correct? Which, from my understanding, will result in a top2.nearest.of DNS record?

This would depend on whether or not you have flycast enabled (it’s enabled by default for newer setups).

Flycast should offer better reliability in the event of an outage, as it ensures connections are only routed to machines that are visible from the context of the fly-proxy. So reads should continue to function in the event ORD is lost, but writes will fail until your primary recovers.
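For concreteness, the connection string you end up with takes one of two general shapes; the app name, credentials, and port below are placeholders, and the exact value depends on how and when the database was attached:

```
# Flycast: connections go through fly-proxy, which only routes to machines it
# currently sees as healthy
DATABASE_URL=postgres://myapp:<password>@myapp-db.flycast:5432/myapp

# Private-network DNS: resolves directly to the two nearest Postgres members
DATABASE_URL=postgres://myapp:<password>@top2.nearest.of.myapp-db.internal:5432/myapp
```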

There are some obvious risks in failing over to LHR in the event ORD goes down. The big one is that you may have very little insight into how much data you’d lose by issuing that failover. Whether or not to perform the regional failover would depend heavily on your use case and how okay you are with potentially significant data loss.

I think this might be the main point of confusion: what would take down a cluster that runs all 3 machines in the same region with the flex image? Can you describe an example of something that would take all writes down? Maybe a reference to a previous event from the status page?

At the end of the day, I guess it just feels weird to have all the machines running in the same region to get true HA - maybe we are just overthinking it?

I think this might be the main point of confusion: what would take down a cluster that runs all 3 machines in the same region?

There are a number of ways a regional outage can occur, ranging from natural disasters to human error. Some searching online will lead you to more examples than I can provide here.

At the end of the day, I guess it just feels weird to have all the machines running in the same region to get true HA - maybe we are just overthinking it?

It totally depends on what your requirements are and the trade-offs you’re willing to accept. It might be helpful to familiarize yourself with the CAP and PACELC theorems to get some additional insight into what these trade-offs are and why they exist.

If you have a set of requirements you can share, folks here should be able to make some recommendations for you and tell you whether or not we can meet those requirements.

For what it’s worth, I don’t usually think it’s worth the effort to make cross-region HA work. I would guess it’s 100x more likely that one of your disks fails than one of your regions fails. So the “value” of HA within one region is pretty high.

@kurt - That makes sense and is honestly what I have been trying to wrap my head around. It sounds like we should be able to sleep pretty well at night with 3+ machines running the pg flex image in a single region. The only reason for machines outside the primary region should be for read replication to increase performance, not HA.

Just curious, do entire regions ever lose connectivity due to issues caused by the proxy or backhaul?
