Hi… Yep, there are various official and unofficial recommendations, and they’re all pretty close to what you sketched out above…
The “automatic failover on region outage” that you mentioned is going beyond the implicit minimum† in that older thread, though, and for that you would need primary-candidates in multiple regions.
In my (super-unofficial) view, the “write latency shouldn’t change just because your primary does” aspect is a minor concern, so I wouldn’t shy away that much from having two primary-candidates in two regions, like the proposed `arn` + `ams`. The main disadvantage here is that it’s inconvenient to keep them both always running (which you do need, as you noted) via the Fly.io platform’s default orchestrator: that software is relatively simple-minded compared to what you’re attempting, and it only thinks in terms of a single region being primary, not a more general collection of nodes or locations.
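For concreteness, here’s roughly how a two-candidate setup could sidestep the single-`PRIMARY_REGION` assumption: gate candidacy on an environment variable that your own entrypoint computes. This is just a sketch; the `IS_CANDIDATE` variable and the region list are my invention, though the `lease` fields themselves are standard litefs.yml config:

```sh
# entrypoint.sh (sketch): only arn/ams nodes may ever become primary
case "$FLY_REGION" in
  arn|ams) export IS_CANDIDATE=true ;;
  *)       export IS_CANDIDATE=false ;;
esac
exec litefs mount
```

```yml
# litefs.yml (sketch): LiteFS expands ${IS_CANDIDATE} from the environment
lease:
  type: "consul"
  candidate: ${IS_CANDIDATE}
  promote: true
```

Every node still runs the same image; only the two chosen regions are ever eligible for promotion.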
Briefly, another orchestration nuance to be aware of is how read-only nodes behave when the primary is unreachable. At least one user reported losing read availability, not just write availability, possibly due to the following policy change:
https://community.fly.io/t/postmortem-fly-registry-2023-08-08/14744
If so, that rules out fallback scenarios that people intuitively assume will work, such as continuing to serve (possibly stale) reads from replicas while the primary is down.
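It’s cheap to rehearse that scenario before relying on it: stop whatever Machine is currently primary and check whether a replica still answers reads. A rough sketch (the Machine ID, app name, and endpoint are placeholders):

```sh
fly machines list -a <your-app>                  # identify the current primary's Machine
fly machine stop <primary-machine-id> -a <your-app>
curl -fsS https://<your-app>.fly.dev/some-read-only-path   # do replicas still serve?
fly machine start <primary-machine-id> -a <your-app>       # restore when done
```

(On replicas, LiteFS writes the current primary’s hostname into a `.primary` file in the mount directory, which helps identify which Machine to stop.)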
Not really, although some long-time LiteFS users do recommend having at least one node in a different region, all the same.
(You can browse the Infrastructure Log (“100% fidelity to internal incidents”) to evaluate how often a given region really goes down in a given year.)
(That probability could definitely still be more than your own, specific app could tolerate, of course.)
I don’t run services with constraints that strict, myself, but the official docs mention a bound of a “few milliseconds of write availability loss”. That and the following thread are the only potential gotchas that I know of:
(Note that `rolling` is the only option when you have volumes, apart from the rather abrupt `immediate` mode.)
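For reference, the strategy is selectable per deploy, e.g. (flags as of current flyctl; worth double-checking against `fly deploy --help`):

```sh
fly deploy --strategy rolling     # the default when volumes are attached
fly deploy --strategy immediate   # faster, but replaces everything at once
```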
You can use the all-pairs round-trip time (RTT) tables to get an idea of how much added latency that would be. Those are currently showing 23ms for `arn` ↔ `ams`.
I wouldn’t expect that to cause many problems with replication lag, but it’s always wise to test ahead of time. Also, do keep the regional differences in bandwidth billing in mind.
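One easy way to test: Fly’s internal DNS exposes region-qualified hostnames on the private network, so you can measure the live `arn` ↔ `ams` RTT yourself (assuming your image ships `ping`; the app name is a placeholder):

```sh
fly ssh console -a <your-app> -s      # pick a Machine in arn from the list
ping -c 5 ams.<your-app>.internal     # region-qualified 6PN hostname
```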
Anyway, this didn’t address all of your questions, but hopefully it still helps a little!
†That is a Postgres thread, strictly speaking, but it shows Fly.io’s overall thinking on this topic, particularly in the final paragraph.