Also: read replicas don’t protect against human error or malicious events (e.g., DROP TABLE), since such changes propagate instantly. (Although a delayed replica would help here.)
We could set all this up ourselves using something like WAL-G (here’s a nice tutorial), but that ties up a lot of resources, especially to get it right and have recovery properly tested as well.
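For reference, a do-it-yourself WAL-G setup roughly looks like the sketch below: WAL-G reads its target bucket and credentials from environment variables, Postgres ships WAL segments via `archive_command`, and base backups are pushed periodically. The bucket name, user, and paths here are placeholders, and the exact environment variables you need depend on your storage provider; treat this as an outline, not a tested recipe.

```
# --- environment for wal-g (placeholders, adjust to your setup) ---
export WALG_S3_PREFIX="s3://my-backup-bucket/pg"   # hypothetical bucket
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

# --- postgresql.conf: ship every WAL segment to object storage ---
# archive_mode = on
# archive_command = 'wal-g wal-push %p'

# --- periodic base backup (e.g. from cron) ---
wal-g backup-push "$PGDATA"

# --- disaster recovery: fetch latest base backup, then replay WAL ---
wal-g backup-fetch "$PGDATA" LATEST
# restore_command = 'wal-g wal-fetch %f %p'
```

The hard part is not this configuration but everything around it: credential rotation, retention policies, monitoring that archiving actually succeeds, and regularly rehearsing a full restore, which is why the "binds many resources" caveat above matters.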
Still, backing up to an external provider like AWS S3 is probably worthwhile for increased resilience.
I think a solid, production-ready DB setup is a pretty important feature for Fly, because when picking a cloud provider, everything lives or dies with the database. The location of the database determines everything else, because even small latencies add up tremendously with READ-WRITE-READ patterns and the like.
What’s the current recommendation for disaster recovery? Is there something on the way? What’s your experience with that?
Doh, I had a draft reply all written up for you and never posted it. Here it is!
Delayed read replicas are your best bet, and what we suggest. We don’t do this automatically, but you can configure one cluster to be a delayed read replica of another. (This would make a good doc!) Stolon includes settings for creating clusters that are delayed replicas of other clusters, so you can probably run two Fly.io Postgres apps in this configuration: stolon/standbycluster.md at master · sorintlab/stolon · GitHub
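As a sketch of what the linked stolon doc describes: the second cluster is initialized in the `standby` role, pointing its `primaryConninfo` at the first cluster, and the apply delay is set on the standby. The hostname, user, and password below are placeholders, and the exact field names (in particular `recoveryMinApplyDelay`) should be checked against the stolon version you run:

```
# Initialize the second Fly.io Postgres app's stolon cluster as a
# delayed standby of the first one (placeholder connection details).
stolonctl init '{
  "role": "standby",
  "standbyConfig": {
    "standbySettings": {
      "primaryConninfo": "host=primary-cluster.internal port=5432 user=repluser password=replpass",
      "recoveryMinApplyDelay": "1h"
    }
  }
}'
```

The delay value is the trade-off knob: a longer delay gives you a bigger window to notice and stop a destructive change (like the DROP TABLE case mentioned earlier) before it reaches the standby, at the cost of a staler replica.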
Our long-term goal is to “give” Postgres to a company like Supabase that is focused on a really nice dev UX for Postgres itself. The launch story would be the same, but their tooling would handle things like point-in-time restores, forks, etc. You can use them right now with Fly apps, if you want.
The plumbing we build is meant to be general purpose. We’ve made it easy to create Postgres, but we probably won’t ship Postgres-specific infrastructure. We will be exposing volume snapshot settings soon, though, and potentially incremental volume snapshots.
Thank you for your comprehensive answer; that helps us move forward and set up a solid strategy.
I have to qualify my statement about latency, though: in some regions (e.g., fra), latency between AWS and Fly seems perfectly sufficient for distributing hosting across providers, in case that’s what someone wants to do.
Could you elaborate on that? Are you suggesting we look at Supabase for the Postgres hosting? What would be the consequence for egress traffic cost? I’d love to know more about your long-term strategy for Postgres, if you have any big plans.
For us, though, it’s not an acceptable solution, as we have a shorter RPO, i.e., we cannot tolerate a worst-case potential data loss of 6 hours. That depends on the use case, of course.
We decided to go with AWS RDS in the end, as it offers PITR with an RPO of 5 minutes, fully managed for a reasonable price. Latency is not as low as within the Fly network, but it’s sufficient for us.