PostgreSQL database backup & restore

What is the status of PostgreSQL database backup and restore? Is anything planned, or is there an ETA for this?

Many managed database services out there offer some kind of PITR besides automatic snapshots.

I discovered volume snapshots, but they seem to have an RPO of 24 hours?

Also: read replicas don’t protect against human error or malicious events (e.g., DROP TABLE), because such changes propagate to them instantly. (A delayed replica would help, though.)

We could set all of this up ourselves using something like WAL-G (here’s a nice tutorial), but that ties up a lot of resources, especially to get it right and to have recovery tested properly, too.
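For reference, the restore half of such a setup comes down to a handful of PostgreSQL recovery settings. Below is a minimal sketch, assuming PostgreSQL 12 or later (where these settings live in postgresql.conf and recovery is triggered by a `recovery.signal` file), that WAL segments were archived with `wal-g wal-push %p`, and that a base backup has already been restored with `wal-g backup-fetch`. The helper name `pitr_settings` is made up for illustration.

```python
from datetime import datetime, timezone


def pitr_settings(target: datetime) -> str:
    """Render PostgreSQL recovery settings for a point-in-time restore.

    Hypothetical helper: it only builds the config text, it does not
    touch a running server.
    """
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware")
    # Normalize to UTC so the timestamp is unambiguous in the config file.
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    lines = [
        # Ask WAL-G for each archived WAL segment during recovery.
        "restore_command = 'wal-g wal-fetch %f %p'",
        # Replay WAL up to this point in time, then stop.
        f"recovery_target_time = '{stamp}'",
        # Open the database for writes once the target is reached.
        "recovery_target_action = 'promote'",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(pitr_settings(datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)))
```

The hard part is not this config but everything around it: running the archiver next to Postgres, monitoring it, and rehearsing restores regularly.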

Still, one should probably back up to an external provider like AWS S3 for increased resilience.

I think a solid, production-ready DB setup is a pretty important feature for Fly, because when picking a cloud provider everything lives or dies with the database. The location of the database determines everything else, because even small latencies add up tremendously when doing read-write-read patterns, etc.

What’s the current recommendation for disaster recovery? Is there something on the way? What’s your experience with that?

Thanks in advance!


Doh, I had a draft reply all written up for you and never posted it. Here it is!


Delayed read replicas are your best bet, and what we suggest. We don’t do this automatically, but you can configure one cluster to be a delayed read replica of a second cluster. This would make a good doc! Stolon includes some settings for creating clusters that are delayed replicas of other clusters, so you can probably run two Fly.io Postgres apps in this config: stolon/standbycluster.md at master · sorintlab/stolon · GitHub
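What a delayed replica buys you is a time window. With PostgreSQL's `recovery_min_apply_delay` set on the standby, a change committed on the primary at time T is not applied on the replica before T plus the delay, so a bad `DROP TABLE` can still be caught there. A rough sketch of that arithmetic, assuming synchronized clocks and a 4-hour delay picked purely for illustration; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def replay_deadline(committed_at: datetime, apply_delay: timedelta) -> datetime:
    """Earliest time a delayed replica applies a change committed at `committed_at`."""
    return committed_at + apply_delay


def time_left(committed_at: datetime, apply_delay: timedelta, now: datetime) -> timedelta:
    """How long is left to intervene before the replica replays the change."""
    remaining = replay_deadline(committed_at, apply_delay) - now
    # Once the deadline has passed there is nothing left to save.
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    dropped_at = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
    noticed_at = dropped_at + timedelta(hours=1)
    print(time_left(dropped_at, timedelta(hours=4), noticed_at))
```

Within that window you would pause replay on the replica (for example with `SELECT pg_wal_replay_pause();` on PostgreSQL 10 or later) and copy the data out. A longer delay gives more time to react but makes the replica staler.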

Our long-term goal is to “give” Postgres to a company like Supabase that is focused on a really nice dev UX for Postgres itself. The launch story would be the same, but their tooling would handle things like point-in-time restores, forks, etc. You can use them right now with Fly apps, if you want.

The plumbing we build is meant to be general purpose. We’ve made it easy to create Postgres, but probably won’t ship Postgres-specific infrastructure. We will be exposing volume snapshot settings soon, though, and potentially incremental volume snapshots.


Hey Kurt,

Thank you for your comprehensive answer. It helps us move forward and set up a solid strategy.

I have to qualify my statement about latency, though: latency between AWS and Fly in some regions (e.g., fra) seems perfectly sufficient for splitting hosting between providers, in case that’s what someone wants to do.

Yes, good point! There are many regions with <1ms latency between us and AWS. Generally, if they’re in the same city they work really well.

Are there options for me to set up point-in-time restores for Fly’s Postgres? All of the off-the-shelf tools I see want to either run an agent on the Postgres server or SSH into it.

Could you elaborate on that? Are you suggesting we look at Supabase for Postgres hosting? What would that mean for egress traffic costs? I would love to know more about your long-term Postgres strategy, if you have any big plans :pray:

Yes, I would also like to know about this “giving” of Postgres. Surely that would affect latency between our apps and the DB if the DB is hosted elsewhere, i.e., at Supabase?

I’m currently evaluating Heroku alternatives, so I need to know things like this.

@kurt @nickluger, it seems that it’s possible to create volume snapshots manually, so a “simple” app could apply a custom backup strategy.

For example, create snapshots every 6 hours and keep snapshots only for the last three months.

That’s if snapshots are not removed automatically.
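The decision logic for such an app is small. Here is a sketch of just that part, assuming a 6-hour interval and treating “three months” as 90 days. Actually creating and deleting snapshots is left out on purpose, since it depends on whatever snapshot API or CLI is available to you; all names below are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

SNAPSHOT_INTERVAL = timedelta(hours=6)
RETENTION = timedelta(days=90)  # "three months", approximated


def snapshot_due(last_snapshot: Optional[datetime], now: datetime) -> bool:
    """True if no snapshot exists yet or the last one is at least one interval old."""
    return last_snapshot is None or now - last_snapshot >= SNAPSHOT_INTERVAL


def expired(snapshots: List[datetime], now: datetime) -> List[datetime]:
    """Return the snapshot timestamps that fall outside the retention window."""
    cutoff = now - RETENTION
    return [taken_at for taken_at in snapshots if taken_at < cutoff]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    history = [now - timedelta(days=120), now - timedelta(hours=7)]
    print("take snapshot:", snapshot_due(max(history), now))
    print("delete:", expired(history, now))
```

Note that the worst-case data loss with this scheme equals the snapshot interval, so 6 hours here.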

Yes, that’s possible.

For us, it’s not an acceptable solution, though: we need a shorter RPO and cannot tolerate a worst-case data loss of 6 hours. It depends on the use case, of course.

We decided to go with AWS RDS in the end, as it offers PITR with an RPO of 5 minutes, fully managed, for a reasonable price. Latency is not as low as within the Fly network, but it is sufficient for us.