This is only possible on fly...

tj1 · April 7, 2023, 6:45pm

Yah, how true…hardly anyone posts when something is working.

Haha, no. I don’t have the required brains to be able to use JavaScript frameworks successfully. Just using Elixir/Phoenix LiveView. There’s a lot of optimistic UI, but there are quite a few data changes/calculations that need to be done server-side. Users were previously on desktop, so there’s a certain expectation of speed.

That was my initial plan last August, but after reviewing stolon setup and replication…I had no confidence I could successfully recover from a failure. PG replication has always been a bit of a pain and fly postgres is a layer of abstraction that is not quite managed, so it seemed like an unnecessary risk at time of review compared to the alternative.

So, instead of that, in preparation for moving to fly, I did the following:

moved several GB of read-only data out of postgres and into sqlite
baked these sqlite DBs into the docker image. This made the image about 900MB, but it makes management much easier.
set up all the elixir nodes in a cluster (libcluster, available on Hex)
wrote a simple distributed cache to manage data from the master (only took about 4 hours, to be honest)
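To give a rough idea of the read-only sqlite part: this is not the actual code from the setup above (that is Elixir), just a minimal Python sketch of the same idea. The `reference` table and the `open_readonly` and `lookup` names are made up for illustration. It opens a database file in read-only mode, so an accidental write fails instead of changing a file that is baked into the image.

```python
import sqlite3
from pathlib import Path


def open_readonly(path):
    """Open an existing SQLite file read-only using a URI filename."""
    # mode=ro makes SQLite reject writes and refuse to create a missing file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def lookup(conn, key):
    """Fetch one value from a hypothetical `reference(key, value)` table."""
    row = conn.execute(
        "SELECT value FROM reference WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```

Any write attempted through a connection opened this way raises `sqlite3.OperationalError`, which is a cheap safety net for data that should only ever change when a new image is built.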

After all these changes, the load on the primary was < 1 qps and even less on writes.

For the rest of the data, I have been waiting on litefs. I wrote a library to send writes to primary as fly-replay can’t be used with liveview. Converted my staging environment a few days ago using the fly multi-tenant consul and just waiting for litefs 0.4 to drop to start working on swapping over production.
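For anyone wondering what sending writes to the primary can look like: the library itself is Elixir and is not shown here, so the following is only a Python sketch of the general idea, with assumptions labeled. As far as I know, LiteFS puts a `.primary` file in its mount directory on replicas, containing the hostname of the current primary, and that file is absent on the primary itself. The `route_write` function and its `run_locally` and `forward` callbacks are hypothetical names.

```python
from pathlib import Path


def primary_hostname(mount_dir):
    """Return the primary's hostname, or None if this node looks like the primary.

    Assumption: a missing `.primary` file means the local node holds the lease.
    """
    try:
        return (Path(mount_dir) / ".primary").read_text().strip()
    except FileNotFoundError:
        return None


def route_write(mount_dir, run_locally, forward):
    """Run the write here if we are primary, otherwise hand it to `forward`."""
    host = primary_hostname(mount_dir)
    if host is None:
        return run_locally()
    # `forward` is a hypothetical callback, e.g. an RPC to the primary node.
    return forward(host)
```

In a clustered Elixir app the `forward` step would presumably be a call to the node on the primary rather than an HTTP hop, which is why the lack of fly-replay support in liveview is not a blocker.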

Litefs failover is comprehensible and recovery is straightforward. The only tricky bit is consul and whether to use my own cluster or the multi-tenant one provided by fly. (Aside: it would be nice to not lock replies after 7 days, as then posts could be updated with relevant information. It would be handy to update “Small consul cluster + litefs failover” with the setup.)

@benbjohnson has said litefs can handle on the order of 10’s of writes per second due to the fuse overhead. I thought it would be more since sqlite can handle between 4k and 40k writes per second on a single thread, but even 10’s of writes per second is fine.

In my testing, litefs has performed quite well at 0.3 with the consul lease. On deploys, writes are disabled for less than a few seconds as failover happens. You may need more features, but I have been rolling my own backups (for pg and now sqlite) and don’t need the write proxy that is available in 0.4. It obviously depends on your application and your data patterns, but I’ve been using it in production for read-only data for 1+ years and it has performed better than postgres did.

So my recommendation is to take a serious look at sqlite+litefs rather than trying to do multi-region postgres.