LiteFS Cloud - backups and point-in-time restores for LiteFS

Exciting news! We’re about to launch a completely new product, LiteFS Cloud, and we have a preview for you to test out! LiteFS Cloud gives you painless backups and point-in-time restores for your LiteFS databases.

One difficult part about running a LiteFS database cluster in production is figuring out a disaster recovery plan. Until now, you had to build your own: take regular snapshots, store them somewhere, figure out a retention policy, that sort of thing. This also means you can only restore from a point in time when you happen to have taken a snapshot, and you may want to limit how frequently you snapshot for cost reasons.

If you’re already using LiteFS (whether you’re running it on Fly or somewhere else), you can add in LiteFS Cloud and we’ll manage your backups. You’ll be able to restore to any point in time (with 5-minute granularity) in the last 30 days.

For now, LiteFS Cloud is only available in the Fly Dashboard (not via flyctl). Here’s what you can do to try it out:

  • Upgrade your LiteFS to version 0.5.0 (or greater)
  • Create a LiteFS Cloud cluster (in the Fly.io dashboard, LiteFS Cloud section)
  • Make the LiteFS Cloud auth token available to your LiteFS (via a secret, if you’re using Fly.io for your app)

To find out more, check out the updated LiteFS docs.

You can also do restores from the dashboard.

Caveats

This is a super new thing, and there are still some rough edges. Here are some things you should watch out for:

  • Right now, LiteFS only checks in with LiteFS Cloud on write, so you’ll need to write to the database before you see your snapshot restored. For production databases, which tend to have frequent writes, this isn’t a big issue, but if you’re testing it out, it might be confusing! (See the quick sketch after this list.)
  • We haven’t decided on pricing yet. This is a free preview for you to test, but we’ll be charging for it when we launch it for real. If you have opinions on pricing, we’d love to hear them!
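If you’re testing and want to force a check-in, any write will do. A quick sketch (the database path and table here are just placeholders, nothing LiteFS-specific):

import sqlite3

# Perform a throwaway write so LiteFS has something to ship to LiteFS Cloud.
conn = sqlite3.connect("/litefs/app.db")  # placeholder path under the LiteFS mount
conn.execute("CREATE TABLE IF NOT EXISTS _touch (at TEXT)")
conn.execute("INSERT INTO _touch (at) VALUES (datetime('now'))")
conn.commit()
conn.close()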

LiteFS Cloud is what we’re calling it? Neat.

For control-plane-related workloads (low writes, high reads; our data-plane workloads look different), we are currently experimenting with Cloudflare D1 because the free tier (25 billion 4KB reads + 50 million 1KB writes; ref) is essentially enough for 100x our current workload. Beyond the free tier, you get 100 million 4KB reads / 1 million 1KB writes for $1. So, something to keep in mind :slight_smile:

Though, I imagine LiteFS can support plugins like sqlite-vss relatively easily?

If Cloudflare D1 is working well for your use case, we’re probably not going to beat free! But obviously Cloudflare D1 has some limitations that LiteFS doesn’t. :grin:

Yeah, definitely, this is something you can do with LiteFS! It’s just vanilla SQLite, so in general, you can just do normal SQLite things with it (like using plugins). Disclaimer: I haven’t actually tried it with sqlite-vss, but I don’t think there will be any issues.


I haven’t used D1 personally, but it sounds like an interesting platform. It takes a different approach from LiteFS: it’s more like hosted SQLite, whereas LiteFS Cloud is more like managed storage.

That has pros & cons. One benefit of LiteFS is that it runs compute on your own node, so you can install any extensions you want as long as they don’t change the file format (the exceptions here are encryption extensions like SEE or SQLCipher, which do change it).

Another benefit is that, as @darla mentioned, it’s just vanilla SQLite. You can do things like interactive transactions whereas D1 only supports batch transactions.
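To make that concrete, an interactive transaction lets you read, decide in application code, and then write, all inside one transaction. A rough sketch with plain sqlite3 (the path and table are made up for illustration):

import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode, so we
# control the transaction explicitly with BEGIN/COMMIT.
conn = sqlite3.connect("/litefs/app.db", isolation_level=None)
conn.execute("BEGIN IMMEDIATE")
(balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (1,)).fetchone()
if balance >= 100:
    # The write depends on the value we just read; a batch API that takes
    # all statements up front can't do this read-then-decide step.
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = ?", (1,))
    conn.execute("COMMIT")
else:
    conn.execute("ROLLBACK")
conn.close()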

I think one of the main upsides to D1 is that it’s a fully managed platform. You lose control over lower-level parts of the database but in exchange you don’t have to worry about the lower-level parts. :slight_smile:

How has your performance been with D1? I believe they’re built on Durable Objects and there are some limitations with serial write performance. I’ve been wanting to dig into the platform a bit more once they move out of alpha.


LiteFS Cloud has a different cost model since we’re not really serving any compute. It’ll likely be cheaper if you have heavy read volume, since reads don’t affect LiteFS Cloud at all. That’s just local compute. I think it’s a pretty neat model, myself.


Transactions (writes) in particular were dismal before; enough for us to stop experimenting with it. Their latest storage pivot for D1, which has noticeably faster reads but spotty writes, isn’t based on Durable Objects, however (Durable Objects are pretty neat themselves, but apparently were a primary source of the aforementioned slowness). Kenton’s tweets point out that D1’s new storage layer might be similar to LiteFS’s (ref). I’d reckon they are operating at the device layer instead of the VFS layer…

Concur.


Would this remove the need to mount a volume? I notice it’s still in the docs but can’t see much about it. My understanding was that it prevented data loss if all the nodes went down; is there more to it?

For now, you should continue to use volumes – while there actually may be a way to get this working, the current LiteFS Cloud is more for disaster recovery; it’s not exactly intended to be your database’s regular persistent storage. Volumes make more sense for that.

However, we are working on some features to support ephemeral environments. We’re thinking about this for apps running on serverless platforms (Vercel, Lambda, etc.), but those same features should also enable deploying on Fly without using volumes. Keep an eye out for new posts here; we’re expecting to announce some interesting stuff in the coming weeks/months. :grin:


Sounds great, thanks!


Am looking into trying out LiteFS for my app, which is currently using Fly Postgres. With Fly Postgres, I have middleware to replay any non-GET requests to my primary region and to catch ReadOnlySqlTransaction errors, which also get replayed in my primary region. From my understanding, LiteFS’s proxy will only handle the replays for non-GET requests? Is there a way I can continue catching and replaying read-only errors? I could see GET requests causing database writes for stuff like extending a session, which worries me a bit.

Hm, looking into this more and am also not sure how to handle my Celery worker stuff.

Seems like I might be able to run it as a separate process in the VM with this? Running database migrations · Fly Docs

You’re exactly right about how LiteFS proxy works. But, yes, I think you can keep your current middleware implementation! You’ll have to catch a slightly different exception, but generally it should work.

I’m assuming you’re using Python because you mentioned Celery – I just tested it out (attempted write to a LiteFS read-only replica), and I got this, which I assume is what you’ll want to catch:

sqlite3.OperationalError: attempt to write a readonly database

This is a pretty generic exception - if it were me I’d probably also check the text for “readonly”.
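For example, here’s a rough middleware sketch, assuming Django, a PRIMARY_REGION environment variable, and Fly’s fly-replay response header (the names are placeholders; adapt it to whatever your existing Postgres middleware does):

import os
import sqlite3

from django.db import OperationalError
from django.http import HttpResponse


class ReplayReadOnlyWriteMiddleware:
    # Sketch: if a request tried to write on a read-only replica, ask Fly's
    # proxy to replay it in the primary region instead of returning an error.

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_exception(self, request, exception):
        # Django wraps the driver error, so check both, and match on the text
        # since OperationalError covers a lot of unrelated failures.
        if isinstance(exception, (OperationalError, sqlite3.OperationalError)) and "readonly" in str(exception):
            response = HttpResponse(status=409)
            response["fly-replay"] = f"region={os.environ['PRIMARY_REGION']}"
            return response
        return None  # let other exceptions propagate as usual

If your Postgres middleware already sets fly-replay on a 409, this is basically the same pattern with a different exception check.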

Yeah, this is something of a hassle. Some of the features we’re working on may make this easier, but we aren’t that close to releasing them. For now, I think running the workers as another process in the primary node makes sense. You could probably do it similarly to migrations like you’re suggesting – run a script that will start the Celery worker in another process. (Note that it has to be run in another process, otherwise your regular web server will never start: LiteFS waits until the first process finishes before going on to the second.)

Gotcha, seems like this’ll work! Can probably just change my Postgres middleware a bit and make it set a transaction ID cookie like the proxy does.

Not very familiar with bash stuff, but I imagine there’s a way I could write a little script that runs in another thread and checks every second whether the current node is the primary: if it is, start Celery; if it gets demoted, stop Celery if it’s running.

I think that would work!

You could probably also write that script in Python instead of bash. (I personally am not too bad with bash stuff, but I often write scripts in Python because they’re just more readable/maintainable.)

Another option would be to switch to static lease management (instead of using Consul), and then the primary node won’t change – you won’t have to stop/start the worker; it will always run on the same node. (static lease management docs, and example litefs.yml using static lease for reference – the docs are a bit light, but hopefully together with the example it is clear enough).
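Roughly, the lease section of your litefs.yml would look something like this on the node you want to keep as primary (the hostname is a placeholder, and it’s worth double-checking the exact field names against the static lease docs):

lease:
  # Static leasing: this node is always the primary; no Consul needed.
  type: "static"
  # URL other nodes use to reach this node's LiteFS API.
  advertise-url: "http://primary-node.internal:20202"
  # true only on the node that should hold the primary role;
  # set it to false on the replicas.
  candidate: true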


Finally got around to looking into this! Currently kinda just throwing something together with a little Python script that I let Copilot write most of, but a bit stuck with running it.

# every second, check if the `/litefs/.primary` file exists. If it does not,
# then the current node is the primary and Celery should be started, if not
# already running. If the file does exist, a different node is the primary
# and Celery should be stopped if it is currently running.
import os
import time

celery_running = False
while True:
    if os.path.exists('/litefs/.primary'):
        if celery_running:
            print('Stopping Celery')
            # pkill -f matches the full command line; match the command
            # actually used to start the worker below
            os.system('pkill -f "celery -A splashcat worker"')
            celery_running = False
    else:
        if not celery_running:
            print('Starting Celery')
            os.system('celery -A splashcat worker -l INFO -B &')
            celery_running = True
    time.sleep(1)

And the exec section of my litefs.yml:

exec:
  # Only run migrations on candidate nodes.
  - cmd: "poetry run python manage.py migrate"
    if-candidate: true

  - cmd: "poetry run python scripts/run-celery-when-primary.py &"
    if-candidate: true

  # Then run the application server on all nodes.
  - cmd: "daphne -b 0.0.0.0 splashcat.asgi:application"

It seems like it gets stuck with the little Python script despite the &, which I saw suggested on a Stack Overflow question about running stuff non-blocking. Does LiteFS do something special that makes this not work?

LiteFS is running the command and monitoring the process, waiting for it to exit before moving on to the next one – it’s not running in a shell context, if that makes sense (& is a feature of the shell, rather than something about how processes are managed). LiteFS is written in Go, but you can see similar behavior in Python when using subprocess.run (instead of os.system).
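To illustrate in Python (the path is just a placeholder; this shows the blocking behavior, not LiteFS’s actual code):

import subprocess

# Roughly what LiteFS does with each exec entry: run the command directly
# (no shell) and wait for it to exit before moving on. A trailing "&" would
# just be another argument to the program, not a backgrounding instruction.
subprocess.run(["python", "scripts/run-celery-when-primary.py"])

# To start a process and continue without waiting, you need something like
# Popen, which returns immediately while the child keeps running.
subprocess.Popen(["python", "scripts/run-celery-when-primary.py"])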

I think the easiest solution here is to add a wrapper script, which starts the real script and then exits, and use that one in the litefs.yml. I mean just a two-line script, something like:

import os
# os.system runs through a shell, so the "&" backgrounds the real script and
# this wrapper exits right away, letting LiteFS move on to the next command.
os.system('python scripts/run-celery-when-primary.py &')