LiteFS GET request containing writes + LiteFS with multiple processes

Hi! I’m looking to transition from Postgres in my ongoing project, and I was thinking I could leverage LiteFS to distribute my SQLite DB.

GET request containing writes

The issue I’m having is that I have a GET endpoint which SELECTs data to see if it exists; if it doesn’t, it downloads the data from an external source, INSERTs it, and then returns it. Let’s call it GET /resources/:id. The second time this call is invoked, the data is expected to have been persisted, and so it is returned by the aforementioned SELECT.
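
In simplified form, the handler looks roughly like this (Express; findResource, insertResource and fetchFromExternalSource are made-up placeholder helpers):

import express from "express";

const app = express();

app.get("/resources/:id", async (req, res) => {
  let row = await findResource(req.params.id);               // SELECT
  if (!row) {
    const data = await fetchFromExternalSource(req.params.id);
    row = await insertResource(data);                        // INSERT (the write inside a GET)
  }
  res.json(row);
});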

As per my understanding, because I’m writing inside of a GET, I’m a bit screwed when it comes to fly-replay and proxying writes to the primary node. I don’t think the consumer/caller of this API endpoint should be concerned about this, so I’m at a loss on how to solve this.

Can I make this work with fly-replay, http-proxy or something else, or am I screwed when it comes to distributing my db in this way?

LiteFS with multiple processes

I’m running “multiple processes” with a config like below. Isn’t this a problem when it comes to running the mounted litefs volume? Meaning I’d need some kind of supervisor process to make sure the db is available before those entrypoints/commands are run?

[processes]
  backend = "npm run backend:start"
  worker = "npm run worker:start"

[[vm]]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 256
  processes = ["backend"]

[[vm]]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 256
  processes = ["worker"]

As you can tell, I’m a bit confused by all of this. Any guidance would be very helpful. Thanks!

Hi… No, LiteFS is very flexible, and the answer is rarely “you’re screwed”—but rather “you need to do a little more work.”

In this case, you can emit the fly-replay header from within your own application logic, right before you would have started downloading from the external source. (You can get the location of the current primary in /litefs/.primary.)
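
A rough sketch of what that could look like (untested; assumes Express and the default /litefs FUSE mount, with a made-up function name):

import { readFile } from "node:fs/promises";
import type { Response } from "express";

// /litefs/.primary only exists on non-primary nodes; it contains the primary's hostname
async function replayToPrimaryIfReplica(res: Response): Promise<boolean> {
  try {
    const primary = (await readFile("/litefs/.primary", "utf8")).trim();
    res.set("Fly-Replay", `instance=${primary}`).status(200).end();
    return true;   // the Fly edge proxy re-sends the original request to the primary
  } catch {
    return false;  // no .primary file: this node is the primary, continue with the write locally
  }
}

You’d call this right before the download-and-INSERT branch and return early whenever it replays.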

Alternatively, it might be easier, when you’re just getting started, to do a client-side redirect to a POST instead. This would cost you an extra trip to the client, but it sounds like these won’t be very frequent.

litefs mount is this supervisor; it won’t run any of the other commands that you pass it until it’s contacted the primary and gotten all of the latest snapshots from there, etc.

(This older Rails thread might give you an idea of what’s involved a bit more concretely.)

Hope this helps a little!

Thanks for the assistance!

The flow you’re proposing (no fly-replay)

  1. Client calls GET /resources/:id
  2. Backend replicated somewhere other than the primary accepts this request
  3. Backend reads from db, and understands that this data doesn’t exist in the current read replica, and since it’s a read replica it can’t continue to fetch + write the data
  4. Backend returns a 3xx to a POST /resources/:id, which will commence the download of external data + INSERT afterwards
  5. After the INSERT, this data will now be returned from the POST to the client
  6. Subsequent reads will follow the “ordinary” GET flow, meaning the read replica has this data(?)

The flow you’re proposing (fly-replay)

The difference here is that, instead of returning the 3xx redirect, we call the primary backend node from the replica backend node and receive the response there?

Is this roughly the flow we’re talking about?

Regarding supervisor

So litefs mount will be the Dockerfile entrypoint, and when the [processes] are run, they will run after this entrypoint, as the command? And how does the litefs.yml exec[0].cmd fit into this?

I think this part is still a bit confusing to me. In the rails post you linked to, there is a rails server command, but not a server + a worker.

Thanks again!

Unfortunately, I was wrong about the POST facet. I keep forgetting that 3xx redirects can do POST → GET but not vice versa; sorry about that!

After researching it a little more, it looks like a client-side redirect can instead be done by reflecting it off of the undocumented always-forward layer:

proxy:
  always-forward:
    - "/really-write/*"

Then, your replica would do a temporary GET redirect to /really-write/resources/:id, which the LiteFS proxy would, after the client re-did the request, forward to the primary (even though it was still a GET).
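
On the replica, that could look something like this (sketch only; Express assumed, with findResource standing in for your SELECT):

app.get("/resources/:id", async (req, res) => {
  const row = await findResource(req.params.id);   // read-only SELECT
  if (row) return res.json(row);
  // 307 keeps the method as GET; when the client re-requests /really-write/...,
  // the LiteFS proxy forwards it to the primary per the always-forward rule above
  res.redirect(307, `/really-write/resources/${req.params.id}`);
});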

The main caveat here—which I was hoping to avoid with the POST—is that the LiteFS proxy will not update the TXID cookie. So, you would need to find some other way of ensuring read-your-writes consistency.

(Arguably, this is an oversight in the always-forward implementation. The original motivation was a write within a GET.)

One of the neat things about this is that it’s actually the Fly edge proxy that does the calling of the primary. Your replica just has to include the Fly-Replay header in a[n otherwise mostly arbitrary] response…

Fly-Replay: instance=ab87ef0077cdff  # from /litefs/.primary

The edge proxy itself intercepts this and then sends the client’s original request to the primary.

(And the client gets the response from the primary—via the latter’s LiteFS proxy and the edge proxy—without the replica being further involved.)

This way would also require you to do the work of ensuring read-your-writes for your users.

Yeah, that’s a fair question… Reading the source code, I see that the positional arguments to litefs mount “Override” the entire exec section of the .yml file, but I don’t see that documented anywhere.

In your case, I think you would want to use the exec section for your backend process group, and then override for the worker:

[processes]
  backend = ""  # use litefs.yml's exec.
  worker  = "-- npm run worker:start"

(I did need the very odd-looking -- when I tried something analogous just now, myself. Otherwise, litefs mount complains about “too many arguments”.)

This way, you can still have the migrations run on the backend process.

Regarding the always-forward idea
My backend, which figures out that the data doesn’t exist, returns a redirect to some endpoint listed in this always-forward config in litefs.yml, which will have the client communicating with the primary, which can write.

Why do I need to ensure read-your-writes consistency in this flow? The primary will have that consistency after having written the data, since it’s the one doing the write?

Regarding the fly-replay idea
To me this sounds functionally identical to the always-forward idea? Still I don’t see why the read-your-writes consistency is needed, given that the rest of the communication (post return/redirect) is happening between the client ↔ primary.

Regarding the litefs cmd/exec
So if I set it up the way you’re mentioning, then I can rely on the following sequence:

  1. litefs mount runs
  2. Migrations run (probably some ORM command; I’m using the drizzle CLI currently). I guess this command should be the one in exec[0].cmd? Though, I read it expects a long-running command, and migrations aren’t long-running. Do I need to bundle this step as part of the backend process, meaning running it prior to running the webserver (Express in my case)?
  3. Backend process starts given migrations having been run successfully
  4. Worker process starts given migrations having been run successfully

Thanks again for the assistance. I think things are slowly starting to make sense.

It’s not a given that you need to do anything extra, but it does come up a lot…

For example, if your /resources/:id page contains a link to a /subresources/:subid that you also just loaded from the external source. Then a click on the latter would go to the replica and (possibly) fail.

[Due to outrunning the LiteFS updates.]

The main point is to think it through all the way…

It’s true that this is the same overall idea, but it’s a different mechanism. Here you save a trip all the way back to the client.

Yes, with the if-candidate conditional.

Your backend is then the second cmd, and that one is long-running.

The docs imply that this will happen, I think, but generally you have to be prepared for some schema skew.

It might be wise to err on the side of safety…

Ok, so I’m back into this topic again after a couple of days of no time.

I’m going to try out the fly-replay approach, to learn and also to avoid the extra roundtrip to the client. I’ll tackle the read-your-writes consistency whenever that comes up.

The first issue I’m encountering (haven’t tried yet, just by looking at docs and the code I’ve written thus far) is that the contents of the /litefs/.primary file don’t seem to correspond with the instance=<id> of the Fly-Replay header that you mentioned. It gives you the hostname instead, as documented here. How can I use that?

In addition, I’m struggling to understand the processes and mounts sections of the Fly configuration. With a configuration like the following (combining mounts with what you’ve laid out for me):

[mounts]
  source = 'litefs'
  destination = '/var/lib/litefs'
  processes = ["backend", "worker"]

[processes]
  backend = ""  # use litefs.yml's exec.
  worker  = "-- npm run worker:start"

How can I actually refer to the "backend" process here inside of the [mounts] section? Since it’s now run by the litefs exec? And in addition, will these actually mount the same volume? Related discussion over at Mounts per process group for Apps v2 - #8 by natejonesrisekit

I’m also realising that if I replicate my application this way, then the read replica worker can’t write either, which it needs to do. So maybe I’d be better off running the worker process together with the primary backend node, and when replicating, that’ll only be the backend/api, since that’s what’s needed to be closer to the users. Then again, LiteFS will promote/demote a machine as primary depending on various factors, so how would I handle that situation? Fly-Replay wouldn’t work as the worker logic is not based off of an incoming request.

Maybe I can use hivemind or some other supervisor as described here, which will run everything inside of one single process. That way I can mount one volume and have it accessible from both the backend and worker processes. I still wouldn’t be able to put another node somewhere else in the world, as the worker would still try to write on the non-primary nodes…

I feel like I’m digging my own grave trying to figure this out. Postgres was easier :smiley:

Any ideas?

for that model (workers act on the database and there’s also a primary web node acting on the database) you’d ideally have both on the same machine (otherwise you’ll spend a lot of time acquiring write locks and completely obliterate the advantages of an in-process database)

so hivemind (or similar) is how you’d approach that. one way to do it would be to not run the workers on geographically distributed machines and only run the workers when they’re elected as the primary. last I went looking this was mostly left as an exercise to the reader and it’s something I haven’t tackled because a single machine serves up my content just fine

Thanks for the reply!

So if I have:

  • 1 Fly app running hivemind
  • hivemind running both processes (backend, worker)

Then I can fly-replay requests made to “backend” which need to write, and I can opt to run the “worker” only when that machine is deemed the primary, and since there will always be an elected primary, there will always be a running worker.

How would I set up the worker to run only “when they’re elected as the primary”? Would that happen as part of hivemind, because I assume that’s a decision for that supervisor?

I think this setup would work pretty well as I spin up more backend replicas closer to users (the whole point).

EDIT
And isn’t this complicated by the dynamic nature of the election itself? Will the Dockerfile entrypoint => litefs.yml exec[0].cmd => hivemind run every time this happens? Because it sounds like this decision needs to happen here. Seems like this is related to Consul?

Regarding the earlier question…

Those are both hostnames:

$ fly ssh console -s -a litefs-using-app  # select primary
# hostname
fb88cc45ff3029
# exit

$ fly ssh console -s -a litefs-using-app  # select replica
# cat /litefs/.primary
fb88cc45ff3029
# curl -i -X POST 'http://localhost:8080/'
HTTP/1.1 200 OK
Fly-Replay: instance=fb88cc45ff3029
Date: Thu, 02 May 2024 16:11:19 GMT
Content-Length: 0

Regarding overmind

Glancing at its documentation, it looks like it has a Unix-domain socket that you can use to request that it start or stop specific processes—while others stay running. I think you could thus write a script of your own that monitored the LiteFS event stream and then issued an overmind stop worker when it saw "isPrimary": false, etc.

I think you would also want to make the last litefs.yml exec be another small script that compared FLY_REGION to PRIMARY_REGION and then, if those differed, started only the backend on that node.

(These I haven’t tried myself, though.)

I think this makes sense. So, to summarise all of this:

  • Add a supervisor process such as hivemind/overmind
  • This supervisor is the glue between litefs mount and backend and worker processes
  • Whenever a deployment happens, if the node is the primary node after litefs mount has run (as known by the litefs exec), then we run the migrations, after which we run a script which starts the supervisor. This script will always start backend but will only start worker if the node is the primary
  • Whenever a node gets elected as primary after a deployment, this is broadcasted as a primaryChange as per the LiteFS event stream, and the supervisor can act accordingly by killing/starting the worker on that node. This is done by a script, which is also run by the supervisor? Or where does this fit?

Thoughts?

I feel like I’m quite far away from my comfortable frontend devving at this point :smiley:

So here’s my update after following the advice in this thread. Might be useful as a reference for other people working to make a similar setup function.

It runs one process group, “app”, on one VM/Machine, which runs LiteFS, which in turn runs the overmind supervisor, which in turn runs the backend, worker and poller processes (the poller decides whether to start or stop the worker process based on primary status).

fly.toml

...
[mounts]
  source = 'litefs'
  destination = '/var/lib/litefs'

[vm]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 512 # Shared by backend,worker,poller
  processes = ["app"]

Dockerfile

# The process starts with litefs mount, which will afterwards boot overmind, which will boot individual processes
ENTRYPOINT litefs mount -config deployments/litefs.yml

litefs.yml

exec:
  - cmd: "npm run db:migrate:deployed"
    if-candidate: true

  # Initiate the long running process supervisor "overmind" to start the application
  # This path is relative to this file
  - cmd: "./deployments/scripts/start.sh"

start.sh

#!/bin/sh

if [ "$PRIMARY_REGION" = "$FLY_REGION" ]; then
    # In the primary region, we start the backend, worker and poller processes
    overmind start -f deployments/Procfile -l backend,worker,poller
else
    # In the secondary region, we only start the backend and poller process
    overmind start -f deployments/Procfile -l backend,poller
fi

Procfile

backend: npm run backend:start
worker: npm run worker:start

# poll LiteFS to start/stop the worker process based on primary status
poller: ./scripts/poll_litefs_events.sh

poll_litefs_events.sh

#!/bin/bash

# Poll the LiteFS event stream every second, and start/stop the worker process based on the primary status
# (bash rather than plain sh, since the [[ ]] tests below aren't POSIX)
while true; do
    # The /events endpoint streams one JSON event per line and stays open, so cap each poll at one
    # second and look at the most recent event received; note that isPrimary lives under .data
    # (a change that happens between polls will be missed, which is one reason polling isn't ideal here)
    response=$(curl -s --max-time 1 localhost:20202/events | jq -r '"\(.type) \(.data.isPrimary)"' | tail -n 1)

    # We're looking for a response such as https://fly.io/docs/litefs/events/#primarychange
    # {
    #     "type": "primaryChange",
    #     "data": {
    #         "isPrimary": false,
    #         "hostname": "myPrimaryHost:20202"
    #     }
    # }

    # If we are not the primary anymore, stop the worker process
    if [[ $response == "primaryChange false" ]]; then
        echo "primaryChange is false, stopping worker"
        overmind stop worker
    # If we become the primary, start the worker process
    elif [[ $response == "primaryChange true" ]]; then
        echo "primaryChange is true, starting worker"
        overmind start worker
    fi
    sleep 1
done

package.json

...
// Migrate commands
"db:migrate": "npx tsx src/db/migrate.ts",
"db:migrate:deployed": "SQLITE_PATH=/litefs/db npm run db:migrate"
// Boot commands
"backend:start": "NODE_ENV=production node dist/backend/index.js", // Listens on :8080
"worker:start": "NODE_ENV=production node dist/worker/index.js",

Endpoint handler code

handler.ts (GET /resources/:id)


  // This code is reached if the data needs to be fetched externally and then written to the DB

  // If we're running in deployed environments (dev, prod) we need to use the primary node
  // Since the following logic relies on writing to the database
  if (process.env.NODE_ENV === "production") {
    const primaryNodeLocation = await getPrimaryNodeLocation();

    console.log("Primary node location", primaryNodeLocation);

    // /litefs/.primary only exists on replicas (it holds the primary's hostname), so if we
    // got a value back we're not on the primary and need to replay the request to it
    if (primaryNodeLocation) {
      return res
        .setHeader("Fly-Replay", `instance=${primaryNodeLocation}`)
        .status(200)
        .send();
    }
  }

litefs.ts

import fs from "fs/promises";

import { LITE_FS_PRIMARY_FILE_LOCATION } from "../config";

export async function getPrimaryNodeLocation() {
  try {
    const file = await fs.readFile(LITE_FS_PRIMARY_FILE_LOCATION, "utf8");
    return file.trim();
  } catch (err) {
    console.error(
      "An error occurred while reading LiteFS primary file location",
      err
    );
    return null;
  }
}

There’s nothing in the worker logic that has been adjusted, because I don’t think that should be needed.

I think that covers all of the important parts. I’ve been able to have this work in the primary region. The setup seems a bit finicky though.

Questions:

  • Does anything stand out as problematic to you in this setup?
  • When I run fly m clone to replicate into a new region, it seems to not mount the volume that I already created, with the already written data. Is this a problem? Meaning every replica will have its own volume? It sounds like this will duplicate the data, making it costly in the long run.
  • The poller only checks every second, so it looks like the worker could still pick up tasks from the queue (Redis in my case, via node BullMQ) after the node has lost its primary status. Those tasks would fail. I guess they will be retried when the new worker starts up? Sounds like an edge case, but still.
  • I expected the DB to be connected through the /var/lib/litefs path, but it seems to be /litefs/db. Where is this decided?

Thanks again for all of the assistance.

In addition, it seems like the Fly-Replay approach (after having figured out that the data doesn’t exist in the read replica, and that we’re not the primary) is slower than just not having the read replica at all.

EDIT: This is pretty obvious: hitting one server before being replayed to the correct one is going to be slower. I just wish there was a way to know in advance that this is a “new resource that isn’t in the DB yet”.

Only after having written the data and said data has been replicated are the read replicas faster.

This write request is crucial for my app unfortunately, so I’m wondering how I can speed it up…

See attached images illustrating this.

Scenario 1


Scenario 2

Scenario 3

Thanks for the diagrams! I see a few milliseconds’ slowdown in the transatlantic replay case (cdg → ord) myself.

If Scenario 3 isn’t dominant then read replicas may not be the best solution.

(The same would be true with distributed Postgres.)


You might instead want a database that can accept writes at multiple nodes and then gradually reconcile them behind the scenes, well after the clients have disconnected.

That can be a gigantic pain sometimes, or super-easy in other cases. It really depends on the exact details, :dragon:

I think I might be able to serve the freshly fetched content from any of the nodes (primary or read replica) when the data doesn’t exist yet, and in that case also enqueue a background job which will download and write it, after which it will be served from wherever. This ought to be quicker than replaying the request to the primary like I’m currently doing. It also means I can start using the LiteFS proxy, since my GET now only reads.
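
Roughly what I have in mind (sketch only; reusing the made-up helpers from my first post, with a BullMQ queue whose name is a placeholder):

import { Queue } from "bullmq";

// Any node (primary or replica) can enqueue; only the worker on the primary does the write
const downloads = new Queue("resource-downloads", {
  connection: { host: process.env.REDIS_HOST ?? "localhost", port: 6379 },
});

app.get("/resources/:id", async (req, res) => {
  const row = await findResource(req.params.id);              // read-only SELECT
  if (row) return res.json(row);

  // Serve the externally fetched data straight away from whichever node we're on...
  const data = await fetchFromExternalSource(req.params.id);
  // ...and let a background job (picked up by the worker on the primary) persist it
  await downloads.add("fetch-and-insert", { id: req.params.id });
  res.json(data);
});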

  • What’s your thought on this approach?

Remaining questions (still wondering about these)

  • Am I expected to have a new volume created with duplicate data for each read replica? Won’t that be costly?
  • I expected the DB to be connected through the /var/lib/litefs path, but it seems to be /litefs/db. Where is this decided?
  • The poller only checks every second, so it looks like the worker could still pick up tasks from the queue (Redis in my case, via node BullMQ) after the node has lost its primary status. Those tasks would fail. I guess they will be retried when the new worker starts up? Sounds like an edge case, but still.

That sounds pretty good, :black_cat:, if the external source can be relied on to return the same thing.

Machines never share volumes, so you do need multiple. Many people find that they can run 3 LiteFS nodes while staying within the free allowance.

But if you have more than 1GB of data already, then it will cost more, it’s true.

LiteFS replicates the entire database, rather than individual rows, which is both a strength and a weakness.


Tangent: People have asked in the past whether they could just store the non-primary-candidate nodes’ copies on those Machines’ built-in ephemeral root partitions instead. The LiteFS docs say that you “must place this directory on a persistent volume”, and Fly generally wishes that the root partition could be made strictly read-only. Moreover, the exact size and the like are not set in stone. I wouldn’t recommend using it for anything serious, even if it maybe does work at present…


Your Node application should be writing (only) to /litefs/, but LiteFS will be transparently turning those into actions on /var/lib/litefs/. This is decided in litefs.yml, in the fuse and data sections.

FUSE is the mechanism that allows LiteFS to intercept reads from and writes to the /litefs/ filesystem hierarchy.
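
For reference, the relevant litefs.yml fragment looks like this (using the same two paths your setup already has):

# The FUSE mount your app reads and writes (where /litefs/db lives)
fuse:
  dir: "/litefs"

# Where LiteFS keeps the underlying data files, on the persistent volume
data:
  dir: "/var/lib/litefs"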

It is, but you definitely do want retry logic. Jobs can fail even in the single-node case, to be sure.

(Also, the LiteFS event stream is made to be kept open and read line by line, rather than polled. This isn’t super-convenient in bash; I was thinking more of Ruby or Racket.)
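
In Node/TypeScript it’s not too bad either, for what it’s worth; an untested sketch (assumes overmind is on PATH and the Procfile process is named worker):

import http from "node:http";
import readline from "node:readline";
import { execFile } from "node:child_process";

// Keep the LiteFS event stream open and toggle the worker whenever the primary status changes
http.get("http://localhost:20202/events", (res) => {
  const lines = readline.createInterface({ input: res });
  lines.on("line", (line) => {
    if (!line.trim()) return;
    const event = JSON.parse(line);
    if (event.type !== "primaryChange") return;
    const action = event.data.isPrimary ? "start" : "stop";
    execFile("overmind", [action, "worker"], (err) => {
      if (err) console.error(`overmind ${action} worker failed`, err);
    });
  });
}).on("error", (err) => console.error("LiteFS event stream error", err));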

Hi again. All of this makes sense, thanks for clarifying.

I’m preparing to launch with LiteFS, and I’m currently running 2 nodes (fly scale count 2) with minimum 1 running (because my cold starts are crazy slow for now, will work on this later), in the primary region of ams (Amsterdam). Because of

litefs.yml

candidate: ${FLY_REGION == PRIMARY_REGION}
promote: true

I imagine both of my nodes are eligible to become the primary, but only one should be?

When I SSH into both of them with

fly ssh console --machine id

I couldn’t see the /litefs/.primary file on either machine. This tells me something is off, and potentially my worker will write on a read replica? Or are both “primary” in the primary region? How would that work, since they have individual volumes?

Worth noting is that one of those machines was stopped and the other was started. Should I be concerned here? How can I test this?

Typically, you will have at least two that are candidates. That way, one can take over—if the other crashes, needs to be upgraded, etc.

(For best results, min_machines_running = 2, otherwise the second won’t have all the latest, :snowflake:.)

Your future nodes in Chicago or Sydney (or wherever) will not be candidates, though, and hence will never become primary.

No, only one at a time. The lease machinery is what enforces that.

Newly started candidates will ask to be named primary automatically (due to promote: true), so that may have confused the issue.

Try turning off auto-stop temporarily and then retry.

(You should also see various chatter about primary lease acquired, etc., in the logs.)

I managed to setup a test with a minimum of 2 nodes running in the primary region.

Both of them were running, and when I stopped the primary one (running the worker) it took quite a while for the other running machine to get promoted (primaryChange event), like 1 minute I think. Is this expected?

There was no traffic during this time, so I wonder if that was related. Meaning no traffic to the database (writes or reads).

In addition, I keep getting this “You should run two volumes in production” warning. Should I do this even with LiteFS Cloud?

No, that is too slow, by two orders of magnitude… I see handover within one second, even on shared-CPU machines.

Maybe try looking at swap usage; some of your other comments make it sound like your nodes could be short on RAM…

That’s also surprising. If your [mounts] section is still like above, then flyctl shouldn’t have allowed you to deploy two machines without also having two volumes (either preexisting or ones that it made on the spot for you). What do you see in fly vol list?