I deployed a small app to test LiteFS functionality. I’ll share the Dockerfile, YAML, and TOML files in my next comment.
The deployment was successful. I was able to scale the application by adding another machine in the same region and cloning the main machine to a different region - both worked properly.
I then began turning machines on and off to understand how LiteFS operates. My understanding is that LiteFS acts as a WAL journal mode for SQLite distributed across machines, where only the primary machine can perform writes. Communication between machines happens through fly-replay under the hood.
First, I tested the primary machine by turning it on and off to verify the volume was working correctly - it was.
Next, I scaled the application by cloning to another region using fly m clone ... and added another machine to the same region with fly scale count 2. Then I started testing:
With only the primary machine running: I could perform both reads and writes without issues
With the primary machine turned off: I could still perform read operations without problems
After turning the primary machine back on: I could no longer write to the database and got the error "attempt to write a readonly database"
I then tried deleting the primary machine to see if other machines (like the cloned one) would assume the primary role - they did not
Does this mean that my primary machine needs to run 24/7, and that once it’s turned off, database writes become impossible even after restarting it?
fly launch --no-deploy
fly consul attach
fly volumes create litefs --size 1
fly deploy
# then scaling
fly scale count 2
fly m clone --select --region sjc
# Note: After deployment it was possible to clone the original (first) machine,
# but every time I tried to "scale" to another region or to clone any machine
# other than the first one, I got this error:
# Error: failed to launch VM: insufficient resources to create new machine with existing volume
Hi… LiteFS really is fun to play with, but there are some details that the official documentation either doesn’t emphasize or leaves entirely to the reader to infer from their existing distributed-systems knowledge…
Starting with the easier one: the error you were seeing is probably because you had internal_port configured incorrectly, so traffic was bypassing the Fly-Replay mechanism completely (since the LiteFS proxy was out of the loop).
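Roughly, the ports are supposed to line up like this (a sketch, assuming your app listens on 3000 and the LiteFS proxy on 8080; substitute your actual ports and database name):

# litefs.yml - the LiteFS proxy sits in front of the app
proxy:
  addr: ":8080"             # port the proxy listens on
  target: "localhost:3000"  # port the app itself listens on
  db: "db"                  # database name to track (example value)

# fly.toml - Fly's router must send traffic to the proxy, not the app
[http_service]
  internal_port = 8080

That way every request passes through the proxy, which can issue the Fly-Replay to the primary when a write comes in on a replica.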
The following older post has an explicit table showing how things are supposed to match:
Communication between machines is actually mainly through the .internal network (a.k.a. 6PN), though optionally also via Fly-Replay, e.g. when redirecting incoming POSTs to the primary.
You do want something close to 24/7 availability, but for different reasons. One of the things left to deduction is that you should always have two primary candidates running at (almost) all times, i.e., min_machines_running = 2. This makes sense if you consider what happens when the existing primary fails but the replacement candidate has been asleep for the past 3 months…
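In fly.toml terms that is roughly the following (a sketch; internal_port is just the proxy port from your own config, and min_machines_running only counts machines in the primary region):

[http_service]
  internal_port = 8080        # LiteFS proxy port (example value)
  auto_stop_machines = true   # machines may still be stopped when idle...
  min_machines_running = 2    # ...but never drop below two primary candidates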
As a final tip, always look at the logs (fly logs) and event stream when doing experiments with LiteFS handovers; those really tend to clear the mists of who currently has what baton, etc.
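One quick way to see who currently holds the lease, assuming /litefs is your FUSE mount directory: on a replica, LiteFS keeps a .primary file there containing the primary’s hostname, and on the primary itself the file doesn’t exist.

fly logs
fly ssh console --select -C "cat /litefs/.primary"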
Okay, so there were two main issues I ran into. First, the fly launch command keeps reverting internal_port back to 3000, which is annoying. You have to manually set it back to 8080 in fly.toml before you deploy.
The second mistake was the Consul lease candidates. I had it set so that only the lax region could become primary, but you can just set candidate to true and let Consul pick the region automatically. If the machines in the primary region fail, a machine from another region can now be chosen as primary - no more "attempt to write a readonly database" error.
Here’s what changed in the config:
# Before - only the lax (primary) region could become primary:
lease:
  type: "consul"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true

# After - any region can become primary:
lease:
  type: "consul"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  candidate: true
  promote: true
Fly steps:
fly launch --no-deploy
fly consul attach
fly volumes create litefs --size 1 -r lax
# Important - fly launch resets internal_port to 3000,
# so you have to manually change it back to 8080 in fly.toml
# under [http_service] before the next step (see the snippet below).
fly deploy --config [your_app].toml
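For reference, here is roughly what the relevant fly.toml sections look like after the manual fix (the mount destination has to match the data dir in litefs.yml; these values are from the LiteFS example, yours may differ):

[http_service]
  internal_port = 8080             # LiteFS proxy port, not the app port

[mounts]
  source = "litefs"                # the volume created above
  destination = "/var/lib/litefs"  # must match data: dir: in litefs.yml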
And that’s it. LiteFS is working now.
If you have any suggestions on how to improve the Dockerfile, let me know, but the one from the example works fine.
PS: this is the most up-to-date tutorial on how to start with LiteFS lol