I’m trying to figure out why this is happening. Will report back with more findings as I’m nearing production. Worth mentioning is that I was monitoring the LiteFS event stream log as this was not happening as quickly as it should have.
Below you can see the Grafana memory graph. Seems to be fine, right? I believe the graph should cover Litefs, backend, worker and poller processes.
You see handover that quickly? That’s interesting. Maybe something is definitely off then, and it’s not memory. And how would I check swap usage?
In case you haven’t noticed I’m quite new to this domain, so I appreciate the extra guidance.
Regarding the volume, I think that warning came when I created it from command line and then attached it actually. So having two machines solves this problem? I thought the error was about having two volumes attached to a single machine.
(The swap lines are configured to be drawn as negative, for some reason.)
Swap could be relevant if the parts of LiteFS that do cluster management were paged out during initial boot, which would mean that they would then have to be read back from disk on handover.
[“Paging” is a technique for using disk to compensate for shortages of RAM, so it’s admittedly not a prime suspect now, given your other graph above.]
What does CPU utilization look like? (Instance Load and CPU)
No, that should be impossible. Assuming that the warning you saw was…
Warning! Every volume is pinned to a specific physical host. You
should create two or more volumes per application.
What Fly is trying to protect people from is the situation of having all their data dependent on the survival of only one volume, since that is a risk of total data loss on this platform. (Not just LiteFS.)
(Indeed, pending commits have descriptions like “make single volume warnings even scarier,” .)
No worries! All of us on the forum are learners, one way or another…
And when I now read the docs about lease management, it seems to be configured as:
advertise-url: "http://$HOSTNAME:20202"
Could this be related to the issue?
For reference, my full config is:
# The fuse section describes settings for the FUSE file system. This file system
# is used as a thin layer between the SQLite client in your application and the
# storage on disk. It intercepts disk writes to determine transaction boundaries
# so that those transactions can be saved and shipped to replicas.
fuse:
dir: "/litefs"
# The data section describes settings for the internal LiteFS storage. We'll
# mount a volume to the data directory so it can be persisted across restarts.
# However, this data should not be accessed directly by the user application.
data:
dir: "/var/lib/litefs"
# This flag ensure that LiteFS continues to run if there is an issue on starup.
# It makes it easy to ssh in and debug any issues you might be having rather
# than continually restarting on initialization failure.
exit-on-error: false
# This section defines settings for the option HTTP proxy.
# This proxy can handle primary forwarding & replica consistency
# for applications that use a single SQLite database.
# proxy:
# addr: ":8080"
# target: "localhost:8080"
# db: "db"
# passthrough:
# - "*.ico"
# - "*.png"
# This section defines a list of commands to run after LiteFS has connected
# and sync'd with the cluster. You can run multiple commands but LiteFS expects
# the last command to be long-running (e.g. an application server). When the
# last command exits, LiteFS is shut down.
exec:
- cmd: "npm run db:migrate:deployed"
if-candidate: true
# Initiate the long running process supervisor "overmind" to start the application
# This path is relative to this file
- cmd: "./deployments/scripts/start.sh"
# The lease section specifies how the cluster will be managed. We're using the
# "consul" lease type so that our application can dynamically change the primary.
#
# These environment variables will be available in your Fly.io application.
lease:
type: "consul"
advertise-url: "http://${FLY_ALLOC_ID}.vm.${FLY_APP_NAME}.internal:20202"
candidate: ${FLY_REGION == PRIMARY_REGION}
promote: true
consul:
url: "${FLY_CONSUL_URL}"
key: "litefs/${FLY_APP_NAME}-new"
This seems to be the main culprit. The primary exits messily, and then the second machine waits a full 2-minute TCP timeout before concluding that it is in fact gone.
My understanding of the docs is that this should instead only take ~10 seconds. I will try to replicate that timeout within my own experimental setup. (I’ve been meaning to look more into the “what could go wrong?” side of LiteFS, for a few months now.)
Regarding “messy”, this is what I see during primary shutdown, for comparison:
INFO Sending signal SIGINT to main child process w/ PID 313
sending signal to exec process
waiting for exec process to close
Web Server stopped. # my own litefs.yml.exec (Racket serve/servlet)
signal received, litefs shutting down
level=INFO msg="<hex>: exiting primary, destroying lease"
litefs shut down complete
INFO Main child exited normally with code: 0
The logs you posted instead show an escalation SIGINT → SIGTERM → VM exit.
[These are increasingly strident shutdown signals.]
This should be ok. The $HOSTNAME in that context refers to the address of the node within the private network. (This admittedly is not identical to /litefs/.primary, discussed earlier; the docs could maybe be polished a little more around these.)
We’re using the <machine_id>.vm.<appname>.internal style, third row of the address-forms table.
As a small side note, you will want to restore/fix this section, before launching to actual users…
And the subprocess script ignores all the shutdown requests and just keeps printing 3 (or 3r, if it thinks it’s a read-only replica).
#!/bin/bash
# ignore both shutdown signals, provoking a messy exit...
trap 'echo 3: sigint' SIGINT
trap 'echo 3: sigterm' SIGTERM
while true; do
SY=""
if [[ -f /litefs/.primary ]]; then
SY="r"
fi
echo "3${SY}"
sleep 30
done
Resulting logs, where 6* is the primary and 5* is the fallback…
22:53:08Z app[6*] ord [info]3
22:53:08Z app[5*] ord [info]3r
22:53:26Z app[6*] ord [info]sending signal to exec process
22:53:26Z app[6*] ord [info]waiting for exec process to close
22:53:26Z app[6*] ord [info] INFO Sending signal SIGINT to
main child process w/ PID 313
22:53:31Z app[6*] ord [info] INFO Sending signal SIGTERM to
main child process w/ PID 313
22:53:36Z app[6*] ord [warn]Virtual machine exited abruptly
22:53:38Z app[5*] ord [info]3r # still sees /litefs/.primary
22:54:08Z app[5*] ord [info]3r
22:54:38Z app[5*] ord [info]3r
22:55:08Z app[5*] ord [info]3r
22:55:38Z app[5*] ord [info]3r # ditto, two minutes later
22:56:06Z app[5*] ord [info]level=INFO msg="<hex>:
disconnected from primary with error, retrying: next frame: read tcp
[fdaa:*]:49148-> fdaa:*]:20202: read: connection timed out"
22:56:07Z app[5*] ord [info]level=INFO msg="<hex>:
primary lease acquired, advertising as
http://5*.vm.<app-name>.internal:20202"
22:56:08Z app[5*] ord [info]3 # finally declared itself primary
The good news is that it looks like the multi-minute handovers will go away if you can get everything to exit on the first SIGINT, which you want anyway.
(And the bad news is that this other aspect may not have a quick resolution…)
Do you mean the fact that if my process can’t exit within 5 seconds from the onset of SIGINT being sent, then I’m potentially screwed for way more than the proclaimed 10 seconds? Or what’s the other aspect?
I’ll try to figure out why my process(es) aren’t shutting down. I’ve never encountered anything like this… Then again, this is the first time I have a process calling a supervisor process, which calls 3 other processes… Not sure how this chaining really works. I’d imagine my bash + 2 node scripts would just terminate.
The way I understand this, is that the poller script receives the SIGINT from my CTRL + C, which exits with 0, which tells overmind to stop the other processes as well. After this has happened, overmind exits with 0, which exits litefs mount with 0, which kills the node gracefully.
Scenario with fly machine stop *e68
2024-05-10T10:38:08Z app[*e68] ams [info] INFO Sending signal SIGINT to main child process w/ PID 313
2024-05-10T10:38:08Z app[*e68] ams [info]sending signal to exec process
2024-05-10T10:38:08Z app[*e68] ams [info]waiting for exec process to close
2024-05-10T10:38:13Z app[*e68] ams [info] INFO Sending signal SIGTERM to main child process w/ PID 313
2024-05-10T10:38:18Z app[*e68] ams [warn]Virtual machine exited abruptly
Here (scenario like before) it looks like either litefs mount process or overmind process doesn’t handle relaying the SIGINT further down the process chain?
That’s a fair question; I rushed that out… Messy VM exits can’t be eliminated entirely, , even though they should be way less frequent than what you’re experiencing.
I doubt that a two minute purgatory is intended in such situations. After all, Fly’s architecture cherishes multiple machines as the way to avoid downtime!
Maybe both of us have overlooked some configuration nuance, or maybe LiteFS itself needs patching—if it’s going to be fully robust against hardware failure, .
In contrast, achieving single-second handover in the (far more common) fly machine stop scenario will likely just take a few more dashes of exec seasoning, and so on.
(As you’ve already intuited, the signal-passing chain can be surprisingly brittle, in general. Some parts of Unix were designed primarily around convenience of implementation on 1970s hardware…)
#!/bin/sh
# Start all processes in the Procfile (including worker)
# Poller will figure out whether or not to kill/restart the worker process based on the LiteFS events ("isPrimary" field)
echo "Starting all Procfile processes..."
overmind start -f deployments/Procfile
to
#!/bin/sh
# Start all processes in the Procfile (including worker)
# Poller will figure out whether or not to kill/restart the worker process based on the LiteFS events ("isPrimary" field)
echo "Starting all Procfile processes..."
exec overmind start -f deployments/Procfile # Add 'exec'
Was the missing piece. Now it looks like
14:13:59Z app[*e68] ams [info] INFO Sending signal SIGINT to main child process w/ PID 313
14:13:59Z app[*e68] ams [info]sending signal to exec process
14:13:59Z app[*e68] ams [info]waiting for exec process to close
14:13:59Z app[*e68] ams [info]backend | Interrupting...
14:13:59Z app[*e68] ams [info]worker | Interrupting...
14:13:59Z app[*e68] ams [info]poller | Interrupting...
14:13:59Z app[*e68] ams [info]backend | Received SIGINT, closing express server...
14:13:59Z app[*e68] ams [info]worker | Received SIGINT, closing worker...
14:13:59Z app[*e68] ams [info]poller | Shutting down...
14:14:00Z app[*e68] ams [info]poller | Exited with code 0
14:14:00Z app[*e68] ams [info]worker | Exited with code 0
14:14:00Z app[*e68] ams [info]backend | Exited with code 0
14:14:01Z app[*e68] ams [info]signal received, litefs shutting down
14:14:01Z app[*e68] ams [info]level=INFO msg="**: stream disconnected ([fdaa:5:a6f6:a7b:42:a729:cecb:2]:42644)"
14:14:01Z app[*e68] ams [info]level=INFO msg="exiting streaming backup"
14:14:01Z app[*e68] ams [info]level=INFO msg="begin streaming backup" full-sync-interval=10s
14:14:01Z app[*e68] ams [info]level=INFO msg="exiting streaming backup"
14:14:01Z app[*e68] ams [info]level=INFO msg="backup stream failed, retrying: fetch position map: Get \"https://litefs.fly.io/pos\": lease expired"
14:14:01Z app[*e68] ams [info]level=INFO msg="primary backup stream exiting"
14:14:01Z app[*e68] ams [info]level=INFO msg="**: exiting primary, destroying lease"
14:14:01Z app[*018] ams [info]level=INFO msg="**: disconnected from primary, retrying"
14:14:01Z app[*018] ams [info]poller | New LiteFS event: {"type":"primaryChange","data":{"isPrimary":false}}
14:14:01Z app[*018] ams [info]poller | isPrimary is false, stopping worker
14:14:01Z app[*e68] ams [info]litefs shut down complete
14:14:02Z app[*e68] ams [info] INFO Main child exited normally with code: 0
14:14:02Z app[*e68] ams [info] INFO Starting clean up.
14:14:02Z app[*e68] ams [info] INFO Umounting /dev/vdb from /var/lib/litefs
14:14:02Z app[*e68] ams [info][ 92.448468] reboot: Restarting system
14:14:02Z app[*018] ams [info]level=INFO msg="**: primary lease acquired, advertising as http://*018.vm.service.internal:20202"
14:14:02Z app[*018] ams [info]level=INFO msg="begin primary backup stream: url=https://litefs.fly.io"
14:14:02Z app[*018] ams [info]poller | New LiteFS event: {"type":"primaryChange","data":{"isPrimary":true}}
14:14:02Z app[*018] ams [info]poller | isPrimary is true, restarting worker
14:14:02Z app[*018] ams [info]worker | Restarting...
14:14:02Z app[*018] ams [info]worker | Restarted with pid 478...
I thought child processes automatically received and relayed stop signals…