LiteFS sync failure

Hello,

I deployed my app using LiteFS, and the following scenario breaks synchronization:

  • machine 1 and machine 2 are stopped.
  • a user requests the app.
  • machine 1 starts to serve the request.
  • machine 1 receives a POST request.
  • machine 1 stops.
  • a user requests the app.
  • machine 2 starts and is promoted to primary without replicating from machine 1.
  • machine 2 receives a POST request and writes to its local database.

You can easily reproduce this by manually starting and stopping Machines.
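
For instance, with flyctl, where <machine-1> and <machine-2> are placeholders for two Machine IDs in the app (a sketch of the order of operations, not exact commands from my history):

fly machine stop <machine-1>
fly machine stop <machine-2>     # everything is stopped
fly machine start <machine-1>    # machine 1 wakes and accepts a write
fly machine stop <machine-1>
fly machine start <machine-2>    # machine 2 wakes, becomes primary, never sees that write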

I thought that if all machines were stopped, Fly would remember the last primary instance, so that a newly started instance would replicate from it. However, that does not happen. In fact, fly status shows the following:

PROCESS	ID            	VERSION	REGION	STATE  	ROLE   	CHECKS	LAST UPDATED         
app    	568372e0b7238e	20     	arn   	stopped	primary	     	2025-10-21T20:12:36Z	
app    	9185034db11683	20     	arn   	stopped	primary	     	2025-10-21T20:06:13Z	
app    	e2863e92b77486	20     	arn   	stopped	primary	     	2025-10-21T18:48:53Z	

My litefs.yml is identical to the one in the docs, except for the rails db:migrate line.
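
For reference, the relevant exec section looks roughly like this (a sketch based on the LiteFS Rails example; the exact commands depend on your setup):

exec:
  - cmd: "./bin/rails db:migrate"
    if-candidate: true           # run migrations only on the node that can become primary
  - cmd: "./bin/rails server"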

No, this is a known limitation, :dragon:. You need to keep at least two Machines running at all times in the primary region; otherwise you will see regressions of the type you encountered. (Other users have bumped into this in the past.)

The official recommendation is stronger now, by the way:

Do not combine LiteFS with autostop/autostart on Fly Machines. The Fly Proxy’s autoscaler can shut down or restart Machines without any awareness of LiteFS lease ownership or data freshness, which can result in a stale machine winning the lease and LiteFS discarding newer changes and LTX file data—risking rollback and data loss.
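
In fly.toml terms, that amounts to leaving the proxy autoscaler off for the LiteFS app, along these lines (a sketch; internal_port is a placeholder, and older flyctl versions use booleans for auto_stop_machines):

[http_service]
  internal_port = 3000
  auto_stop_machines = "off"
  auto_start_machines = false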

Hope this helps clear up the uncertainty a little!

Thank you for the note.

What about static leasing? Is it compatible with scale to zero?

Not completely to zero, I don’t think. The Fly Proxy’s autoscaling doesn’t know the dependencies between the different Machines, basically. If you wake up a replica and it can’t connect to the primary, then it will typically balk—if memory isn’t failing me. (I think that extra safeguard was introduced in response to an incident within Fly.io’s own infrastructure a couple years back.)

What you could do is create your own mini-scaler, via Fly-Replay, in a small, dedicated router app that knows to always wake up the primary (via the Machines API) before any of the others, etc. That would be a fair amount of work, though.


Aside: That second link covers a more complicated scenario than what you would really need, but it’s the best single summary of all the pieces that I know of…
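
To make that mini-scaler idea a little more concrete, here is a rough Go sketch of such a router. It is not a tested implementation: TARGET_APP, PRIMARY_MACHINE_ID, and FLY_API_TOKEN are placeholder environment variables, the start call is the Machines API's POST /v1/apps/{app}/machines/{id}/start endpoint, and in practice you would also want to wait for the Machine to report started before replaying.

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

// startPrimary asks the Machines API to start the designated primary Machine.
func startPrimary() error {
	url := fmt.Sprintf("https://api.machines.dev/v1/apps/%s/machines/%s/start",
		os.Getenv("TARGET_APP"), os.Getenv("PRIMARY_MACHINE_ID"))
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Wake the primary first, then hand the request back to Fly Proxy
		// via the fly-replay response header so the real app serves it.
		if err := startPrimary(); err != nil {
			log.Printf("failed to start primary: %v", err)
		}
		w.Header().Set("fly-replay", "app="+os.Getenv("TARGET_APP"))
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}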


Thank you @mayailurus.

So, in conclusion, Fly auto-scaling is officially not compatible with LiteFS, and your personal advice is to enable auto-scaling with a minimum of two machines running.


Yep, at least two in the primary region. The min_machines_running = 2 setting will enforce that constraint for you with no effort.
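
In fly.toml that looks roughly like this (a sketch; internal_port is a placeholder, and field names may vary slightly between flyctl versions):

[http_service]
  internal_port = 3000
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 2    # never fewer than two Machines in the primary region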

You can see where I originally got that advice from in the following old thread. (Ben Johnson there is a Fly.io employee—unlike myself.)

https://community.fly.io/t/understanding-litefs-for-rarely-up-architecture/15811

I don’t know the story of why the official recommendation became more conservative; possibly just a lack of docs/Support bandwidth to explain the nuances in full detail, :thinking:…

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.