Hey everyone,
I’m currently working on a Chubby-inspired distributed lock service in Elixir/OTP and considering Fly.io for deploying and running experiments.
This is a learning-focused implementation (not full Chubby), but I’m trying to capture the core ideas around distributed coordination:
- coarse-grained locks (service-level coordination)
- lease-based sessions (locks auto-release on expiry)
- shared + exclusive lock modes
- fencing tokens (lock sequencers)
- push-based event notifications (with retry)
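To make the fencing-token idea concrete, here's a minimal sketch of how a resource could reject writes carrying a stale sequencer. All names (`FencedStore`, `write/2`) are illustrative, not from my actual code; the real service would hand out the token as part of the lock grant:

```elixir
# Hypothetical sketch of fencing-token checking at a resource. The lock
# service issues a monotonically increasing sequencer with each grant;
# the resource accepts a write only if its token is newer than any seen.
defmodule FencedStore do
  use Agent

  def start_link(_opts \\ []) do
    # State: {highest token seen so far, stored value}
    Agent.start_link(fn -> {0, nil} end, name: __MODULE__)
  end

  # Accept the write only if the fencing token beats the highest seen.
  def write(token, value) do
    Agent.get_and_update(__MODULE__, fn {highest, old} ->
      if token > highest do
        {:ok, {token, value}}
      else
        {{:error, :stale_token}, {highest, old}}
      end
    end)
  end

  def read, do: Agent.get(__MODULE__, fn {_token, value} -> value end)
end
```

The point is that even if a client's lease expired and another client got the lock (and a higher token), the slow client's delayed writes are refused.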
Design decisions
- Consensus: Raft (using an Elixir implementation) for consistent lock metadata
- Sessions: one GenServer per client session (leveraging OTP for lifecycle + crash cleanup)
- Lock acquisition: blocking with timeout, plus try-lock
- Namespace: flat key space (no hierarchical FS)
- Event delivery: push-based with retry (outbox pattern in Raft state)
- Deployment: starting with a single cluster (not tackling cross-DC yet)
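For the session design, the shape I have in mind is roughly the following sketch (module and message names are placeholders, and the real version would coordinate with the Raft log rather than just stopping):

```elixir
# Minimal sketch of a per-client session GenServer with lease-based expiry.
# Each keep-alive resets the expiry timer; if the lease lapses the process
# stops, and anything monitoring it can release the locks that session held.
defmodule Session do
  use GenServer

  def start_link(lease_ms), do: GenServer.start_link(__MODULE__, lease_ms)
  def keep_alive(pid), do: GenServer.call(pid, :keep_alive)

  @impl true
  def init(lease_ms) do
    {:ok, %{lease_ms: lease_ms, timer: schedule(lease_ms)}}
  end

  @impl true
  def handle_call(:keep_alive, _from, state) do
    # Client heartbeat: cancel the pending expiry and start a fresh lease.
    Process.cancel_timer(state.timer)
    {:reply, :ok, %{state | timer: schedule(state.lease_ms)}}
  end

  @impl true
  def handle_info(:lease_expired, state) do
    # Lease lapsed with no keep-alive: stop with a descriptive reason so
    # monitors/links observe :lease_expired and can clean up lock state.
    {:stop, :lease_expired, state}
  end

  defp schedule(ms), do: Process.send_after(self(), :lease_expired, ms)
end
```

This is the part where OTP does the heavy lifting: session death (whether by lease expiry or crash) is just a `:DOWN` message to whoever owns the lock table.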
What I want to explore
I’m planning to deploy nodes across regions and experiment with:
- lease expiry behavior under latency
- leader failover (node crashes / restarts)
- lock contention patterns
- impact of network delays on coordination
Fly.io seems like a great fit for this kind of setup, especially for running BEAM nodes in multiple regions and simulating real-world failure scenarios.
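For clustering, my current assumption (happy to be corrected) is libcluster's `DNSPoll` strategy against the app's `.internal` name on Fly's private network. A config sketch, with `myapp` as a placeholder app name:

```elixir
# config/runtime.exs (sketch, untested on Fly): discover peers by polling
# the Fly-internal DNS name, which resolves to every instance's 6PN address.
config :libcluster,
  topologies: [
    fly6pn: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        polling_interval: 5_000,
        query: "myapp.internal",
        node_basename: "myapp"
      ]
    ]
  ]
```

If there's a more idiomatic setup for multi-region BEAM clusters on Fly, that's exactly what I'm asking about below.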
Questions
- What’s the recommended way to run BEAM clusters on Fly across regions?
- Any known pitfalls with distributed Erlang in this setup?
- Would Fly Machines or Nomad be better suited for this kind of system?
- How reliable is private networking for coordination-heavy workloads like this?
If this sounds interesting, I’d be happy to share results and a detailed write-up once I have a working version. I’m also open to feedback from anyone who has built something similar on Fly.
Thanks!