Working on a Chubby-style lock service in Elixir, considering Fly.io for deployment. Would love your suggestions!

Hey everyone,

I’m currently working on a Chubby-inspired distributed lock service in Elixir/OTP and am considering Fly.io for deploying it and running experiments.

This is a learning-focused implementation (not full Chubby), but I’m trying to capture the core ideas around distributed coordination:

  • coarse-grained locks (service-level coordination)

  • lease-based sessions (locks auto-release on expiry)

  • shared + exclusive lock modes

  • fencing tokens (lock sequencers)

  • push-based event notifications (with retry)
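
To make the lease, lock-mode, and fencing-token items concrete, here is a rough, language-neutral sketch in Python (not the actual Elixir code). All names (`LockTable`, `keep_alive`, `try_acquire`) are hypothetical, and it assumes the sequencer is bumped each time a lock goes from free to held, with shared holders seeing the same value. The time `now` is passed in explicitly so the logic stays deterministic, which is what a replicated state machine would need.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Lock:
    mode: Optional[str] = None  # "shared", "exclusive", or None when free
    holders: Set[str] = field(default_factory=set)
    sequencer: int = 0  # fencing token; bumped when the lock goes free -> held


class LockTable:
    """Hypothetical in-memory lock table with lease-based sessions."""

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self.locks: Dict[str, Lock] = {}
        self.lease_expiry: Dict[str, float] = {}  # session id -> expiry time

    def keep_alive(self, session: str, now: float) -> None:
        # Opens a session or extends its lease.
        self.lease_expiry[session] = now + self.lease_seconds

    def expire_sessions(self, now: float) -> None:
        # Drop sessions whose lease has run out and release their locks.
        dead = [s for s, t in self.lease_expiry.items() if t <= now]
        for session in dead:
            del self.lease_expiry[session]
            for lock in self.locks.values():
                lock.holders.discard(session)
                if not lock.holders:
                    lock.mode = None

    def try_acquire(self, key: str, session: str, mode: str,
                    now: float) -> Optional[int]:
        # Non-blocking acquire: returns the fencing token, or None on failure.
        self.expire_sessions(now)
        if session not in self.lease_expiry:
            return None  # no live lease, so no locks
        lock = self.locks.setdefault(key, Lock())
        if lock.mode is None:
            lock.mode = mode
            lock.holders = {session}
            lock.sequencer += 1
            return lock.sequencer
        if lock.mode == "shared" and mode == "shared":
            lock.holders.add(session)
            return lock.sequencer
        return None  # conflicting mode
```

A downstream resource would then reject any request carrying a token lower than the highest one it has already seen, which is what protects it from a holder whose lease expired without it noticing.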

Design decisions

  • Consensus: Raft (using an Elixir implementation) for consistent lock metadata

  • Sessions: GenServer per client session (leveraging OTP for lifecycle + crash cleanup)

  • Lock acquisition: blocking with timeout + try-lock

  • Namespace: flat key space (no hierarchical FS)

  • Event delivery: push-based with retry (outbox pattern in Raft state)

  • Deployment: starting with a single cluster (not tackling cross-DC yet)
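
For the event-delivery item, this is roughly the outbox shape I have in mind, again as a Python sketch rather than the Elixir code. The names (`Outbox`, `deliver_pending`, `send`) are hypothetical. One simplification to flag: here `attempts` and `ack` mutate the outbox directly, whereas in the real design they would presumably be commands going through the Raft log.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    event_id: int
    subscriber: str
    payload: str
    attempts: int = 0


class Outbox:
    """Hypothetical outbox: events stay pending until explicitly acked."""

    def __init__(self) -> None:
        self.next_id = 1
        self.pending: Dict[int, Event] = {}

    def enqueue(self, subscriber: str, payload: str) -> int:
        event = Event(self.next_id, subscriber, payload)
        self.pending[event.event_id] = event
        self.next_id += 1
        return event.event_id

    def ack(self, event_id: int) -> None:
        # Idempotent: acking an unknown or already-acked id is a no-op.
        self.pending.pop(event_id, None)


def deliver_pending(outbox: Outbox, send: Callable[[Event], bool],
                    max_attempts: int = 5) -> List[int]:
    """Run one delivery pass in id order; return the ids that were acked."""
    delivered = []
    # sorted() builds a new list, so acking while looping is safe.
    for event in sorted(outbox.pending.values(), key=lambda e: e.event_id):
        if event.attempts >= max_attempts:
            continue  # left in the outbox for inspection
        event.attempts += 1
        if send(event):
            outbox.ack(event.event_id)
            delivered.append(event.event_id)
    return delivered
```

Since an ack can be lost after a successful send, this gives at-least-once delivery, so subscribers would need to deduplicate on `event_id`.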

What I want to explore

I’m planning to deploy nodes across regions and experiment with:

  • lease expiry behavior under latency

  • leader failover (node crashes / restarts)

  • lock contention patterns

  • impact of network delays on coordination

Fly.io seems like a great fit for this kind of setup, especially for running BEAM nodes in multiple regions and simulating real-world failure scenarios.

Questions

  • What’s the recommended way to run BEAM clusters on Fly across regions?

  • Any known pitfalls with distributed Erlang in this setup?

  • Would Fly Machines or Nomad be better suited for this kind of system?

  • How reliable is private networking for coordination-heavy workloads like this?

If this sounds interesting, I’d be happy to share results and a detailed write-up once I have a working version.

Also open to feedback if anyone has built something similar on Fly.

Thanks!