Cluster: onth-database, postgres-flex, 3 machines (2× sin, 1× nrt), image flyio/postgres-flex:17.2 v0.0.66 at the time.
Impact: ~3h of intermittent write failures on our API (PG::ReadOnlySqlTransaction + PG::ConnectionBad); reads were largely unaffected.
Timeline (UTC, 2026-06-26):
- 06:53 automatic failover: a standby was promoted to primary (timeline TL2→TL3, fork at LSN 2/DBC132A8).
- 09:12 the former primary machine (7813632b595198, sin) was restarted (host event?) and came back as a zombie: zombie.lock present, logs looping "Unable to confirm that we are the true primary" / "resolved primary ‘’". Its pg health check was critical.
- The zombie kept receiving connections via .flycast despite that critical check, so writes hit a read-only node → PG::ReadOnlySqlTransaction. It was also looping "Failed to restart haproxy on member …: 500" for all members, which appeared to destabilize the proxy layer cluster-wide (PG::ConnectionBad, SSL EOF) even toward the healthy primary.
- ~09:50 I ran fly machine stop on the zombie → errors stopped immediately. I then re-cloned a fresh replica and destroyed the zombie. The cluster has since been updated to v0.1.0.
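To separate the two failure modes in the timeline above, the quoted signatures can be tallied from a log dump. This is only a minimal sketch: the substrings are the ones quoted above, while the bucket names and the idea of feeding it plain text lines are my own assumptions, not anything Fly provides.

```python
from collections import Counter
from typing import Iterable, Optional

# Substrings quoted in the timeline above. The bucket names on the right
# are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "PG::ReadOnlySqlTransaction": "write_hit_read_only_node",
    "PG::ConnectionBad": "connection_dropped",
    "Unable to confirm that we are the true primary": "zombie_loop",
    "Failed to restart haproxy on member": "haproxy_restart_failed",
}


def classify(line: str) -> Optional[str]:
    """Return the bucket for the first signature found in the line, else None."""
    for needle, bucket in SIGNATURES.items():
        if needle in line:
            return bucket
    return None


def tally(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each bucket; unmatched lines are ignored."""
    counts: Counter = Counter()
    for line in lines:
        bucket = classify(line)
        if bucket is not None:
            counts[bucket] += 1
    return counts
```

Running tally over the app logs and the machine logs separately should show whether the PG::ConnectionBad errors line up with the haproxy restart loop or started earlier.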
Hypothesis: a machine that was previously the primary was fenced/downgraded to read-only (“zombie”) but was still treated as a routable backend. fly-proxy/flycast kept sending client connections to it even though its pg/role check was critical, so writes landed on a read-only node. It appears to be flex/proxy behavior (I was connected via onth-database.flycast, 3-node topology, image v0.0.66).
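As a client-side stopgap while this is open, a connection can be checked for writability before it is used for writes. The sketch below assumes a read-only backend reports transaction_read_only = on, which is consistent with the PG::ReadOnlySqlTransaction errors we saw. SHOW transaction_read_only is a standard PostgreSQL setting; the connect callable and both helper names are hypothetical, and any DB-API style driver should fit. Our API is not Python, so treat this purely as an illustration of the check.

```python
import time


def is_writable(conn) -> bool:
    """Return True if new transactions on this connection are allowed to write."""
    cur = conn.cursor()
    try:
        # Reports "on" for a hot standby or when default_transaction_read_only is set.
        cur.execute("SHOW transaction_read_only")
        row = cur.fetchone()
    finally:
        cur.close()
    return row is not None and row[0] == "off"


def connect_writable(connect, attempts=5, delay=0.2):
    """Open connections until one is writable, closing the read-only ones.

    connect is a hypothetical zero-argument callable returning a DB-API
    connection, for example a wrapper around the driver's connect function
    pointed at the .flycast address.
    """
    for attempt in range(attempts):
        conn = connect()
        if is_writable(conn):
            return conn
        conn.close()
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError("no writable PostgreSQL backend after %d attempts" % attempts)
```

This only helps if a reconnect can land on a different backend; it would not have addressed the PG::ConnectionBad errors, and it does not replace the proxy honoring the critical health check.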