Postgres Flex: A former primary DB machine came back as a zombie and kept receiving writes while read-only

Cluster: onth-database, postgres-flex, 3 machines (2× sin, 1× nrt), image flyio/postgres-flex:17.2 v0.0.66 at the time.

Impact: ~3h of intermittent write failures on our API (PG::ReadOnlySqlTransaction + PG::ConnectionBad); reads were ~unaffected.

Timeline (UTC, 2026-06-26):

  • 06:53 automatic failover: a standby was promoted to primary (timeline TL2→TL3, fork at LSN 2/DBC132A8).
  • 09:12 the former primary machine (7813632b595198, sin) was restarted (host event?) and came back as a zombie: zombie.lock present, logs looping Unable to confirm that we are the true primary / resolved primary ‘’. Its pg health check was critical.
  • The zombie kept receiving connections via .flycast despite that critical check, so writes hit a read-only node → PG::ReadOnlySqlTransaction. It was also looping Failed to restart haproxy on member …: 500 for all members, which appeared to destabilize the proxy layer cluster-wide (PG::ConnectionBad, SSL EOF) even toward the healthy primary.
  • ~09:50 I ran fly machine stop on the zombie → errors stopped immediately. I then cloned a fresh replica and destroyed the zombie. The cluster has since been updated to v0.1.0.
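
For anyone hitting the same thing: before destroying a former primary, it can be worth checking whether it wrote any WAL past the fork point, since those writes exist on no other node. A minimal sketch, assuming you have the old primary's last WAL position from somewhere (the second LSN in the usage lines is made up for illustration; only the fork LSN comes from the timeline above). lsn_to_int and wrote_past_fork are hypothetical helper names.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '2/DBC132A8' to a byte position.

    A pg_lsn is two hex numbers: the high 32 bits, a slash, the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wrote_past_fork(last_lsn: str, fork_lsn: str) -> bool:
    """True if the old primary's last position is beyond the fork point."""
    return lsn_to_int(last_lsn) > lsn_to_int(fork_lsn)


FORK_LSN = "2/DBC132A8"           # from the failover at 06:53
OLD_PRIMARY_LAST = "2/DBC13300"   # made-up value, for illustration only

if wrote_past_fork(OLD_PRIMARY_LAST, FORK_LSN):
    print("old primary has WAL the new timeline never saw")
```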

Hypothesis: a machine that was previously the primary got fenced/downgraded to read-only (“zombie”) but was still treated as a routable backend. fly-proxy/flycast kept sending client connections to it even though its pg/role check was critical, so writes landed on a read-only node. It seems to be a flex/proxy behavior (I was connected via onth-database.flycast, 3-node topology, image v0.0.66).
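
Independent of the routing question, a client can refuse to use a connection that landed on a read-only node. This is a minimal sketch, not something flex or flycast provides: is_writable is a hypothetical helper, and run_scalar stands in for whatever your driver uses to run a query and return a single value. It relies on two standard Postgres facts: SHOW transaction_read_only reports on for a read-only session, and pg_is_in_recovery() returns true on a standby.

```python
from typing import Any, Callable


def _is_false(value: Any) -> bool:
    """Normalize the different ways drivers report a false/off value."""
    return str(value).strip().lower() in {"f", "false", "off", "0"}


def is_writable(run_scalar: Callable[[str], Any]) -> bool:
    """Return True only if this session can accept writes.

    run_scalar is an assumption: any callable that executes one SQL
    statement on the current connection and returns the single result.
    """
    read_only = run_scalar("SHOW transaction_read_only")
    in_recovery = run_scalar("SELECT pg_is_in_recovery()")
    return _is_false(read_only) and _is_false(in_recovery)
```

Running this right after connecting (or when checking a connection out of a pool) and dropping the connection on False turns a later PG::ReadOnlySqlTransaction into an early reconnect. libpq's target_session_attrs=read-write connection parameter does a similar check at connect time; whether a retry through .flycast then lands on the real primary is not something I have verified.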

Hm… As I understand it, you would need all† 3 Machines to be in the same region (sin) in order to get the designed/intended clustering behavior:

https://community.fly.io/t/does-postgres-regional-failover-happen-automatically/12458/3

My guess is that the routing that you saw was a side effect, and that both sin Machines were unsure whether they were really the primary at this point (since it was a 1-1 tie in the voting(?)). I’m not one of the forum’s Postgres clustering experts, though, so I could be wrong here.
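
To spell out the tie: under a plain strict-majority rule (an assumption on my part; I have not checked that postgres-flex counts votes exactly this way), two in-region voters that disagree give 1 of 2, which is not a majority, while three in-region voters can still reach 2 of 3 with one member down or disagreeing. has_majority is a hypothetical helper for illustration.

```python
def has_majority(votes_for: int, voters: int) -> bool:
    """Strict majority: more than half of all voters must agree."""
    return votes_for > voters // 2


# Two sin Machines, each only sure of itself: a 1-1 tie, nobody wins.
print(has_majority(1, 2))
# Three sin Machines, one down: the remaining two can still agree.
print(has_majority(2, 3))
```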

Singapore does have Managed Postgres, so that would be the supported, and really the best, way to keep things like this from happening again.


†You can create remote replicas once you have a full three in the primary region. I.e., the nrt guy would be your fourth Machine.