I had our app set up with a leader and a replica in the same region, and this worked fine for many months. I’m not sure exactly when the issue started, but I believe it was around 7am Pacific time, when both machines decided that they were the leader. This put our app into a read-only state.
Here’s the app status at the time of failure:
ID              STATE    ROLE    REGION  CHECKS                          IMAGE                          CREATED               UPDATED
9e784575a09683  started  leader  sjc     3 total, 3 passing              flyio/postgres:14.6 (v0.0.41)  2023-02-16T18:02:53Z  2023-07-29T05:09:00Z
9185d56b445e83  started  leader  sjc     3 total, 2 passing, 1 critical  flyio/postgres:14.6 (v0.0.41)  2023-02-16T18:02:28Z  2023-08-31T16:04:36Z
I have since stopped one of the machines, so there is now only one leader and no failover replica, and the database is working again. I’ve emailed support directly, but thought I’d post here as well in case anyone else runs into the same issue or has any idea how this happened in the first place.
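In case it helps anyone who wants to catch this condition early, here is a minimal sketch of a check that flags more than one leader in status text like the one above. It only parses pasted text and assumes the same column layout (ID, STATE, ROLE, ...). The function names `find_leaders` and `has_multiple_leaders` are made up for illustration and are not part of any Fly.io tooling.

```python
import re

# Status text copied from the post above (two machines, both reporting "leader").
STATUS = """\
ID              STATE    ROLE    REGION  CHECKS                          IMAGE                          CREATED               UPDATED
9e784575a09683  started  leader  sjc     3 total, 3 passing              flyio/postgres:14.6 (v0.0.41)  2023-02-16T18:02:53Z  2023-07-29T05:09:00Z
9185d56b445e83  started  leader  sjc     3 total, 2 passing, 1 critical  flyio/postgres:14.6 (v0.0.41)  2023-02-16T18:02:28Z  2023-08-31T16:04:36Z
"""


def find_leaders(status_text):
    """Return the machine IDs whose ROLE column reads 'leader'.

    Assumption: each data row starts with a lowercase hex machine ID,
    followed by STATE and then ROLE, as in the table above.
    """
    leaders = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip the header and blank lines: only rows starting with a hex ID count.
        if len(fields) >= 3 and re.fullmatch(r"[0-9a-f]+", fields[0]):
            if fields[2] == "leader":
                leaders.append(fields[0])
    return leaders


def has_multiple_leaders(status_text):
    """True when more than one machine claims the leader role."""
    return len(find_leaders(status_text)) > 1


if __name__ == "__main__":
    leaders = find_leaders(STATUS)
    if has_multiple_leaders(STATUS):
        print("WARNING: multiple leaders:", ", ".join(leaders))
    else:
        print("OK: leaders =", leaders)
```

Something like this could run on a schedule against captured status output and raise an alert. It only detects the condition; it does not explain or fix it.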