Every once in a while, one of my app’s databases on Fly.io stops responding. For Elixir apps, for example, we start seeing something like:
2022-06-30T20:56:06Z app[6c6493f5] dfw [info] Postgrex.Protocol (#PID<0.2215.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv (idle): closed
Restarting the database doesn’t fix this. Looking at the database logs, it’s almost always something along the lines of:
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.129Z WARN cmd/sentinel.go:276 no keeper info available {"db": "d09b842c", "keeper": "12de116392"}
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.132Z INFO cmd/sentinel.go:995 master db is failed {"db": "d09b842c", "keeper": "12de116392"}
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.132Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.132Z INFO cmd/sentinel.go:451 ignoring keeper since its behind that maximum xlog position {"db": "c3e401f7", "dbXLogPos": 2821990992, "masterXLogPos": 11323727776}
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.133Z ERROR cmd/sentinel.go:1009 no eligible masters
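Reading those lines: the sentinel has lost contact with the old master and then refuses to promote the one surviving keeper because its WAL (“xlog”) position is behind the last position recorded for the master, so the election ends with “no eligible masters” and the cluster is left with no primary at all.
If you want to confirm how far behind the replica actually is before recovering, something like this should work (just a sketch: `fly ssh console` is real, but the psql user/connection details on Fly’s Postgres image are an assumption on my part):
# open a shell on the DB VM (app name assumed to be myapp-db)
fly ssh console -a myapp-db
# then, inside the VM -- the user/socket may differ on your image:
psql -U postgres -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
pg_last_wal_replay_lsn() is available on Postgres 10+, and the dbXLogPos/masterXLogPos values in the sentinel log appear to be these same positions written out as integers.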
The result of `fly status` is:
Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED
35b6bdd4 app     8       dfw    stop    running (replica) 3 total, 1 passing, 2 critical 0        6m41s ago
To fix this, I have to scale the DB app to 0, wait about a minute, and then scale it back to 1:
fly scale count 0 -a myapp-db
# wait a min until `fly status` shows no instances
fly scale count 1 -a myapp-db
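For what it’s worth, the whole dance can be scripted so you don’t have to babysit `fly status`. A rough, untested sketch (the grep for “running” assumes the instance table format shown above, and myapp-db is still a stand-in for the real app name):
APP=myapp-db
fly scale count 0 -a "$APP"
# poll until fly status no longer lists a running instance
while fly status -a "$APP" | grep -q running; do
  sleep 10
done
fly scale count 1 -a "$APP"
This saves the manual waiting, but obviously it doesn’t address why the new instance never gets promoted in the first place.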
Note that I usually have just one instance of the database running, and have never attempted to change regions. This is completely random and is not linked to times when I manually restart/deploy the database.
This used to happen more often last year, but it just happened again today.
My theory is that Fly restarts/redeploys the database every once in a while for whatever reason (maintenance or something else). As part of that, a new instance is added as a replica to the existing single-instance cluster and the old primary is removed, but the new instance stays a replica and is never promoted to primary.