Postgres: no eligible masters

Every once in a while, one of my app’s databases on Fly.io stops responding. For Elixir apps, for example, we start seeing something like:

2022-06-30T20:56:06Z app[6c6493f5] dfw [info] Postgrex.Protocol (#PID<0.2215.0>) failed to connect: ** (DBConnection.ConnectionError) tcp recv (idle): closed

Restarting the database doesn’t fix this. Looking at the database logs, it’s almost always something along the lines of:

2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.129Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "d09b842c", "keeper": "12de116392"}
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.132Z	INFO	cmd/sentinel.go:995	master db is failed	{"db": "d09b842c", "keeper": "12de116392"}
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.132Z	INFO	cmd/sentinel.go:1006	trying to find a new master to replace failed master
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.132Z	INFO	cmd/sentinel.go:451	ignoring keeper since its behind that maximum xlog position	{"db": "c3e401f7", "dbXLogPos": 2821990992, "masterXLogPos": 11323727776}
2022-06-30T21:00:00Z app[35b6bdd4] dfw [info]sentinel | 2022-06-30T21:00:00.133Z	ERROR	cmd/sentinel.go:1009	no eligible masters

The output of `fly status` is:

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS            	HEALTH CHECKS                 	RESTARTS	CREATED
35b6bdd4	app    	8      	dfw   	stop   	running (replica) 	3 total, 1 passing, 2 critical	0       	6m41s ago

To fix this, I have to scale the DB app to 0, wait about a minute, and then scale it back to 1:

fly scale count 0 -a myapp-db
# wait a minute until `fly status` shows no instances
fly scale count 1 -a myapp-db

Note that I usually have just one instance of the database running, and have never attempted to change regions. This is completely random and is not linked to times when I manually restart/deploy the database.

This used to happen more often last year, but it just happened again today.

My theory is Fly restarts/redeploys the database every once in a while for whatever reason (maintenance or something else). As part of that, a new instance is added as a replica to the existing database cluster of 1, and the old primary is removed. But the new instance remains a replica and doesn’t switch to primary.

I ran into the same issue. Luckily for us it happened on our staging servers; we had to scale the Postgres VMs down and back up to fix it.

Overall, Fly Postgres seems to be having quite a few issues. In prod we get random DB connection errors from time to time, more frequently than expected.

Bumping! Can anyone from Fly chime in and let us know what we should do here?

Yes, I have another thread here with various DB errors we occasionally get: Postgres DBConnection and other errors on Fly.

The “failed master” message means Postgres crashed and can’t get started again. It’s probably repeating that message so often in the logs that you can’t really tell, though. The most common cause of this is an out-of-memory condition.
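If you want to confirm the memory theory, one way (a sketch; `fly logs` streams the app’s logs, though the exact out-of-memory wording in your logs may differ) is to filter the database app’s logs for OOM messages:

# stream the DB app's logs and filter for out-of-memory messages
# (exact phrasing varies; "Out of memory" and "oom" are common)
fly logs -a myapp-db | grep -iE "out of memory|oom"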

Scaling to zero and then back to one replaces the VM entirely. When a new VM comes up, we do a little cleanup to let it restart as master. You can also run `fly vm stop <id>` to do the same thing.
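For reference, that sequence looks something like this (a sketch; the instance ID comes from the `fly status` output, and I’m assuming `fly vm stop` accepts the usual `-a` app flag):

# find the instance ID of the stuck database VM
fly status -a myapp-db
# stop that VM so a fresh one replaces it
fly vm stop 35b6bdd4 -a myapp-db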

This is unlikely to be caused by us. It’s hard to troubleshoot a crashed postgres, but the first thing I’d do is check RAM size and then see if giving it 1GB or so fixes the problem.
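Concretely, that check-and-bump could look like this (a sketch; assuming `fly scale memory` takes the new size in MB, and reusing the DB app name from above):

# show the current VM size and memory
fly scale show -a myapp-db
# give the database 1GB of RAM
fly scale memory 1024 -a myapp-db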

We never add instances or replicas; these basically run full time. If they restart, we just bring them back up where they were.