I’ve been able to partially fix this with the following steps:
- fly scale count 0. Note: because of the aforementioned lack of volumes being mounted, this lost everything, but I had the data backed up via wal-g.
- fly consul attach (see here).
- fly scale count 1. The volume was mounted here (but empty), then I restored from the wal-g backup.
At this point I was stuck in a boot loop: it was trying to update the database with the current OPERATOR_PASSWORD but failing because the database was in a read-only state. It was also identifying as a replica in the Fly UI. I fixed this by forcibly promoting it (thanks to this SO answer):
su stolon
/usr/lib/postgresql/14/bin/pg_ctl promote -D /data/postgres
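A quick way to confirm that the promotion took effect is to ask Postgres whether it is still in recovery. This is only a minimal sketch: the is_primary helper name is made up for illustration, and it assumes psql can reach the server through the standard libpq environment variables (PGHOST, PGPORT, PGUSER, PGPASSWORD).

```shell
# Hypothetical helper, not part of the Fly or Stolon tooling.
# Exits 0 when the server reports it is NOT in recovery (i.e. it is a primary),
# and non-zero when it is still a read-only standby or the query fails.
# Connection settings are assumed to come from PGHOST/PGPORT/PGUSER/PGPASSWORD.
is_primary() {
  # -X: skip ~/.psqlrc, -A: unaligned output, -t: print tuples only
  [ "$(psql -XAt -c 'SELECT pg_is_in_recovery();')" = "f" ]
}
```

After the pg_ctl promote call, running is_primary && echo "promoted" should print the message only once the node has left recovery.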
At this point I had a working leader but no redundancy, so I attempted to fly scale count 2, but the replica would again boot-loop checking stolon. The DB doesn’t seem to be coming up, because
export $(cat /data/.env | xargs)
stolonctl status
would give me something like the following:
=== Keepers ===
UID           HEALTHY  PG LISTENADDRESS                      PG HEALTHY  PG WANTEDGENERATION  PG CURRENTGENERATION
232f9d636332  true     fdaa:0:47b5:a7b:232:f6de:cf8c:2:5433  true        1                    0
233582125d22  false    (no db assigned)                      false       0                    0
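To pick the unhealthy keepers out of that output without reading it by eye, something like the following may help. It is only a sketch: unhealthy_keepers is a hypothetical helper, and it assumes the column layout shown above (UID first, HEALTHY second) under a "=== Keepers ===" heading.

```shell
# Hypothetical helper: reads `stolonctl status` output on stdin and prints
# the UID of every keeper whose HEALTHY column is not "true".
# Assumes the layout shown above: a "=== Keepers ===" heading, a header row
# starting with "UID", then one whitespace-separated row per keeper.
unhealthy_keepers() {
  awk '
    # Section headings look like "=== Name ==="; track whether we are in Keepers.
    /^=== / { in_keepers = ($2 == "Keepers" && $3 == "==="); next }
    # Inside the Keepers section, skip blank lines and the header row.
    in_keepers && NF && $1 != "UID" && $2 != "true" { print $1 }
  '
}
```

It can be fed directly from the command used earlier, for example stolonctl status | unhealthy_keepers.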
Meanwhile, I am getting constant errors in the monitoring console:
2024-02-06T06:56:19.675 app[6e824532a22108] syd [info] exporter | INFO[0046] Established new database connection to "fdaa:0:47b5:a7b:233:5821:25d2:2:5433". source="postgres_exporter.go:970"
2024-02-06T06:56:20.141 app[6e824532a22108] syd [info] checking stolon status
2024-02-06T06:56:20.676 app[6e824532a22108] syd [info] exporter | ERRO[0047] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:47b5:a7b:233:5821:25d2:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:47b5:a7b:233:5821:25d2:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2024-02-06T06:56:21.141 app[6e824532a22108] syd [info] checking stolon status
2024-02-06T06:56:22.142 app[6e824532a22108] syd [info] checking stolon status
However, I can connect directly to the DB if I ssh into the failing machine and run psql, plus the password on the leader works, so there is some synchronisation going on. I will probably give up on this soon and leave a single leader until I am ready to move to postgres-flex. Creating that is giving me a 504 at the moment, but that’s another issue.