panic: FLY_CONSUL_URL or CONSUL_URL are required with postgres-ha deploy

I’ve been able to partially fix this with the following steps:

  1. fly scale count 0. Note: because the volumes were not being mounted (as mentioned above), this lost everything, but I had the data backed up via wal-g.
  2. fly consul attach (see here)
  3. fly scale count 1
  4. The volume was mounted at this point (but empty), so I restored from the wal-g backup.

At this point I was stuck in a boot loop: it kept trying to update the database with the current OPERATOR_PASSWORD and failing because the database was in a read-only state. The machine was also identifying as a replica in the Fly UI. I fixed this by forcibly promoting it (thanks to this SO answer):

su stolon
/usr/lib/postgresql/14/bin/pg_ctl promote -D /data/postgres
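
To confirm the promotion took effect, one option is to ask PostgreSQL whether it is still in recovery. This is only a sketch: `is_primary` is a hypothetical helper name, and the connection arguments in the usage comment (port 5433 and the flypgadmin user, both taken from the logs further down) are assumptions about this particular setup.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the server reports it is NOT in recovery,
# i.e. it is accepting writes as a primary. All arguments are passed to psql.
is_primary() {
  # -A unaligned output, -t tuples only, -c run a single command;
  # pg_is_in_recovery() returns "f" on a primary and "t" on a standby.
  [ "$(psql "$@" -Atc 'SELECT pg_is_in_recovery();')" = "f" ]
}

# Usage (connection details are assumptions, adjust to your machine):
#   is_primary -h localhost -p 5433 -U flypgadmin postgres && echo "promoted"
```

If this still fails after pg_ctl promote, the server is still a standby and the read-only errors would be expected to continue.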

At this point I had a working leader but no redundancy, so I tried fly scale count 2, but the replica would again boot-loop on "checking stolon status". The DB on the replica doesn’t seem to be coming up, because

export $(cat /data/.env | xargs)
stolonctl status

would give me something like the following:

=== Keepers ===

UID             HEALTHY PG LISTENADDRESS                        PG HEALTHY      PG WANTEDGENERATION     PG CURRENTGENERATION
232f9d636332    true    fdaa:0:47b5:a7b:232:f6de:cf8c:2:5433    true            1                       0
233582125d22    false   (no db assigned)                        false           0                       0
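
For watching this while the replica retries, the unhealthy keepers can be pulled out of that output with a small filter. This is a rough sketch that assumes the "=== Keepers ===" table layout shown above (UID in the first column, HEALTHY in the second); `unhealthy_keepers` is a hypothetical helper name, not part of stolon.

```shell
#!/bin/sh
# Hypothetical helper: read `stolonctl status` output on stdin and print the
# UID of every keeper whose HEALTHY column is not "true".
unhealthy_keepers() {
  awk '
    /^=== / { in_keepers = ($2 == "Keepers"); next }  # remember which section we are in
    !in_keepers || NF == 0 || $1 == "UID" { next }     # skip other sections, blank lines, header
    $2 != "true" { print $1 }                          # column 2 is HEALTHY
  '
}

# Usage on a machine where the stolon environment is already loaded:
#   stolonctl status | unhealthy_keepers
```

Against the output above this should list only the second keeper, the one with no db assigned.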

Meanwhile, I am getting constant errors in the monitoring console:

2024-02-06T06:56:19.675 app[6e824532a22108] syd [info] exporter | INFO[0046] Established new database connection to "fdaa:0:47b5:a7b:233:5821:25d2:2:5433". source="postgres_exporter.go:970"

2024-02-06T06:56:20.141 app[6e824532a22108] syd [info] checking stolon status

2024-02-06T06:56:20.676 app[6e824532a22108] syd [info] exporter | ERRO[0047] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:47b5:a7b:233:5821:25d2:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:47b5:a7b:233:5821:25d2:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

2024-02-06T06:56:21.141 app[6e824532a22108] syd [info] checking stolon status

2024-02-06T06:56:22.142 app[6e824532a22108] syd [info] checking stolon status 

However, I can connect directly to the DB if I ssh into the failing machine and run psql, and the password from the leader works there, so some synchronisation is going on. I will probably give up on this soon and leave a single leader until I am ready to move to postgres-flex. However, creating that is giving me a 504 at the moment, but that’s another issue.
