panic: FLY_CONSUL_URL or CONSUL_URL are required with postgres-ha deploy

Hi there,

I recently realised that my postgres-ha app hasn’t had any volumes attached since I transitioned from Nomad to the Machines architecture, and I am trying to remedy that.

I’ve verified I have the appropriate mount:

   source = 'pg_data_machines'       
   destination = '/data'
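For context, that sits in the mounts section of my fly.toml; sketched here with the section header included (the [mounts] table is the usual placement, the values are mine):

```toml
# fly.toml excerpt -- [mounts] header assumed as the standard placement;
# source/destination are the actual values from my config
[mounts]
  source = 'pg_data_machines'
  destination = '/data'
```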

And matching volumes:

➜ fly volumes list -a orchestrate-db
ID                      STATE   NAME                    SIZE    REGION  ZONE    ENCRYPTED       ATTACHED VM     CREATED AT     
vol_7vzoxd2pk6n671q4    created pg_data_machines        3GB     syd     8fbc    true                            39 minutes ago
vol_q4q1md8kk5nme1gv    created pg_data_machines        3GB     syd     843a    true                            39 minutes ago

However, when I run fly deploy to get a machine with the attached volume, FLY_CONSUL_URL doesn’t seem to be set:

Updating existing machines in 'orchestrate-db' with rolling strategy                                                        
Smoke checks for 6e824532a20758 failed: the app appears to be crashing                                                      
Check its logs: here's the last lines below, or run 'fly logs -i 6e824532a20758':                                           
  Pulling container image                                                          
  Successfully prepared image (1.192850135s)                                       
  Setting up volume 'pg_data_machines'                                                                                      
  Opening encrypted volume                                                                                                  
  Configuring firecracker                                                                                                   
  [    0.035372] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB at
  [    0.038879] PCI: Fatal: No config space access function found                                                          
   INFO Starting init (commit: bfa79be)...                                                                                  
   INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755                                                         
   INFO Resized /data to 3204448256 bytes                                                                                   
   INFO Preparing to run: ` start` as root                                                              
   INFO [fly api proxy] listening at /.fly/api                                                                              
  2024/02/02 00:10:25 listening on [fdaa:0:47b5:a7b:233:be1b:1ef5:2]:22 (DNS: [fdaa::3]:53)                                 
  Machine created and started in 3.307s                                                                                     
  panic: FLY_CONSUL_URL or CONSUL_URL are required                                                                          
  goroutine 1 [running]:                                                                                                    
        /go/src/ +0x1c13                                            
   INFO Main child exited normally with code: 2                                                                             
   INFO Starting clean up.                                                                                                  
   INFO Umounting /dev/vdb from /data                                                                                       
   WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)                                                             
  2024/02/02 00:10:26 listening on [fdaa:0:47b5:a7b:233:be1b:1ef5:2]:22 (DNS: [fdaa::3]:53)                                 
  [    2.244624] reboot: Restarting system
  machine did not have a restart policy, defaulting to restart

Consul should be enabled, since my config includes:

  enable_consul = true
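For completeness, that flag sits under the experimental section of my fly.toml; a sketch of the placement as I understand it for the legacy postgres-ha setup:

```toml
# fly.toml excerpt; [experimental] is where enable_consul lives in my
# config, as I understand the legacy postgres-ha convention
[experimental]
  enable_consul = true
```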

So I am unsure what is going on. Short of recreating the cluster, is there anything I’m doing wrong? (I plan to move to postgres-flex sometime soon, but want this fixed in the meantime!)



I’ve been able to partially fix this with the following steps:

  1. fly scale count 0. Note: because of the aforementioned lack of mounted volumes, this lost everything, but I had the data backed up via wal-g.
  2. fly consul attach (see here)
  3. fly scale count 1
  4. The volume was mounted this time (but empty), and I then restored from the wal-g backup.

At this point I was stuck in a boot loop: the app was trying to update the database with the current OPERATOR_PASSWORD but failing because the database was in a read-only state. It was also identifying as a replica in the Fly UI. I fixed this by forcibly promoting the node (thanks to this SO answer):

su stolon
# Promote the standby out of recovery so it accepts writes
/usr/lib/postgresql/14/bin/pg_ctl promote -D /data/postgres

At this point I had a working leader but no redundancy, so I attempted fly scale count 2; however, the replica would again boot-loop on “checking stolon”. The DB doesn’t seem to be coming up, because

export $(cat /data/.env | xargs)
stolonctl status

would give me something like the following:

=== Keepers ===

232f9d636332    true    fdaa:0:47b5:a7b:232:f6de:cf8c:2:5433    true            1                       0
233582125d22    false   (no db assigned)        false   0       0
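As an aside, export $(cat /data/.env | xargs) falls over if any value in the file contains spaces; a safer sketch using the shell’s allexport mode (the demo file path and variable values below are made up for illustration):

```shell
# Write a demo env file standing in for /data/.env (contents hypothetical)
cat > /tmp/demo.env <<'EOF'
STOLONCTL_CLUSTER_NAME=orchestrate-db
STKEEPER_PG_SU_PASSWORD='p@ss word'
EOF

# `set -a` auto-exports every variable assigned while it is in effect,
# so sourcing the file keeps quoted values with spaces intact
set -a
. /tmp/demo.env
set +a

echo "$STOLONCTL_CLUSTER_NAME"    # orchestrate-db
echo "$STKEEPER_PG_SU_PASSWORD"   # p@ss word
```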

Meanwhile, I am getting constant errors in the monitoring console:

2024-02-06T06:56:19.675 app[6e824532a22108] syd [info] exporter | INFO[0046] Established new database connection to "fdaa:0:47b5:a7b:233:5821:25d2:2:5433". source="postgres_exporter.go:970"
2024-02-06T06:56:20.141 app[6e824532a22108] syd [info] checking stolon status
2024-02-06T06:56:20.676 app[6e824532a22108] syd [info] exporter | ERRO[0047] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:47b5:a7b:233:5821:25d2:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:47b5:a7b:233:5821:25d2:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2024-02-06T06:56:21.141 app[6e824532a22108] syd [info] checking stolon status
2024-02-06T06:56:22.142 app[6e824532a22108] syd [info] checking stolon status

However, I can connect directly to the DB if I SSH into the failing machine and run psql, and the password set on the leader works there too, so some synchronisation is happening. I will probably give up on this soon and leave a single leader until I am ready to move to postgres-flex. (Creating that is giving me a 504 at the moment, but that’s another issue.)
