Postgres database primary region node (failed to connect) pg check failing

The leader db in my postgres cluster is stuck on an older version than the replica dbs.

Instances
```
ID         PROCESS  VERSION  REGION  DESIRED  STATUS                  HEALTH CHECKS                   RESTARTS  CREATED
c70******  app      16 ⇡     lax     run      running (replica)       3 total, 2 passing, 1 critical  0         42m17s ago
*********  app      16 ⇡     ewr     run      running (replica)       3 total, 2 passing, 1 critical  0         42m17s ago
*********  app      14       dfw     run      running (replica)       3 total, 2 passing, 1 critical  0         1h5m ago
*********  app      14       mia     run      running (failed to co)  3 total, 1 passing, 2 critical  0         1h5m ago
```
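For context, that listing came from the Fly CLI; something along these lines should reproduce it (the app name here is mine, and the checks subcommand gives the per-check detail):

```
fly status -a my-postgres-cluster
fly checks list -a my-postgres-cluster
```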

I am not sure what to do. I have tried restarting the DB and scaling it, but neither had any effect.
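For reference, the restart and scaling attempts were roughly these commands (sizes approximate; `fly scale memory` takes the value in MB):

```
fly postgres restart -a my-postgres-cluster
fly scale memory 8192 -a my-postgres-cluster
```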

This happened when I scaled from 8GB RAM to 32GB RAM and back to 8 after a couple of minutes. After about 30 minutes of waiting for the DB to pick itself up, I switched to 32GB RAM. After 10 more minutes, I switched to 31.

I think this switching is what caused my problem. Maybe I just have to wait now.

Log from the stuck VM:

2022-12-17T22:44:50Z app[*****] mia [info]exporter | ERRO[1117] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[IP_REMOVED]:PORT_REMOVED/postgres?sslmode=disable): dial tcp [****: connect: connection refused  source="postgres_exporter.go:1658"
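For anyone following along, logs for just the stuck region can be tailed with something like:

```
fly logs -a my-postgres-cluster -r mia
```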

How my problem was resolved.

The trick was to look at the volumes (which I did):

> fly volumes list -a my-postgres-cluster

```
ID          STATE    NAME  SIZE  REGION  ZONE  ENCRYPTED  ATTACHED VM  CREATED AT
vol_****    created  ***   xGB   mia     **    true       *****        2 months ago
vol_****    created  ***   xGB   dfw     **    true       *****        3 days ago
vol_****    created  ***   xGB   ewr     **    true       *****        1 week ago
vol_****    created  ***   xGB   mia     **    true                    2 months ago
vol_****    created  ***   xGB   lax     **    true       *****        2 months ago
```

and to realize that one of the volumes was not attached to a VM.
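If you want to inspect the suspect volume before changing anything, the volume ID from that listing can be passed to the CLI directly, something like this (ID redacted):

```
fly volumes show vol_****
```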

Even though I saw that a volume was unattached, I didn't think to simply let an instance exist for that volume. It seems obvious now, right?

So to fix it, I had to change my scale. Originally, I was using this:

fly scale count 4 --max-per-region=1 -a my-postgres-cluster

and I switched it to this:

fly scale count 5 --max-per-region=2 -a my-postgres-cluster

This allowed the duplicate-region volume to be accounted for: a new mia instance was created alongside the existing one, and the cluster could then heal itself.
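To double-check, re-running the status and volume listings afterwards should show a fifth instance in mia and no volume left without an attached VM:

```
fly status -a my-postgres-cluster
fly volumes list -a my-postgres-cluster
```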

I had to email Fly support to reach this understanding. Lessons learned the hard way.

PS: the reason I switched RAM sizes so rapidly was that I misread a graph in Grafana. I thought RAM was full, but used RAM was actually close to 0 and the graph was showing total RAM. Another lesson learned, the very hard way.
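A quick way to sanity-check what the dashboard is showing is to look at memory from inside the VM itself, assuming you can SSH into the app:

```
fly ssh console -a my-postgres-cluster
# then, inside the VM:
free -m
```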
