Postgres database primary region node (failed to connect) pg check failing

The leader db in my postgres cluster is stuck on an older version than the replica dbs.

Instances
```
ID         PROCESS  VERSION  REGION  DESIRED  STATUS                  HEALTH CHECKS                   RESTARTS  CREATED
c70******  app      16 ⇡     lax     run      running (replica)       3 total, 2 passing, 1 critical  0         42m17s ago
*********  app      16 ⇡     ewr     run      running (replica)       3 total, 2 passing, 1 critical  0         42m17s ago
*********  app      14       dfw     run      running (replica)       3 total, 2 passing, 1 critical  0         1h5m ago
*********  app      14       mia     run      running (failed to co)  3 total, 1 passing, 2 critical  0         1h5m ago
```
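For context, that listing came from the Fly CLI; something along these lines should reproduce it (the app name here is mine, and the checks subcommand gives the per-check detail):

```
fly status -a my-postgres-cluster
fly checks list -a my-postgres-cluster
```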

I am not sure what to do. I have tried restarting the DB and scaling it, but neither had any effect.
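For reference, the restart and scaling attempts were roughly these commands (sizes approximate; `fly scale memory` takes the value in MB):

```
fly postgres restart -a my-postgres-cluster
fly scale memory 8192 -a my-postgres-cluster
```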

This happened when I scaled from 8GB RAM to 32GB RAM and back to 8 after a couple of minutes. After about 30 minutes of waiting for the DB to pick itself up, I switched to 32GB RAM. After 10 more minutes, I switched to 31.

I think this switching is what caused my problem. Maybe I just have to wait now.

Log from the stuck VM:

2022-12-17T22:44:50Z app[*****] mia [info]exporter | ERRO[1117] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[IP_REMOVED]:PORT_REMOVED/postgres?sslmode=disable): dial tcp [****: connect: connection refused  source="postgres_exporter.go:1658"
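For anyone following along, logs for just the stuck region can be tailed with something like:

```
fly logs -a my-postgres-cluster -r mia
```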

How my problem was resolved.

The trick was to look at the volumes (which I did):

> fly volumes list -a my-postgres-cluster

```
ID          STATE    NAME  SIZE  REGION  ZONE  ENCRYPTED  ATTACHED VM  CREATED AT
vol_****    created  ***   xGB   mia     **    true       *****        2 months ago
vol_****    created  ***   xGB   dfw     **    true       *****        3 days ago
vol_****    created  ***   xGB   ewr     **    true       *****        1 week ago
vol_****    created  ***   xGB   mia     **    true                    2 months ago
vol_****    created  ***   xGB   lax     **    true       *****        2 months ago
```

and to realize that one of the volumes was not attached to a VM.
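If you want to inspect the suspect volume before changing anything, the volume ID from that listing can be passed to the CLI directly, something like this (ID redacted):

```
fly volumes show vol_****
```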

Even though I saw that a volume was unattached, I didn't think to simply let an instance exist for that volume. It seems obvious now, right?

So to fix it, I had to change my scale. Originally, I was using this:

fly scale count 4 --max-per-region=1 -a my-postgres-cluster

and I switched it to this:

fly scale count 5 --max-per-region=2 -a my-postgres-cluster

This allowed the duplicate-region volume to be accounted for: a new mia instance was created alongside the existing one, and the cluster could then heal itself.
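To double-check, re-running the status and volume listings afterwards should show a fifth instance in mia and no volume left without an attached VM:

```
fly status -a my-postgres-cluster
fly volumes list -a my-postgres-cluster
```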

I had to email Fly support to reach this understanding. Lessons learned the hard way.

PS: the reason I switched RAM sizes so rapidly was that I misread a graph in Grafana. I thought RAM was full, but used RAM was actually close to 0 and the graph was showing total RAM. Another lesson learned, the very hard way.
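A quick way to sanity-check what the dashboard is showing is to look at memory from inside the VM itself, assuming you can SSH into the app:

```
fly ssh console -a my-postgres-cluster
# then, inside the VM:
free -m
```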
