The leader db in my postgres cluster is stuck on an older version than the replica dbs.
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
c70****** app 16 ⇡ lax run running (replica) 3 total, 2 passing, 1 critical 0 42m17s ago
********* app 16 ⇡ ewr run running (replica) 3 total, 2 passing, 1 critical 0 42m17s ago
********* app 14 dfw run running (replica) 3 total, 2 passing, 1 critical 0 1h5m ago
********* app 14 mia run running (failed to co) 3 total, 1 passing, 2 critical 0 1h5m ago
I am not sure what to do. I have tried restarting the db and scaling it. No effect.
This happened when I scaled from 8GB RAM to 32GB RAM and back to 8 after a couple of minutes. After about 30 minutes of waiting for DB to pick itself up, I switched to 32GB RAM. After 10 more minutes, I switched to 31.
I think switching this has caused my problem. Maybe I just have to wait now.
Log from the stuck VM:
2022-12-17T22:44:50Z app[*****] mia [info]exporter | ERRO Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[IP_REMOVED]:PORT_REMOVE/postgres?sslmode=disable): dial tcp [****: connect: connection refused source="postgres_exporter.go:1658"
How my problem was resolved.
The trick was to look at the volumes (which I did):
> fly volumes list -a my-postgres-cluster
ID STATE NAM SIZE REGION ZONE ENCRYPTED ATTACHED VM CREATED AT
vol_**** created *** xGB mia ** true ***** 2 months ago
vol_**** created *** xGB dfw ** true ***** 3 days ago
vol_**** created *** xGB ewr ** true ***** 1 week ago
vol_**** created *** xGB mia ** true 2 months ago
vol_**** created *** xGB lax ** true ***** 2 months ago
and to realize that one of the volumes was not attached to a VM.
Even though I saw that a volume was unattached I didn’t think to simply allow an instance to exist for that volume. It seems obvious now, right?
So to fix it, I had to change my
scale. Originally, I was using this:
fly scale count 4 --max-per-region=1 -a my-postgres-cluster
and I switched it to this:
fly scale count 5 --max-per-region=2 -a my-postgres-cluster
This allowed the duplicate-region volume to be accounted for. A new cluster
mia instance was created alongside the existing
mia instance. The cluster could then heal itself.
I had to email fly support to get this understanding. Lessons learned the hard way.
PS the reason I switched RAM rapidly was because I misread a graph on Grafana. I thought RAM was full but actually used RAM was close to 0 and the graph was showing total RAM. Another lesson learned, the very hard way.