I have two Postgres clusters on Fly, each with 2 VMs in a single region (sea). I haven't done any special configuration, just flyctl postgres create, and then I was off to the races (I love Fly).
Recently (starting earlier this week) I started having connection problems with both clusters. Connections just hang until I restart the VMs manually with flyctl restart. The associated volumes are fine. I'm connecting with the .internal addresses and get the same results whether my server is trying to connect or I'm connecting locally over WireGuard.
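For reference, this is roughly how I'm testing the connection over WireGuard (the app name and password here are placeholders):

psql postgres://postgres:PASSWORD@my-db-app.internal:5432/postgres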
I see a lot of critical health checks and restarts when I run flyctl status:
Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
8a178c3e 0 sea run running (rpc error: c) 3 total, 3 critical 4 2021-08-12T14:54:44Z
bff97a3f 0 sea run running (replica) 3 total, 3 passing 6 2021-08-12T00:47:13Z
The logs consist of the following messages repeating over and over; I'm not sure whether these are normal or not:
2021-08-14T16:06:25.097404391Z app[8a178c3e] sea [info] sentinel | 2021-08-14T16:02:14.561Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8e991ba6", "keeper": "ac4024872"}
2021-08-14T16:06:25.097417606Z app[8a178c3e] sea [info] keeper | 2021-08-14T16:02:15.871Z INFO cmd/keeper.go:1505 our db requested role is master
2021-08-14T16:06:25.097430727Z app[8a178c3e] sea [info] keeper | 2021-08-14T16:02:15.873Z INFO cmd/keeper.go:1543 already master
2021-08-14T16:06:25.097493740Z app[8a178c3e] sea [info] keeper | 2021-08-14T16:02:15.893Z INFO cmd/keeper.go:1676 postgres parameters not changed
2021-08-14T16:06:25.097500838Z app[8a178c3e] sea [info] keeper | 2021-08-14T16:02:15.894Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2021-08-14T16:06:25.097514737Z app[8a178c3e] sea [info] proxy | 2021-08-14T16:02:19.608Z INFO cmd/proxy.go:268 master address {"address": "[fdaa:0:22f0:a7b:ac2:0:2488:2]:5433"}
2021-08-14T16:06:25.097517812Z app[8a178c3e] sea [info] proxy | 2021-08-14T16:02:19.885Z INFO cmd/proxy.go:286 proxying to master address {"address": "[fdaa:0:22f0:a7b:ac2:0:2488:2]:5433"}
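In case it matters, I'm tailing these with flyctl logs (app name is a placeholder):

flyctl logs -a my-db-app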
Today the restarts stopped working as well: they either take a very long time or hang indefinitely. As I write this, I've been waiting more than 10 minutes for one of the clusters to restart.
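For completeness, this is the restart command that's now hanging (again, the app name is a placeholder):

flyctl restart my-db-app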
Any tips on troubleshooting this?