Postgres VM connections hang

I have two Postgres clusters with Fly, each with 2 VMs in a single region (sea). I haven’t done any special configuration, just flyctl postgres create and then I was off to the races (I love Fly).

Recently (starting earlier this week) I started having connection problems with both clusters. The connections just hang until I restart the VMs manually with flyctl restart. The associated volumes are fine. I’m connecting with the .internal addresses and get the same results whether my server is trying to connect or I’m connecting locally over WireGuard.
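In case it helps anyone reproduce this, here’s roughly how I’m probing the cluster. It’s a sketch: the hostname and credentials are placeholders, and it assumes psql plus GNU coreutils timeout(1) so a hung connection fails fast instead of blocking the terminal.

```shell
# Hedged sketch: probe a Postgres cluster with a hard deadline so a hung
# connection fails fast. Hostname and credentials are placeholders.
probe_pg() {
  if timeout 5 psql "host=$1 user=postgres dbname=postgres" -c 'select 1' >/dev/null 2>&1; then
    echo "ok"
  else
    echo "hung or refused"
  fi
}

probe_pg my-app-db.internal
```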

I see a lot of critical health checks and restarts when I run flyctl status:

8a178c3e 0       sea    run     running (rpc error: c) 3 total, 3 critical 4        2021-08-12T14:54:44Z
bff97a3f 0       sea    run     running (replica)      3 total, 3 passing  6        2021-08-12T00:47:13Z

The logs consist of the following messages over and over; I’m not sure whether these are normal:

2021-08-14T16:06:25.097404391Z app[8a178c3e] sea [info] sentinel | 2021-08-14T16:02:14.561Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "8e991ba6", "keeper": "ac4024872"}
2021-08-14T16:06:25.097417606Z app[8a178c3e] sea [info] keeper   | 2021-08-14T16:02:15.871Z	INFO	cmd/keeper.go:1505	our db requested role is master
2021-08-14T16:06:25.097430727Z app[8a178c3e] sea [info] keeper   | 2021-08-14T16:02:15.873Z	INFO	cmd/keeper.go:1543	already master
2021-08-14T16:06:25.097493740Z app[8a178c3e] sea [info] keeper   | 2021-08-14T16:02:15.893Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2021-08-14T16:06:25.097500838Z app[8a178c3e] sea [info] keeper   | 2021-08-14T16:02:15.894Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2021-08-14T16:06:25.097514737Z app[8a178c3e] sea [info] proxy    | 2021-08-14T16:02:19.608Z	INFO	cmd/proxy.go:268	master address	{"address": "[fdaa:0:22f0:a7b:ac2:0:2488:2]:5433"}
2021-08-14T16:06:25.097517812Z app[8a178c3e] sea [info] proxy    | 2021-08-14T16:02:19.885Z	INFO	cmd/proxy.go:286	proxying to master address	{"address": "[fdaa:0:22f0:a7b:ac2:0:2488:2]:5433"}

Today the restarts stopped working as well: either they take a very long time or they hang indefinitely. As I write this I’ve been waiting on a restart of one of the clusters for over 10 minutes.

Any tips on troubleshooting this?

We’re debugging the rpc errors; these seem to happen when the VM hangs. They might be related to RAM, since they mostly occur on 256MB instances.

The “best” fix is to run fly vm stop <id> on the offending instance. This is something we’re trying to automate but it’s a little finicky to get right, and we’d rather solve the bug.
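For anyone who wants a stopgap until the fix lands, a rough sketch of that automation: parse flyctl status for instances whose checks are critical and stop them. The app name is a placeholder, and the awk pattern assumes the column layout in the status output quoted above.

```shell
# Hedged sketch of a watchdog: stop any VM whose health checks report
# "critical" in `flyctl status` output. The app name is a placeholder
# and the parsing assumes the status layout shown earlier in the thread.
find_critical_vms() {
  # Print the first column (the VM ID) of any status line mentioning "critical".
  awk '/critical/ { print $1 }'
}

if command -v flyctl >/dev/null 2>&1; then
  flyctl status --app my-app-db | find_critical_vms | while read -r id; do
    fly vm stop "$id"   # the workaround described above
  done
fi
```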

Stopping helped, new instances are accepting connections at least for now. Thanks!

Re: RAM issues, I don’t think that’s my problem. Both clusters run 2 dedicated-cpu-1x VMs with 2GB of RAM and are currently barely used; we’re pre-launch with this project. Memory usage hovers around ~15% and I’m not seeing any spikes.
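For anyone wanting to check the same thing, one way to get that percentage is to run free -m inside the VM (e.g. from a shell opened with flyctl ssh console) and compute used/total. The helper below is a sketch; the field positions assume standard procps free output.

```shell
# Hedged sketch: compute used-memory percentage from `free -m` output
# (e.g. captured inside the VM via `flyctl ssh console`).
mem_pct() {
  # On the "Mem:" line of procps free output, $2 = total MB, $3 = used MB.
  awk '/^Mem:/ { printf "%d\n", $3 * 100 / $2 }'
}
```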

Is there a way I can stay up to date with progress on the bugfix?