Troubleshooting failing managed pg?

I’ve now had a problem twice where our Postgres installation suddenly starts failing, without any configuration changes. One of its health checks fails with:

500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"

And nothing in the logs looks any different:

2022-04-01T14:00:01.197 app[A] sea [info]keeper   | 2022-04-01T14:00:01.196Z	INFO	cmd/keeper.go:1576	already standby
2022-04-01T14:00:01.215 app[A] sea [info]keeper   | 2022-04-01T14:00:01.214Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-04-01T14:00:01.215 app[A] sea [info]keeper   | 2022-04-01T14:00:01.215Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-04-01T14:00:01.979 app[A] sea [info]sentinel | 2022-04-01T14:00:01.979Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "b09ce767", "keeper": "ab20586b2"}
2022-04-01T14:00:01.980 app[A] sea [info]sentinel | 2022-04-01T14:00:01.979Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "ea67901a", "keeper": "2d30058692"}
2022-04-01T14:00:02.192 app[B] sea [info]keeper   | 2022-04-01T14:00:02.191Z	INFO	cmd/keeper.go:1505	our db requested role is master
2022-04-01T14:00:02.192 app[B] sea [info]keeper   | 2022-04-01T14:00:02.192Z	INFO	cmd/keeper.go:1543	already master
2022-04-01T14:00:02.209 app[B] sea [info]keeper   | 2022-04-01T14:00:02.208Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-04-01T14:00:02.209 app[B] sea [info]keeper   | 2022-04-01T14:00:02.209Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-04-01T14:00:06.434 app[A] sea [info]keeper   | 2022-04-01T14:00:06.434Z	INFO	cmd/keeper.go:1557	our db requested role is standby	{"followedDB": "437340bf"}
2022-04-01T14:00:06.434 app[A] sea [info]keeper   | 2022-04-01T14:00:06.434Z	INFO	cmd/keeper.go:1576	already standby
2022-04-01T14:00:06.453 app[A] sea [info]keeper   | 2022-04-01T14:00:06.453Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-04-01T14:00:06.454 app[A] sea [info]keeper   | 2022-04-01T14:00:06.453Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-04-01T14:00:07.326 app[B] sea [info]keeper   | 2022-04-01T14:00:07.325Z	INFO	cmd/keeper.go:1505	our db requested role is master
2022-04-01T14:00:07.327 app[B] sea [info]keeper   | 2022-04-01T14:00:07.326Z	INFO	cmd/keeper.go:1543	already master
2022-04-01T14:00:07.342 app[B] sea [info]keeper   | 2022-04-01T14:00:07.341Z	INFO	cmd/keeper.go:1676	postgres parameters not changed
2022-04-01T14:00:07.342 app[B] sea [info]keeper   | 2022-04-01T14:00:07.342Z	INFO	cmd/keeper.go:1703	postgres hba entries not changed
2022-04-01T14:00:07.886 app[A] sea [info]sentinel | 2022-04-01T14:00:07.885Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "b09ce767", "keeper": "ab20586b2"}
2022-04-01T14:00:07.886 app[A] sea [info]sentinel | 2022-04-01T14:00:07.885Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "ea67901a", "keeper": "2d30058692"}
2022-04-01T14:00:11.607 app[A] sea [info]keeper   | 2022-04-01T14:00:11.607Z	INFO	cmd/keeper.go:1576	already standby

Other than the failing check, the only hint I could find that anything was wrong was a gradually increasing replication lag on the metrics page.
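In case it helps anyone compare numbers: lag can also be computed in bytes from the sent vs. replay LSNs reported in pg_stat_replication on the primary. A small Go sketch (the LSN values here are made up for illustration):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lsnToUint64 converts a Postgres LSN like "16/B374D848" into an absolute
// byte position in the WAL: the part before the slash is the high 32 bits,
// the part after is the low 32 bits.
func lsnToUint64(lsn string) (uint64, error) {
	parts := strings.Split(lsn, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("malformed LSN: %q", lsn)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	// Hypothetical values: sent_lsn is where the primary is,
	// replay_lsn is where the standby has caught up to.
	sent, _ := lsnToUint64("16/B374D848")
	replay, _ := lsnToUint64("16/B3741000")
	fmt.Printf("replication lag: %d bytes\n", sent-replay)
	// prints: replication lag: 51272 bytes
}
```

A steadily growing gap between those two positions is the same signal the metrics page shows as replication lag.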

The only thing I could find in the forums was someone who had tried to scale their PG install from 2 → 1 instances, but in our case no changes were made at all, and there were no recent restarts.

So far, restarting the app has fixed the problem, but if it happens again, do you have any ideas what I should look into to troubleshoot it?

This is related to a network outage in Seattle yesterday. Do you remember when it last happened?

We’ve been chasing down the lingering effects of yesterday’s outage all day. I don’t think this specific problem will happen again.