I’ve run into a problem twice now where our Postgres installation suddenly starts failing, without any configuration changes on our side. One of the health checks fails with:
500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"
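As far as I can tell, the check is essentially a TCP connect that gives up when its deadline expires. For anyone reproducing this, here's a minimal sketch of the same kind of probe in Python (the hostname in the comment is a placeholder, not our actual config):

```python
import socket

def probe(host, port, timeout=5.0):
    """Try a plain TCP connect; return None on success, or an error string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # raised on timeouts and refused/unreachable hosts
        return f"failed to connect: {exc}"

# Hypothetical usage -- placeholder hostname, not our real app:
# print(probe("my-pg-app.internal", 5432))
```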
And nothing in the logs looks any different:
2022-04-01T14:00:01.197 app[A] sea [info]keeper | 2022-04-01T14:00:01.196Z INFO cmd/keeper.go:1576 already standby
2022-04-01T14:00:01.215 app[A] sea [info]keeper | 2022-04-01T14:00:01.214Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-04-01T14:00:01.215 app[A] sea [info]keeper | 2022-04-01T14:00:01.215Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2022-04-01T14:00:01.979 app[A] sea [info]sentinel | 2022-04-01T14:00:01.979Z WARN cmd/sentinel.go:276 no keeper info available {"db": "b09ce767", "keeper": "ab20586b2"}
2022-04-01T14:00:01.980 app[A] sea [info]sentinel | 2022-04-01T14:00:01.979Z WARN cmd/sentinel.go:276 no keeper info available {"db": "ea67901a", "keeper": "2d30058692"}
2022-04-01T14:00:02.192 app[B] sea [info]keeper | 2022-04-01T14:00:02.191Z INFO cmd/keeper.go:1505 our db requested role is master
2022-04-01T14:00:02.192 app[B] sea [info]keeper | 2022-04-01T14:00:02.192Z INFO cmd/keeper.go:1543 already master
2022-04-01T14:00:02.209 app[B] sea [info]keeper | 2022-04-01T14:00:02.208Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-04-01T14:00:02.209 app[B] sea [info]keeper | 2022-04-01T14:00:02.209Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2022-04-01T14:00:06.434 app[A] sea [info]keeper | 2022-04-01T14:00:06.434Z INFO cmd/keeper.go:1557 our db requested role is standby {"followedDB": "437340bf"}
2022-04-01T14:00:06.434 app[A] sea [info]keeper | 2022-04-01T14:00:06.434Z INFO cmd/keeper.go:1576 already standby
2022-04-01T14:00:06.453 app[A] sea [info]keeper | 2022-04-01T14:00:06.453Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-04-01T14:00:06.454 app[A] sea [info]keeper | 2022-04-01T14:00:06.453Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2022-04-01T14:00:07.326 app[B] sea [info]keeper | 2022-04-01T14:00:07.325Z INFO cmd/keeper.go:1505 our db requested role is master
2022-04-01T14:00:07.327 app[B] sea [info]keeper | 2022-04-01T14:00:07.326Z INFO cmd/keeper.go:1543 already master
2022-04-01T14:00:07.342 app[B] sea [info]keeper | 2022-04-01T14:00:07.341Z INFO cmd/keeper.go:1676 postgres parameters not changed
2022-04-01T14:00:07.342 app[B] sea [info]keeper | 2022-04-01T14:00:07.342Z INFO cmd/keeper.go:1703 postgres hba entries not changed
2022-04-01T14:00:07.886 app[A] sea [info]sentinel | 2022-04-01T14:00:07.885Z WARN cmd/sentinel.go:276 no keeper info available {"db": "b09ce767", "keeper": "ab20586b2"}
2022-04-01T14:00:07.886 app[A] sea [info]sentinel | 2022-04-01T14:00:07.885Z WARN cmd/sentinel.go:276 no keeper info available {"db": "ea67901a", "keeper": "2d30058692"}
2022-04-01T14:00:11.607 app[A] sea [info]keeper | 2022-04-01T14:00:11.607Z INFO cmd/keeper.go:1576 already standby
…our db requested role is standby {"followedDB": "437340bf"}
Other than the failing check, the only hint I could find that anything was wrong was a gradually increasing replication lag on the metrics page.
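Next time, I plan to measure the lag directly instead of relying on the dashboard. Assuming I can get a psql session on each node, these standard Postgres queries should show it (a sketch, not verified against our setup):

```sql
-- On the primary: how far behind each replica's replay position is (Postgres 10+).
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the replica: wall-clock time since the last replayed transaction.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```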
The only similar report I could find in the forums was from someone who had tried to scale their PG install from 2 → 1 instances, but in our case no changes were made at all, and there have been no recent restarts.
So far, restarting the app has fixed the problem, but if this happens again, do you have any ideas about what I should look into to troubleshoot it?