I have a multi-region Postgres setup for my Rails app. It seemed to be working well until about 7 days ago, and now I’m getting some strange behaviour. It started with a burst of thousands of application exception reports, along the lines of:
- ActiveRecord::StatementInvalid PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
- ActiveRecord::ConnectionNotEstablished: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
- ActionView::Template::Error: PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly [Truncated]
- ActiveRecord::ConnectionNotEstablished: could not connect to server: Connection refused Is the server running on host "xxxx" and accepting TCP/IP connections on port 5433?
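As a stopgap I’ve been thinking about wrapping the most critical queries in a small retry helper, so a brief failover doesn’t bubble up as a 500. This is only a rough sketch of what I mean (the PgRetry module and with_pg_retry name are placeholders, nothing that exists in my app yet):

```ruby
# Rough sketch only: retry a block when the failure is a dropped/refused
# connection (the PG::ConnectionBad / ConnectionNotEstablished cases above).
module PgRetry
  RETRYABLE = [ActiveRecord::ConnectionNotEstablished, ActiveRecord::StatementInvalid].freeze

  def self.with_pg_retry(attempts: 3, wait: 0.5)
    tries = 0
    begin
      yield
    rescue *RETRYABLE => e
      # Only retry when the underlying cause really is a connection failure.
      connection_error = e.is_a?(ActiveRecord::ConnectionNotEstablished) || e.cause.is_a?(PG::ConnectionBad)
      raise unless connection_error && (tries += 1) < attempts

      ActiveRecord::Base.connection_pool.release_connection # hand the dead connection back to the pool
      sleep(wait * tries)
      retry
    end
  end
end

# e.g. PgRetry.with_pg_retry { User.find(id) }
```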
There was a huge surge in these exceptions over a short period. Looking at the Postgres metrics for that window after the fact, memory was fine but CPU was a little high. The VMs were dedicated 2x with 4GB RAM, and CPU was mostly under 1x load with some short spikes above 1x.
Since then there’s been a lower but steady volume of periodic exceptions reported by the app, and I’ve noticed what looks like repeated Postgres master re-election. Running "fly checks list --app my-app" reports the following critical issues:
- HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect with local node: context deadline exceeded"[✓]
- failed to connect to local node: context deadline exceeded[✓]
- HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/pg: 500 Internal Server Error Output: "[✓] transactions: readonly (445.44µs)\n[✗]replication: primary mismatch detected: current: "xxx.xxx.xxx.xxx", expected "yyy.yyy.yyy.yyy" (150.53µs)\n[✓]
This seems to be happening in all of the regions I’m running in (Singapore, Paris, and Dallas), and the cluster just continually re-elects a new master.
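In case it’s useful, this is the check I’m planning to run from a Rails console on one of the app VMs, to see which server the app is actually connected to and whether that node considers itself a replica (stock Postgres functions, nothing Fly-specific):

```ruby
# From a Rails console: ask the node we're actually connected to for its
# address/port and whether it's in recovery (i.e. acting as a replica).
row = ActiveRecord::Base.connection.select_one(<<~SQL)
  SELECT inet_server_addr()  AS server_addr,
         inet_server_port()  AS server_port,
         pg_is_in_recovery() AS in_recovery
SQL
puts "connected to #{row['server_addr']}:#{row['server_port']}, in_recovery=#{row['in_recovery']}"
```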
When the DB is otherwise healthy, I see a steady stream of repeated log lines like:
WARN cmd/sentinel.go:276 no keeper info available {"db": "9b058e2e", "keeper": "1a0fa222"}
before it eventually tries a new master election:
cdg [info] sentinel | 2022-06-14T09:03:46.435Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9b058e2e", "keeper": "1a0fa222"}
cdg [info] sentinel | 2022-06-14T09:03:46.435Z WARN cmd/sentinel.go:276 no keeper info available {"db": "e5a6bdcc", "keeper": "12de0fa232"}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:995 master db is failed {"db": "9b058e2e", "keeper": "1a0fa222"}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (--can-be-master=false) {"db": "aceaa353", "keeper": "ead0fa442"}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (--can-be-master=false) {"db": "7a01dd56", "keeper": "5bd40fa432"}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:1032 electing db as the new master {"db": "e5a6bdcc", "keeper": "12de0fa232"}
And then there’s a heap of other errors while it attempts to elect a new master.
A couple of hours ago I scaled the Postgres VMs up to dedicated 4x with 8GB RAM, but that doesn’t seem to have helped much (if at all).
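In case it helps with diagnosis, the next thing I was going to try is probing each Postgres instance directly over Fly’s internal DNS to see which node each one believes is the primary. This is only a sketch: the app name, database name and credentials are placeholders, and I’m assuming port 5433 because that’s the port showing up in the connection errors above.

```ruby
require "resolv"
require "pg"

# Enumerate every instance of the Postgres app via Fly's internal DNS
# ("my-app" is a placeholder) and ask each one whether it's in recovery.
addrs = Resolv::DNS.open do |dns|
  dns.getresources("my-app.internal", Resolv::DNS::Resource::IN::AAAA).map { |r| r.address.to_s }
end

addrs.each do |addr|
  begin
    conn = PG.connect(host: addr, port: 5433, dbname: "app_db",
                      user: "postgres", password: ENV["PG_PASSWORD"],
                      connect_timeout: 3)
    in_recovery = conn.exec("SELECT pg_is_in_recovery()").getvalue(0, 0)
    puts "#{addr}: #{in_recovery == 't' ? 'replica' : 'primary'}"
    conn.close
  rescue PG::Error => e
    puts "#{addr}: unreachable (#{e.class}: #{e.message.strip})"
  end
end
```

If more than one node answers "primary" there, that would at least line up with the "primary mismatch detected" check output above.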
Can someone help me understand what is going on here please?