Repeated Postgres master election

iamliamnorton · June 14, 2022, 9:11am

I have a multi-region postgres setup for my Rails app. I seemed to be working well until about 7 days ago and now I’m getting some strange behaviour. It started with a burst of 1000’s of application exception reports, along the lines of:

ActiveRecord::StatementInvalid PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
ActiveRecord::ConnectionNotEstablished: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
ActionView::Template::Error: PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly [Truncated]
ActiveRecord::ConnectionNotEstablished: could not connect to server: Connection refused Is the server running on host “xxxx” and accepting TCP/IP connections on port 5433?

There was a huge surge in these exceptions during a short period. Looking at the Postgres metrics after the fact during this time period looked like memory was fine but CPU was looking a little high. It was running dedicated 2x with 4GB RAM and CPU was mostly under 1x load but some short spikes over 1x.

Since then there’s been lower volume but steady, periodic exceptions reported by the app. I’ve noticed what looks like repeated postgres master re-election. Critical issues reported by running “fly checks list --app my-app” show the following issues:

HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/pg: 500 Internal Server Error Output: “failed to connect with local node: context deadline exceeded”[✓]
failed to connect to local node: context deadline exceeded[✓]
HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/pg: 500 Internal Server Error Output: "[✓] transactions: readonly (445.44µs)\n[✗]replication: primary mismatch detected: current: "xxx.xxx.xxx.xxx", expected "yyy.yyy.yyy.yyy" (150.53µs)\n[✓]

This seems to be happening in all regions I’m running in (Singapore, Paris, and Dallas) and just continually re-elects a new master.

When the DB is healthy, in the logs, I see a heap of sequential logs like:

WARN cmd/sentinel.go:276 no keeper info available {“db”: “9b058e2e”, “keeper”: “1a0fa222”}"

Before eventually trying a new master election…

cdg [info] sentinel | 2022-06-14T09:03:46.435Z WARN cmd/sentinel.go:276 no keeper info available {“db”: “9b058e2e”, “keeper”: “1a0fa222”}
cdg [info] sentinel | 2022-06-14T09:03:46.435Z WARN cmd/sentinel.go:276 no keeper info available {“db”: “e5a6bdcc”, “keeper”: “12de0fa232”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:995 master db is failed {“db”: “9b058e2e”, “keeper”: “1a0fa222”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (–can-be-master=false) {“db”: “aceaa353”, “keeper”: “ead0fa442”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (–can-be-master=false) {“db”: “7a01dd56”, “keeper”: “5bd40fa432”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:1032 electing db as the new master {“db”: “e5a6bdcc”, “keeper”: “12de0fa232”}

And then a heap of other errors while a new master is attempted to be re-elected.

A couple of hours ago, I scaled up the postgres VMs to dedicated 4x with 8GB RAM and that doesn’t seem to have helped much (if at all).

Can someone help me understand what is going on here please?

teamon · June 14, 2022, 10:29am

We are observing similar errors that started a week ago.

db logs:

2022-06-14T10:27:47Z app[c00d6e7f] iad [info]sentinel | 2022-06-14T10:27:47.317Z        WARN    cmd/sentinel.go:276     no keeper info available        {"db": "04fe1d40", "keeper": "775b0da4e2"}
...
2022-06-14T10:28:29Z app[60ac9f0a] iad [info]sentinel | 2022-06-14T10:28:29.292Z        ERROR   cmd/sentinel.go:102     election loop error     {"error": "failed to acquire lock: Unexpected response code: 500 (rpc error making call: invalid session \"bc5caf3a-8796-a077-a6d9-b525a220dd79\")"}

kurt · June 14, 2022, 1:12pm

We’re looking at these. The logs make it look like something is shutting them down. Have you all made any config changes to your postgres recently? I’m guessing not, but I’d like to be sure.

kurt · June 14, 2022, 2:08pm

We just found a bug that could have been triggering this for some postgres clusters in the US. We think we might have fixed it, can you keep watching and let us know if this improves?

teamon · June 14, 2022, 3:02pm

I’ve bumped RAM yesterday but it didn’t fix it. Prior to that the last change was 2 months ago.
I’ll keep watching logs/errors and let you know, thanks!

iamliamnorton · June 19, 2022, 2:30am

The original issue seems to have been resolved since your bug fix. Any more information you can share about what the bug/fix was?

There seems to be an issue with the health check for the master still though. Unsure whether it’s related.

The master reports a health check as criitical, but it looks a little off.

vm | critical | 03942d17 | dfw | HTTP | 1m7s ago

HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/vm: 500 Internal Server Error Output: “[✓]
checkDisk: 45.65 GB (93.2%!)(MISSING) free space on /data/ (34.27µs)\n[✓]
checkLoad: load averages: 1.12 1.00 1.02 (57.12µs)\n[✓]
memory: system spent 0s of the last 60s waiting on memory (32.32µs)\n[✗]cpu: system spent 1.61s of the last 10 seconds waiting on cpu (24.43µs)\n[✓]
io: system spent 0s of the last 60s waiting on io (21.94µs)”[✓]

There is an unusual 500 error being reported when the health check is hitting the /flycheck/vm endpoint, and the memory check seems to fail – but it’s not obvious to me why.

Whistler · June 19, 2022, 8:28am

On the subject of HTTP health checks, which may (TBC) be related to part of your query (the 500 Internal Server Error)…

I have a custom HTTP health check configured on an app and it appears that if it fails once it becomes - and remains - “critical”, regardless of the current state of the health check. It doesn’t seem to reset back from critical, even though I can subsequently see HTTP 200s for the health checks (and a tick next to it in flyctl checks list). This may only be a reporting, or my misunderstanding the output, issue but I’m not sure if HTTP health check status can yet be relied upon for the purpose of auto-restarting apps.

That is, if/when the CPU issue is resolved - you may still find the status as “critical” due to the HTTP check (even if the health check is passing - TBC).

iamliamnorton · June 19, 2022, 10:35am

My last comment is an extension of my issue of the postgres cluster being unhealthy and continually restarting. The restarting has been fixed but the health check doesn’t seem correct.

That’s an interesting observation that subsequent 200s don’t “fix” the critical status. It would be great to hear from Fly about why that is and if there’s a problem there.

My remaining issue with health checks is that I’m getting a 500 / critical from my writer/master postgres instance which I can’t understand. The failing check looks to be from this line:

memory: system spent 0s of the last 60s waiting on memory (29.4µs)\n[✗]cpu: system spent 2.48s of the last 10 seconds waiting on cpu (22.1µs)\n[✓]

So there’s no new line / line break so it’s a bit confusing, but the way I read that is the memory check is failing as each line from the command “fly checks list --app my-postgres” is suffixed with the cross/tick… but the info related with the failing memory check doesn’t indicate an issue.

I’d love to hear from Fly about what is going on here.

Whistler · June 19, 2022, 11:21am

Re: “I’m getting a 500 / critical from my writer/master postgres instance which I can’t understand” - have you observed a HTTP 500 by (e.g.) trying the health-check locally/within the db’s ssh console (flyctl ssh console --app <db name> --select, then curl -v <URL from instance’s health-check>)?

In my case (and possibly yours) it is, with HTTP health checks (which I think is what you’re looking at), just a reporting problem - your HTTP 500 problem may not (TBC) currently exist. It could be that the flyctl checks list status HTTP check’s status is “stuck”/not updating correctly.

My flyctl checks lists shows a HTTP health check as “critical” and the OUTPUT is showing what I assume to be the previous error (with a tick at the end of it), which I guess was seen at the LAST UPDATED time. The status code I see in OUTPUT does not reflect the current state of the health check - your db’s HTTP health check may not be returning a HTTP 500.

danwetherald · September 19, 2024, 4:16am

We are still seeing these issues as well, we have seen these errors on multiple projects over the last few years. Any update from the fly team?

Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.
[✗] cpu: system spent 2.85s of the last 10 seconds waiting on cpu (64.01µs)

Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.
[✗] cpu: system spent 1.37s of the last 10 seconds waiting on cpu (32.04µs)
[✗] io: system spent 1.7s of the last 10 seconds waiting on io (16.57µs)

Elder · September 19, 2024, 5:57am

danwetherald:

Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.
[✗] cpu: system spent 2.85s of the last 10 seconds waiting on cpu (64.01µs)

hey @danwetherald

this looks like your postgres is trying to process more queries than it can handle.

There are a few options:

Upgrade your instance (better CPU, more RAM)
Send read queries to read only replicas for load balancing
Optimize your queries

This will log queries that take more than a second (1000 milliseconds) to run:
fly postgres config -a <app-db> update --log-min-duration-statement 1000

You’ll see long queries in the db log stream. I’d fix those and then set this param to 500 milliseconds to catch queries that take more than 0.5 seconds

jaeyson · October 23, 2024, 12:23pm

I too experienced the same thing like @teamon and OP. It’s returning DBConnection.ConnectionError in fly live logs:

and I didn’t encounter this few days/weeks/months ago.

Topic		Replies	Views
PQconsumeInput() server closed the connection unexpectedly Questions / Help postgres	3	1500	July 14, 2023
Losing Postgres connection on deploy of Rails app	3	598	September 14, 2023
Recurring error since latest service interruption incident Questions / Help postgres , rails	6	94	December 6, 2024
Getting PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly in a Rails Sidekiq job Questions / Help postgres	3	1338	February 16, 2024
Unable to reach postgres instance postgres , proxy	31	418	October 24, 2024

Repeated Postgres master election

Related topics