Repeated Postgres master election

I have a multi-region postgres setup for my Rails app. I seemed to be working well until about 7 days ago and now I’m getting some strange behaviour. It started with a burst of 1000’s of application exception reports, along the lines of:

  • ActiveRecord::StatementInvalid PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
  • ActiveRecord::ConnectionNotEstablished: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
  • ActionView::Template::Error: PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly [Truncated]
  • ActiveRecord::ConnectionNotEstablished: could not connect to server: Connection refused Is the server running on host “xxxx” and accepting TCP/IP connections on port 5433?

There was a huge surge in these exceptions during a short period. Looking at the Postgres metrics after the fact during this time period looked like memory was fine but CPU was looking a little high. It was running dedicated 2x with 4GB RAM and CPU was mostly under 1x load but some short spikes over 1x.

Since then there’s been lower volume but steady, periodic exceptions reported by the app. I’ve noticed what looks like repeated postgres master re-election. Critical issues reported by running “fly checks list --app my-app” show the following issues:

  • HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/pg: 500 Internal Server Error Output: “failed to connect with local node: context deadline exceeded”[✓]
  • failed to connect to local node: context deadline exceeded[✓]
    HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/pg: 500 Internal Server Error Output: "[✓] transactions: readonly (445.44µs)\n[✗]replication: primary mismatch detected: current: “xxx.xxx.xxx.xxx”, expected “yyy.yyy.yyy.yyy” (150.53µs)\n[✓]

This seems to be happening in all regions I’m running in (Singapore, Paris, and Dallas) and just continually re-elects a new master.

When the DB is healthy, in the logs, I see a heap of sequential logs like:

WARN cmd/sentinel.go:276 no keeper info available {“db”: “9b058e2e”, “keeper”: “1a0fa222”}"

Before eventually trying a new master election…

cdg [info] sentinel | 2022-06-14T09:03:46.435Z WARN cmd/sentinel.go:276 no keeper info available {“db”: “9b058e2e”, “keeper”: “1a0fa222”}
cdg [info] sentinel | 2022-06-14T09:03:46.435Z WARN cmd/sentinel.go:276 no keeper info available {“db”: “e5a6bdcc”, “keeper”: “12de0fa232”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:995 master db is failed {“db”: “9b058e2e”, “keeper”: “1a0fa222”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (–can-be-master=false) {“db”: “aceaa353”, “keeper”: “ead0fa442”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (–can-be-master=false) {“db”: “7a01dd56”, “keeper”: “5bd40fa432”}
cdg [info] sentinel | 2022-06-14T09:03:46.438Z INFO cmd/sentinel.go:1032 electing db as the new master {“db”: “e5a6bdcc”, “keeper”: “12de0fa232”}

And then a heap of other errors while a new master is attempted to be re-elected.

A couple of hours ago, I scaled up the postgres VMs to dedicated 4x with 8GB RAM and that doesn’t seem to have helped much (if at all).

Can someone help me understand what is going on here please?

1 Like

We are observing similar errors that started a week ago.

db logs:

2022-06-14T10:27:47Z app[c00d6e7f] iad [info]sentinel | 2022-06-14T10:27:47.317Z        WARN    cmd/sentinel.go:276     no keeper info available        {"db": "04fe1d40", "keeper": "775b0da4e2"}
...
2022-06-14T10:28:29Z app[60ac9f0a] iad [info]sentinel | 2022-06-14T10:28:29.292Z        ERROR   cmd/sentinel.go:102     election loop error     {"error": "failed to acquire lock: Unexpected response code: 500 (rpc error making call: invalid session \"bc5caf3a-8796-a077-a6d9-b525a220dd79\")"}
1 Like

We’re looking at these. The logs make it look like something is shutting them down. Have you all made any config changes to your postgres recently? I’m guessing not, but I’d like to be sure.

We just found a bug that could have been triggering this for some postgres clusters in the US. We think we might have fixed it, can you keep watching and let us know if this improves?

I’ve bumped RAM yesterday but it didn’t fix it. Prior to that the last change was 2 months ago.
I’ll keep watching logs/errors and let you know, thanks!

The original issue seems to have been resolved since your bug fix. Any more information you can share about what the bug/fix was?

There seems to be an issue with the health check for the master still though. Unsure whether it’s related.

The master reports a health check as criitical, but it looks a little off.

vm | critical | 03942d17 | dfw | HTTP | 1m7s ago

HTTP GET http://xxx.xxx.xxx.xxx:5500/flycheck/vm: 500 Internal Server Error Output: “[✓]
checkDisk: 45.65 GB (93.2%!)(MISSING) free space on /data/ (34.27µs)\n[✓]
checkLoad: load averages: 1.12 1.00 1.02 (57.12µs)\n[✓]
memory: system spent 0s of the last 60s waiting on memory (32.32µs)\n[✗]cpu: system spent 1.61s of the last 10 seconds waiting on cpu (24.43µs)\n[✓]
io: system spent 0s of the last 60s waiting on io (21.94µs)”[✓]

There is an unusual 500 error being reported when the health check is hitting the /flycheck/vm endpoint, and the memory check seems to fail – but it’s not obvious to me why.

On the subject of HTTP health checks, which may (TBC) be related to part of your query (the 500 Internal Server Error)…

I have a custom HTTP health check configured on an app and it appears that if it fails once it becomes - and remains - “critical”, regardless of the current state of the health check. It doesn’t seem to reset back from critical, even though I can subsequently see HTTP 200s for the health checks (and a tick next to it in flyctl checks list). This may only be a reporting, or my misunderstanding the output, issue but I’m not sure if HTTP health check status can yet be relied upon for the purpose of auto-restarting apps.

That is, if/when the CPU issue is resolved - you may still find the status as “critical” due to the HTTP check (even if the health check is passing - TBC).

My last comment is an extension of my issue of the postgres cluster being unhealthy and continually restarting. The restarting has been fixed but the health check doesn’t seem correct.

That’s an interesting observation that subsequent 200s don’t “fix” the critical status. It would be great to hear from Fly about why that is and if there’s a problem there.

My remaining issue with health checks is that I’m getting a 500 / critical from my writer/master postgres instance which I can’t understand. The failing check looks to be from this line:

memory: system spent 0s of the last 60s waiting on memory (29.4µs)\n[✗]cpu: system spent 2.48s of the last 10 seconds waiting on cpu (22.1µs)\n[✓]

So there’s no new line / line break so it’s a bit confusing, but the way I read that is the memory check is failing as each line from the command “fly checks list --app my-postgres” is suffixed with the cross/tick… but the info related with the failing memory check doesn’t indicate an issue.

I’d love to hear from Fly about what is going on here.

Re: “I’m getting a 500 / critical from my writer/master postgres instance which I can’t understand” - have you observed a HTTP 500 by (e.g.) trying the health-check locally/within the db’s ssh console (flyctl ssh console --app <db name> --select, then curl -v <URL from instance’s health-check>)?

In my case (and possibly yours) it is, with HTTP health checks (which I think is what you’re looking at), just a reporting problem - your HTTP 500 problem may not (TBC) currently exist. It could be that the flyctl checks list status HTTP check’s status is “stuck”/not updating correctly.

My flyctl checks lists shows a HTTP health check as “critical” and the OUTPUT is showing what I assume to be the previous error (with a tick at the end of it), which I guess was seen at the LAST UPDATED time. The status code I see in OUTPUT does not reflect the current state of the health check - your db’s HTTP health check may not be returning a HTTP 500.