It appears that today at ~09:43 UTC my two-VM Postgres app (primary + replica, LHR), which had been running fine since March, took a turn for the worse.
Would it be possible for someone from Fly to take a look and see what may have triggered this? (The logs show `cmd/sentinel.go:1009 no eligible masters`, among other things.) I've emailed the support address with the app name.
Note: I haven't made any changes, so I don't believe it's something I've done; so far they've "just worked" and I haven't had to touch them since inception. They're currently in the same state they've been in since this morning.
Possibly slightly concerning: /data on one of them is using only megabytes vs. gigabytes on the other, and that same VM's /data/dbstate shows `"Initializing": true`?
Also, tangentially related to my (unanswered) other query: since the issue began, the metrics for these two VMs have shown their load averages pegged at 1. Unlike the previous query, though, neither the load average nor the CPU usage reported from inside the VMs reflects this at all (both are near zero). Is the Firecracker Load Average metric reliable, or could this just be an artifact of shared CPUs?
Do the logs say anything about “too many clients”? This type of thing can happen if Postgres runs out of connection slots.
The quick thing to do here might be to run `fly restart -a <db-name>` and see if it helps. Then check the metrics history to see what the connection count looked like.
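If a connection can still be opened, it may also be worth checking what's occupying the slots before restarting. A rough sketch, assuming you can reach the VM with `fly ssh console` and that `psql` is available inside it (the exact paths and the superuser name may differ on your image):

```shell
# Open a shell on the Postgres VM (app name is a placeholder).
fly ssh console -a <db-name>

# Inside the VM: how many slots are in use vs. the configured maximum?
psql -U postgres -c "SHOW max_connections;"
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Who is holding them? Grouping by user, client, and state helps spot a leak.
psql -U postgres -c "SELECT usename, client_addr, state, count(*) \
  FROM pg_stat_activity GROUP BY 1, 2, 3 ORDER BY count DESC;"
```

Note that Postgres reserves a few slots for superusers (`superuser_reserved_connections`), so a superuser login may still succeed even when regular clients are getting "too many clients already".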
The Monitoring logs are currently showing `FATAL: sorry, too many clients already`. Are connections normally cleared down automatically? There shouldn't currently be anything connected to them other than inter-VM comms between the two of them and the TCP health checks.
I was hesitant to restart the app, lest any evidence as to the root cause disappear.
That's basically up to Postgres; it's possible something else went sideways and caused connections to back up. Either way, a restart is the right thing to do to troubleshoot.
If you run `fly image update`, it'll get you a version of Postgres with less chatty logs too, so if this happens again you may have better luck seeing what's up.
To provide some closure:
- Attempted a restart: no dice, one VM remained "stuck".
- Tried the image update: same again (one lingering VM).
- Scaled to zero, which force-killed both VMs, then scaled back to two: everything now appears to be working.
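For anyone hitting the same wall, the scale-down/scale-up sequence looked roughly like this (`<db-name>` stands in for the actual app name):

```shell
# A plain restart left one VM stuck, so force both down and back up.
fly scale count 0 -a <db-name>   # force-kills both VMs
fly scale count 2 -a <db-name>   # bring the pair back
fly status -a <db-name>          # confirm both VMs come up healthy
```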
The Firecracker Load Average metric remains a mystery, and is possibly not something to be relied upon.