It appears that today at ~09:43 UTC my two-VM Postgres app (primary + replica, LHR), which had been running fine since March, took a turn for the worse.
Would it be possible for someone from Fly to take a look and see what may have triggered this? (The logs show `cmd/sentinel.go:1009 no eligible masters`, among other things.) I've emailed the support address with the app name.
Note: I haven't made any changes, so I don't believe it's something I've done; so far they've "just worked" and I haven't had to touch them since inception. They're currently in the same state they've been in since this morning.
Possibly slightly concerning: /data on one of them is using only megabytes vs. gigabytes on the other, and that same VM's /data/dbstate shows `"Initializing": true`?
Also, tangentially related to my (unanswered) other query: since the issue began, the metrics for these two VMs have shown their load averages pegged at 1. Unlike the previous query, though, neither the load average nor the CPU usage reported from inside the VMs reflects this at all (both are near zero). Is the Firecracker Load Average metric reliable, or could this just be an artifact of shared CPUs?
Do the logs say anything about “too many clients”? This type of thing can happen if Postgres runs out of connection slots.
The quick thing to do here might be to run `fly restart -a <db-name>` and see if it helps. Then check the metrics history to see what the connection count looked like.
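If a connection can still be opened, it may also be worth checking what's occupying the slots before restarting. A rough sketch, assuming you can reach the VM with `fly ssh console` and that `psql` is available inside it (the exact paths and the superuser name may differ on your image):

```shell
# Open a shell on the Postgres VM (app name is a placeholder).
fly ssh console -a <db-name>

# Inside the VM: how many slots are in use vs. the configured maximum?
psql -U postgres -c "SHOW max_connections;"
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Who is holding them? Grouping by user, client, and state helps spot a leak.
psql -U postgres -c "SELECT usename, client_addr, state, count(*) \
  FROM pg_stat_activity GROUP BY 1, 2, 3 ORDER BY count DESC;"
```

Note that Postgres reserves a few slots for superusers (`superuser_reserved_connections`), so a superuser login may still succeed even when regular clients are getting "too many clients already".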
The Monitoring logs are currently showing `FATAL: sorry, too many clients already`. Are connections normally cleared down automatically? There shouldn't currently be anything connected to them other than inter-VM comms between the two of them and the TCP health checks.
I was hesitant to restart the app, lest any evidence as to the root cause disappear.
That's basically up to Postgres; it's possible something else went sideways and caused connections to back up. Either way, a restart is the right thing to do to troubleshoot.
If you run `fly image update`, it'll get you a version of Postgres with less chatty logs too, so if this happens again you may have better luck seeing what's up.
To provide some closure:
- Attempted a restart: no dice, one VM remained "stuck".
- Tried the image update: same again (one lingering VM).
- Scaled to zero, which force-killed both VMs, then scaled back to two: everything now appears to be working.
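For anyone hitting the same wall, the scale-down/scale-up sequence looked roughly like this (`<db-name>` stands in for the actual app name):

```shell
# A plain restart left one VM stuck, so force both down and back up.
fly scale count 0 -a <db-name>   # force-kills both VMs
fly scale count 2 -a <db-name>   # bring the pair back
fly status -a <db-name>          # confirm both VMs come up healthy
```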
The Firecracker Load Average metric remains a mystery, and is possibly not something to be relied upon.