It appears that today at ~09:43 UTC my 2x VM/replica Postgres instances (LHR), which have been running fine since March, took a turn for the worse.
Would it be possible for someone from Fly to have a look and see what may have triggered this? ("cmd/sentinel.go:1009 no eligible masters", etc.) I've emailed the support address with the app name.
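For reference, this is roughly how I'm pulling those errors out of the logs (the app name below is just a placeholder):

    # Stream the database app's logs and filter for the sentinel error
    $ fly logs -a <db-app-name> | grep "no eligible masters"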
Note: I haven't been making any changes, so I don't believe it's something I've done. So far they've "just worked", so I haven't had to touch them since inception. They're currently in the same state they've been in since this morning.
Possibly slightly concerning: /data on one of the VMs is using only megabytes vs gigabytes on the other, and that same VM's /data/dbstate shows "Initializing": true?
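In case it's useful, this is how I compared the two, via a shell on each VM (rough sketch; I'm assuming /data/dbstate is the state file I'm seeing):

    # Open a shell on each of the two VMs in turn
    $ fly ssh console -a <db-app-name>

    # Then, inside the VM: compare data volume usage and the keeper state file
    df -h /data
    cat /data/dbstate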
Also, tangentially related to my (unanswered) other query: since the issue started, the metrics for these two VMs show their load averages pegged at 1. However, unlike the previous query, neither the load average nor the CPU usage reported inside the VMs reflects this at all (i.e. very low CPU, load averages of ~0). Is the Firecracker load average metric reliable (or could this just be an artifact of shared CPUs)?
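For the in-VM numbers I'm just going off the usual kernel views (via fly ssh console), so I may well be looking at the wrong thing:

    # Inside the VM: the kernel's own view of load and CPU
    cat /proc/loadavg        # 1/5/15-minute load averages
    top -bn1 | head -n 5     # quick CPU usage snapshot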
Do the logs say anything about "too many clients"? This type of thing can happen if Postgres runs out of connection slots.
The quick thing to do here might be to run fly restart -a <db-name> and see if it helps. Then check the metrics history to see what the connection count looked like.
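If you'd like to check the live count directly before or after the restart, something like this should work (a sketch; the connection string is a placeholder for your actual credentials):

    # Compare current connections against the configured limit
    $ psql "postgres://postgres:<password>@<db-app-name>.internal:5432" \
        -c "SELECT count(*) AS connections FROM pg_stat_activity;" \
        -c "SHOW max_connections;"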
There is currently "FATAL: sorry, too many clients already" showing in the Monitoring logs. Are the connections normally cleared down automatically? There shouldn't currently be anything connected to them other than the inter-VM comms between the two of them and the TCP health checks.
I was hesitant to restart the app, lest any evidence as to the root cause disappear.
That's basically up to Postgres; it's possible something else went sideways and caused connections to back up. Either way, a restart is the right thing to do to troubleshoot.
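If you want to capture a bit of evidence before restarting, a quick look at pg_stat_activity will show who's holding the slots. This is just a sketch; it assumes you can still get a superuser connection in (a few slots are normally reserved for superusers), with a DATABASE_URL pointing at the database:

    # See who is holding connection slots and how long they've been in that state
    $ psql "$DATABASE_URL" -c "
        SELECT pid, usename, client_addr, state,
               now() - state_change AS in_state_for
        FROM pg_stat_activity
        ORDER BY state_change;"

    # If an idle session turns out to be the culprit, it can be kicked individually
    # (the pid here is illustrative)
    $ psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(12345);"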
If you run fly image update it'll get you a version of Postgres with less chatty logs, too, so if this happens again you may have better luck seeing what's up.
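Something like this, run against the database app (it should show you what's changing before applying anything):

    # Check the current Postgres image, then pull the newer one
    $ fly image show -a <db-name>
    $ fly image update -a <db-name>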