Postgres leader check failure

Is there an outage currently? My database cluster died.

$ flyctl --app production-db checks list
Health Checks for production-db
NAME STATUS ALLOCATION REGION TYPE LAST UPDATED OUTPUT
role critical e0f54567 iad HTTP 8s ago context deadline exceeded
pg critical e0f54567 iad HTTP 14s ago HTTP GET
http://172.19.4.170:5500/flycheck/pg:
500 Internal Server Error
Output: “[✗] leader check:
lookup production-db.internal on
[fdaa::3]:53: no such host”
vm warning e0f54567 iad HTTP 18s ago

Logs here

Update: it’s back up now

There isn’t an outage that we are aware of. If you run fly status --all do you see any failed instances? And how long have the ones that are there been running?

Also will you run fly image version and see if it recommends an upgrade?

Hey @enaia

I was able to verify that the Etcd backend store encountered some issues around that time. I’m still investigating the root cause, but things seem to be stable as of right now.

2 Likes

$ fly status --app production-db --all
App
Name = production-db
Owner = enaia
Version = 4
Status = running
Hostname = production-db.fly.dev

Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
fbcf2c92 app 4 iad run running (replica) 3 total, 3 passing 0 29m48s ago
e0f54567 app 4 iad run running (leader) 3 total, 3 passing 0 33m44s ago
c8d852bd app 4 iad stop failed 3 total 2 2021-11-07T19:56:01Z
657286da app 4 iad stop failed 3 total, 2 critical 2 2021-11-07T19:54:02Z

Not sure how to run fly image version?

@enaia I think he mean’t fly image show.