Long Postgres outage (consul i/o timeout)

wjordan · February 11, 2023, 12:26am

A single application-host server in ams did experience a hardware failure, entering a degraded state (from 10:54-11:20 and 12:17-13:03- roughly matching the timestamps you mentioned), eventually requiring a manual hard-reboot of the server to fully recover. This was not a broader issue with the ams region as a whole, just a server hosting one of the two instances comprising your postgres cluster.

I can’t speak to why the Postgres cluster-management setup didn’t trigger an automatic failover to the replica in your case. I do know that we’ve been actively working on a new iteration of Postgres clustering in an attempt to address some known reliability/stability issues with the current clustering setup.

@Elder, a separate host in ams experienced a much shorter service interruption from 2023-02-10T11:07:00Z→2023-02-10T11:20:00Z, that probably affected a single instance in your cluster.