Long Postgres outage (consul i/o timeout)

Hi!

Yesterday my application faced two prolonged outages, due to the postgres cluster being down/unavailable (2 nodes, “standard fly.io stolon” setup). The cluster is running in the “ams” region.

The issue started at 2023-02-08 10:54:46 UTC.

sentinel | 2023-02-08T10:54:46.239Z	ERROR	cmd/sentinel.go:102	election loop error: failed to acquire lock: Unexpected response code: 500 (invalid session "<session id>")
proxy    | [WARNING] 038/105445 (575) : Server bk_db/pg1 was DOWN and now enters maintenance (DNS timeout status).
proxy    | [WARNING] 038/105445 (575) : Server bk_db/pg2 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
proxy    | [ALERT] 038/105445 (575) : backend 'bk_db' has no server available!
sentinel | panic: close of closed channel
sentinel | goroutine 501415 [running]:
sentinel | github.com/superfly/leadership.(*Candidate).initLock(0xc00035cbd0)
sentinel | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:98 +0x2e
sentinel | github.com/superfly/leadership.(*Candidate).campaign(0xc00035cbd0)
sentinel | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:124 +0xc6
sentinel | created by github.com/superfly/leadership.(*Candidate).RunForElection
sentinel | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:60 +0xc5
sentinel | exit status 2
sentinel | restarting in 3s [attempt 146]
sentinel | Running...
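For context, "close of closed channel" is a generic Go panic: closing a channel a second time always crashes the process, which is why the sentinel exits and restarts (attempt 146 above). This is not the actual superfly/leadership code, just a minimal sketch of that failure mode and the usual idempotent-close guard:

package main

import (
	"fmt"
	"sync"
)

func main() {
	// Closing an already-closed channel always panics in Go.
	ch := make(chan struct{})
	close(ch)
	// close(ch) // would panic: "close of closed channel", as in the sentinel trace

	// A common guard is to make the close idempotent, e.g. with sync.Once.
	done := make(chan struct{})
	var once sync.Once
	safeClose := func() { once.Do(func() { close(done) }) }
	safeClose()
	safeClose() // second call is a no-op
	fmt.Println("no panic")
}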

After the sentinel restarts, we see 10 minutes of the following (a message that also appeared before the outage, and that we still see today, multiple times throughout the day):

sentinel | 2023-02-08T10:55:15.529Z	WARN	cmd/sentinel.go:276	no keeper info available

At 11:04 we get:

sentinel | 2023-02-08T11:04:00.283Z	ERROR	cmd/sentinel.go:1852	error retrieving cluster data
Get "https://consul-fra-4.fly-shared.net/v1/kv/<ids>/clusterdata?consistent=&wait=5000ms": dial tcp: lookup consul-fra-4.fly-shared.net on [<ipv6 short>]:53: read udp [<ipv6>]:56007->[<ipv6 short>]:53: i/o timeout
sentinel | 2023-02-08T11:04:10.294Z	ERROR	cmd/sentinel.go:1852	error retrieving cluster data (same GET against consul failure)
sentinel | 2023-02-08T11:04:10.295Z	ERROR	cmd/sentinel.go:102	election loop error:
Put "https://consul-fra-4.fly-shared.net/v1/session/create?wait=5000ms": dial tcp: lookup consul-fra-4.fly-shared.net on [<ipv6 short>]:53: read udp [<ipv6>]:55529->[<ipv6 short>]:53: i/o timeout
proxy    | [WARNING] 038/110433 (575) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
proxy    | [WARNING] 038/110433 (575) : Server bk_db/pg2 ('ams.<db>.internal') is UP/READY (resolves again).
proxy    | [WARNING] 038/110433 (575) : Server bk_db/pg2 administratively READY thanks to valid DNS answer.
proxy    | [WARNING] 038/110434 (575) : Server bk_db/pg1 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 246ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
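The "dial tcp: lookup ... i/o timeout" part means the request failed at DNS resolution of the Consul hostname, before any TCP connection was attempted, which also lines up with HAProxy marking both servers DOWN with "DNS timeout status". A rough way to reproduce just the lookup step from inside the VM (hostname copied from the log above, the 5-second timeout is my assumption):

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Resolve only the Consul hostname. If this times out, the sentinel's
	// "i/o timeout" error is a DNS problem, not a Consul/TCP problem.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := net.DefaultResolver.LookupHost(ctx, "consul-fra-4.fly-shared.net")
	if err != nil {
		fmt.Println("DNS lookup failed:", err) // e.g. "read udp ...:53: i/o timeout"
		return
	}
	fmt.Println("resolved to:", addrs)
}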

At 11:10, this loop happens again:

sentinel | 2023-02-08T11:05:16.516Z	ERROR	cmd/sentinel.go:1852	error retrieving cluster data (...)
proxy    | [WARNING] 038/110523 (575) : Server bk_db/pg1 was DOWN and now enters maintenance (DNS timeout status).
proxy    | [WARNING] 038/110523 (575) : Server bk_db/pg2 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
proxy    | [ALERT] 038/110523 (575) : backend 'bk_db' has no server available!
sentinel | 2023-02-08T11:05:31.521Z	ERROR	cmd/sentinel.go:1852	error retrieving cluster data
keeper   | 2023-02-08T11:07:04.753Z	ERROR	cmd/keeper.go:870	failed to update keeper info

It continues like this until 12:24, with the final messages:

proxy    | [WARNING] 038/122446 (575) : Server bk_db/pg1 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 76ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
sentinel | 2023-02-08T12:25:06.970Z	WARN	cmd/sentinel.go:276	no keeper info available

Then it continues with “no keeper info available” every 10 seconds until 13:04, when Postgres finally attempts to start up again:

Starting init (commit: 81d5330)...
...
exporter | INFO[0000] Starting Server: :9187                        source="postgres_exporter.go:1837"
keeper   | 2023-02-08T13:04:50.563Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters: dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory
sentinel | 2023-02-08T13:04:51.228Z	ERROR	cmd/sentinel.go:1895	cannot get keepers info: unexpected end of JSON input
...
proxy    | [WARNING] 038/131151 (579) : Server bk_db/pg1 was DOWN and now enters maintenance (No IP for server ).
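For what it’s worth, the keeper error at 13:04 (“dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory”) just means the keeper tried to read Postgres parameters over the local unix socket before Postgres had created it. A minimal sketch of that same check (socket path copied from the log, the 2-second timeout is my assumption):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// If Postgres has not started yet, the socket file is missing and this
	// fails with "no such file or directory", exactly as in the keeper log.
	conn, err := net.DialTimeout("unix", "/tmp/.s.PGSQL.5433", 2*time.Second)
	if err != nil {
		fmt.Println("socket not ready:", err)
		return
	}
	defer conn.Close()
	fmt.Println("postgres socket is accepting connections")
}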

It’s a bit unclear to me what went wrong here. I understand that fly does not offer a managed postgres solution, but this does not seem like an issue with one specific postgres machine. Was there a wider issue with the ams region at the time (I cannot see anything reported)? It seems that both my leader and replica went down for around 30 minutes, and then for another 50 minutes, rendering my application unusable. The cluster also never attempted to fail over from the leader to the replica.

I had similar issues today, also in the ams region, lasting for about 20 minutes or so. It’s a 3-node setup that hasn’t been touched for months.

ERRO[8746493] Error opening connection to database (postgresql://
WARN	cmd/sentinel.go:276	no keeper info available
[WARNING] 040/111827 (571) : Server bk_db/pg1 is DOWN, reason: Layer4 timeout, check duration: 2001ms. 0 active and 0 backup servers left
[WARNING] 040/111800 (571) : Server bk_db/pg2 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match"
[WARNING] 040/111804 (571) : Server bk_db/pg3 is DOWN, reason: Layer7 timeout, check duration: 5005ms.

Hi @nouzlin,

A single application-host server in ams did experience a hardware failure, entering a degraded state (from 10:54-11:20 and 12:17-13:03, roughly matching the timestamps you mentioned), eventually requiring a manual hard reboot of the server to fully recover. This was not a broader issue with the ams region as a whole, just a server hosting one of the two instances comprising your postgres cluster.

I can’t speak to why the Postgres cluster-management setup didn’t trigger an automatic failover to the replica in your case. I do know that we’ve been actively working on a new iteration of Postgres clustering in an attempt to address some known reliability/stability issues with the current clustering setup.

@Elder, a separate host in ams experienced a much shorter service interruption from 2023-02-10T11:07:00Z to 2023-02-10T11:20:00Z, which probably affected a single instance in your cluster.


Hi @wjordan, thanks for the response.

The time matches. But I have a cluster of 3 nodes, and 2 of them showed 1 health check as critical (see the instance list below). It seems that a leader couldn’t be chosen.

Shouldn’t a multi-node setup prevent downtime in cases where one server is down?

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS                	HEALTH CHECKS                 	RESTARTS	CREATED
aa9bf0xx	app    	13     	ams   	run    	running (replica)     	3 total, 3 passing            	0       	2023-02-08T13:03:57Z
abd815xx	app    	13     	ams   	run    	running (Get "http://)	3 total, 2 passing, 1 critical	0       	2022-10-16T08:09:24Z
ebd9eexx	app    	13     	ams   	run    	running (replica)     	3 total, 2 passing, 1 critical	0       	2022-10-16T08:09:24Z