Hi!
Yesterday my application faced two prolonged outages due to the postgres cluster being down/unavailable (2 nodes, “standard fly.io stolon” setup). The cluster is running in the “ams” region.
The issue started at 2023-02-08 10:54:46 UTC.
sentinel | 2023-02-08T10:54:46.239Z ERROR cmd/sentinel.go:102 election loop error: failed to acquire lock: Unexpected response code: 500 (invalid session "<session id>")
proxy | [WARNING] 038/105445 (575) : Server bk_db/pg1 was DOWN and now enters maintenance (DNS timeout status).
proxy | [WARNING] 038/105445 (575) : Server bk_db/pg2 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
proxy | [ALERT] 038/105445 (575) : backend 'bk_db' has no server available!
sentinel | panic: close of closed channel
sentinel | goroutine 501415 [running]:
sentinel | github.com/superfly/leadership.(*Candidate).initLock(0xc00035cbd0)
sentinel | /go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:98 +0x2e
sentinel | github.com/superfly/leadership.(*Candidate).campaign(0xc00035cbd0)
sentinel | /go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:124 +0xc6
sentinel | created by github.com/superfly/leadership.(*Candidate).RunForElection
sentinel | /go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:60 +0xc5
sentinel | exit status 2
sentinel | restarting in 3s [attempt 146]
sentinel | Running...
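If I read the stack trace correctly, this is the standard Go runtime panic for closing an already-closed channel (which would also explain the “exit status 2”). A minimal sketch of that failure mode, purely for illustration and not the actual superfly/leadership code:

```go
package main

// Illustration only: closing a channel that has already been closed
// panics with "close of closed channel", the same message as in the
// sentinel logs. My assumption is that the election retry path in
// leadership's initLock/campaign ends up doing something equivalent.
func main() {
	lockCh := make(chan struct{})
	close(lockCh)
	close(lockCh) // panic: close of closed channel (process exits with status 2)
}
```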
This is followed by 10 minutes of the message below (which we still see today, and also saw before the outage, multiple times throughout the day):
sentinel | 2023-02-08T10:55:15.529Z WARN cmd/sentinel.go:276 no keeper info available
At 11:04 we get:
sentinel | 2023-02-08T11:04:00.283Z ERROR cmd/sentinel.go:1852 error retrieving cluster data
Get "https://consul-fra-4.fly-shared.net/v1/kv/<ids>/clusterdata?consistent=&wait=5000ms": dial tcp: lookup consul-fra-4.fly-shared.net on [<ipv6 short>]:53: read udp [<ipv6>]:56007->[<ipv6 short>]:53: i/o timeout
sentinel | 2023-02-08T11:04:10.294Z ERROR cmd/sentinel.go:1852 error retrieving cluster data (same GET against consul failure)
sentinel | 2023-02-08T11:04:10.295Z ERROR cmd/sentinel.go:102 election loop error:
Put "https://consul-fra-4.fly-shared.net/v1/session/create?wait=5000ms": dial tcp: lookup consul-fra-4.fly-shared.net on [<ipv6 short>]:53: read udp [<ipv6>]:55529->[<ipv6 short>]:53: i/o timeout
proxy | [WARNING] 038/110433 (575) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
proxy | [WARNING] 038/110433 (575) : Server bk_db/pg2 ('ams.<db>.internal') is UP/READY (resolves again).
proxy | [WARNING] 038/110433 (575) : Server bk_db/pg2 administratively READY thanks to valid DNS answer.
proxy | [WARNING] 038/110434 (575) : Server bk_db/pg1 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 246ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
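These errors look like plain DNS failures to me: the sentinel cannot resolve consul-fra-4.fly-shared.net because the local resolver on port 53 never answers within the 5s window. As a sanity check (my own diagnostic sketch, not anything stolon itself does), a lookup with a short deadline from inside the VM should fail with the same kind of “lookup …: i/o timeout” error while the resolver is unresponsive:

```go
package main

// Sketch: resolve the shared consul hostname with a 5s deadline, roughly
// mirroring the sentinel's failing lookups. The hostname and timeout are
// taken from the log lines above; everything else is my assumption.
import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := net.DefaultResolver.LookupHost(ctx, "consul-fra-4.fly-shared.net")
	if err != nil {
		// With an unresponsive resolver this surfaces as
		// "lookup consul-fra-4.fly-shared.net ...: i/o timeout".
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```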
At 11:10, this loop happens again:
sentinel | 2023-02-08T11:05:16.516Z ERROR cmd/sentinel.go:1852 error retrieving cluster data (...)
proxy | [WARNING] 038/110523 (575) : Server bk_db/pg1 was DOWN and now enters maintenance (DNS timeout status).
proxy | [WARNING] 038/110523 (575) : Server bk_db/pg2 is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
proxy | [ALERT] 038/110523 (575) : backend 'bk_db' has no server available!
sentinel | 2023-02-08T11:05:31.521Z ERROR cmd/sentinel.go:1852 error retrieving cluster data
keeper | 2023-02-08T11:07:04.753Z ERROR cmd/keeper.go:870 failed to update keeper info
It continues like this until 12:24, with the final messages:
proxy | [WARNING] 038/122446 (575) : Server bk_db/pg1 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 76ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
sentinel | 2023-02-08T12:25:06.970Z WARN cmd/sentinel.go:276 no keeper info available
Then it continues with “no keeper info available” every 10 seconds until 13:04, when postgres finally attempts to start up again:
Starting init (commit: 81d5330)...
...
exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
keeper | 2023-02-08T13:04:50.563Z ERROR cmd/keeper.go:719 cannot get configured pg parameters: dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory
sentinel | 2023-02-08T13:04:51.228Z ERROR cmd/sentinel.go:1895 cannot get keepers info: unexpected end of JSON input
...
proxy | [WARNING] 038/131151 (579) : Server bk_db/pg1 was DOWN and now enters maintenance (No IP for server ).
It’s a bit unclear to me what went wrong here. I understand that fly does not offer a managed postgres solution, but this does not look like an issue with one specific postgres machine. Was there a wider issue with the ams region at the time (I cannot see anything reported)? It seems that both my leader and replica went down for around 30 minutes and then for another 50 minutes, rendering my application unusable. The cluster did not attempt to fail over from the leader to the replica either.