"Server bk_db/pg1 is going DOWN for maintenance" continous error. My DB won't go up.

It was working fine but then stopped (possibly after a deploy; this is a staging environment). I don't understand why this happened and I don't know how to fix it. These are my logs:

2023-01-08T23:26:38Z app[9080116a126787] gru [info]proxy    | [WARNING] 007/232638 (562) : Server bk_db/pg1 ('gru.magic-mail-bot-staging-db.internal') is UP/READY (resolves again).
2023-01-08T23:26:38Z app[9080116a126787] gru [info]proxy    | [WARNING] 007/232638 (562) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | panic: close of closed channel
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel |
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | goroutine 4033 [running]:
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | github.com/superfly/leadership.(*Candidate).initLock(0xc000138b60)
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:98 +0x2e
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | github.com/superfly/leadership.(*Candidate).campaign(0xc000138b60)
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:124 +0xc6
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | created by github.com/superfly/leadership.(*Candidate).RunForElection
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | 	/go/pkg/mod/github.com/superfly/leadership@v0.2.1/candidate.go:60 +0xc5
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | exit status 2
2023-01-08T23:26:57Z app[9080116a126787] gru [info]sentinel | restarting in 3s [attempt 16]
2023-01-08T23:27:00Z app[9080116a126787] gru [info]sentinel | Running...
2023-01-09T00:20:47Z app[9080116a126787] gru [info]sentinel | 2023-01-09T00:20:47.502Z	ERROR	cmd/sentinel.go:1947	error saving clusterdata	{"error": "Unexpected response code: 502"}
2023-01-09T01:20:51Z app[9080116a126787] gru [info]proxy    | [WARNING] 008/012051 (562) : Server bk_db/pg1 is going DOWN for maintenance (DNS timeout status). 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-09T01:20:52Z app[9080116a126787] gru [info]proxy    | [WARNING] 008/012052 (562) : Server bk_db/pg1 ('gru.magic-mail-bot-staging-db.internal') is UP/READY (resolves again).
2023-01-09T01:20:52Z app[9080116a126787] gru [info]proxy    | [WARNING] 008/012052 (562) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-09T02:10:41Z app[9080116a126787] gru [info]proxy    | [WARNING] 008/021041 (562) : Server bk_db/pg1 ('gru.magic-mail-bot-staging-db.internal') is UP/READY (resolves again).
2023-01-09T02:25:15Z app[9080116a126787] gru [info]sentinel | 2023-01-09T02:24:43.325Z	FATAL	cmd/sentinel.go:2030	cannot create sentinel: cannot create store: cannot create kv store: Put "https://consul-iad.fly-shared.net/v1/catalog/register?wait=5000ms": dial tcp: lookup consul-iad.fly-shared.net on [fdaa::3]:53: read udp [fdaa:0:17d0:a7b:1f61:7140:7360:2]:33112->[fdaa::3]:53: i/o timeout
2023-01-09T02:25:25Z app[9080116a126787] gru [info]sentinel | Running...
2023-01-09T02:35:18Z app[9080116a126787] gru [info]sentinel | 2023-01-09T02:33:04.861Z	ERROR	cmd/sentinel.go:102	election loop error	{"error": "Put \"https://consul-iad.fly-shared.net/v1/session/create?wait=5000ms\": dial tcp: lookup consul-iad.fly-shared.net on [fdaa::3]:53: read udp [fdaa:0:17d0:a7b:1f61:7140:7360:2]:46825->[fdaa::3]:53: i/o timeout"}
2023-01-09T02:35:18Z app[9080116a126787] gru [info]keeper   | 2023-01-09T02:33:45.021Z	ERROR	cmd/keeper.go:1041	error retrieving cluster data	{"error": "Get \"https://consul-iad.fly-shared.net/v1/kv/magic-mail-bot-staging-db-yexkqw5yedjqm38d/magic-mail-bot-staging-db/clusterdata?consistent=&wait=5000ms\": dial tcp: lookup consul-iad.fly-shared.net on [fdaa::3]:53: read udp [fdaa:0:17d0:a7b:1f61:7140:7360:2]:37833->[fdaa::3]:53: i/o timeout"}
2023-01-09T02:45:47Z app[9080116a126787] gru [info]sentinel | 2023-01-09T02:45:47.132Z	ERROR	cmd/sentinel.go:102	election loop error	{"error": "Put \"https://consul-iad.fly-shared.net/v1/session/create?wait=5000ms\": dial tcp: lookup consul-iad.fly-shared.net on [fdaa::3]:53: read udp [fdaa:0:17d0:a7b:1f61:7140:7360:2]:49717->[fdaa::3]:53: i/o timeout"}
2023-01-09T02:49:23Z app[9080116a126787] gru [info]sentinel | 2023-01-09T02:49:23.710Z	FATAL	cmd/sentinel.go:2030	cannot create sentinel: cannot create store: cannot create kv store: Put "https://consul-iad.fly-shared.net/v1/catalog/register?wait=5000ms": dial tcp: lookup consul-iad.fly-shared.net on [fdaa::3]:53: read udp [fdaa:0:17d0:a7b:1f61:7140:7360:2]:57977->[fdaa::3]:53: i/o timeout
2023-01-09T02:49:23Z app[9080116a126787] gru [info]sentinel | exit status 1
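
In case it helps, this is roughly how I've been poking at the app from the CLI (a sketch only; the app name is taken from the log hostnames above, and exact flyctl subcommands/flags may differ by version):

# Overall app status and health checks
fly status -a magic-mail-bot-staging-db
fly checks list -a magic-mail-bot-staging-db

# List the individual machines backing the Postgres app and their current state
fly machines list -a magic-mail-bot-staging-db

# Tail the app's logs
fly logs -a magic-mail-bot-staging-db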

Hey there! Is this still a problem? If you restart, does that fix the issue?

It looks like those timestamps correspond with an issue we had on the host that machine is running on, which should be fixed.

Hi @DAlperin, now I understand the issue:
it broke during a deployment, and that also broke the machine inside the app (which I didn't know existed). I tried starting the DB instance with no success; I only managed to solve it by starting the broken machine again, after which the instance was back up.
IMHO it should automatically try to restart the broken machine if it fails for any reason.
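
For anyone hitting the same thing, this is roughly what the fix looked like (a sketch; the machine ID is a placeholder, and flyctl subcommand names may vary by version):

# Find the stopped/broken machine behind the Postgres app
fly machines list -a magic-mail-bot-staging-db

# Start it again (replace <machine-id> with the ID shown by the list command)
fly machine start <machine-id> -a magic-mail-bot-staging-db

# Or restart it if it's wedged rather than cleanly stopped
fly machine restart <machine-id> -a magic-mail-bot-staging-db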