Any thoughts? It’s the pgdb com one if you want to restart it at your end.
Logs are full of errors. 502s, and now … 429s? Hmm. The database is not under load and has barely any connections, just one or two app instances connecting to it. I can’t connect to it from home either; it says connection refused.
It’s in LHR:
2021-06-23T21:10:54.158364109Z app[9fff7785] lhr [info] keeper | 2021-06-23T21:10:54.154Z ERROR cmd/keeper.go:1010 error retrieving cluster data {"error": "Unexpected response code: 502"}
2021-06-23T21:11:27.163145352Z app[9fff7785] lhr [info] keeper | 2021-06-23T21:11:27.158Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:11:39.636908431Z app[0a9a0048] lhr [info] keeper | 2021-06-23T21:11:39.632Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:11:56.937007617Z app[9fff7785] lhr [info] keeper | 2021-06-23T21:11:56.933Z ERROR cmd/keeper.go:1010 error retrieving cluster data {"error": "Unexpected response code: 502"}
2021-06-23T21:12:16.994153123Z app[0a9a0048] lhr [info] keeper | 2021-06-23T21:12:16.989Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:20.331299106Z app[9fff7785] lhr [info] keeper | 2021-06-23T21:12:20.326Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:22.745991358Z app[0a9a0048] lhr [info] keeper | 2021-06-23T21:12:22.741Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:26.073285497Z app[9fff7785] lhr [info] keeper | 2021-06-23T21:12:26.069Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:38.444730807Z app[0a9a0048] lhr [info] keeper | 2021-06-23T21:12:38.439Z ERROR cmd/keeper.go:839 failed to update keeper info {"error": "Unexpected response code: 429 (Your IP is issuing too many concurrent connections, please rate limit your calls\n)"}
2021-06-23T21:17:22.884537834Z app[0a9a0048] lhr [info] keeper | 2021-06-23T21:17:22.880Z ERROR cmd/keeper.go:1010 error retrieving cluster data {"error": "Unexpected response code: 429"}
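For context on those keeper errors: a stolon keeper registers itself in Consul using a session with a short TTL and has to keep renewing it; the 502/429 responses above mean those session calls to Consul are failing, so the keeper can neither publish its info nor read cluster data. Below is a minimal Go sketch of that create/renew loop using the standard hashicorp/consul/api client; it is not stolon’s actual code, and the session name and TTL values are made up for illustration.

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (address comes from api.DefaultConfig).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Create a session with a short TTL, roughly how a keeper advertises itself.
	id, _, err := client.Session().Create(&api.SessionEntry{
		Name: "stolon-keeper-demo", // hypothetical name
		TTL:  "15s",
	}, nil)
	if err != nil {
		// This is the shape of the failure when Consul answers 502/429.
		log.Fatalf("cannot set session for ttl: %v", err)
	}

	// Renew the session periodically. If Consul is unreachable, the renewals
	// fail, the session expires, and the keeper's info drops out of the store.
	for range time.Tick(5 * time.Second) {
		if _, _, err := client.Session().Renew(id, nil); err != nil {
			log.Printf("cannot renew session for ttl: %v", err)
		}
	}
}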
The flyctl status shows it as failed, though it also lists two instances as running.
Status = failed
Description = Failed due to unhealthy allocations
Now showing two instances as running and health checks passing. But … not according to the status line above them. Weird. That still shows as failed.
Status = failed
Description = Failed due to unhealthy allocations
Instances = 2 desired, 2 placed, 1 healthy, 1 unhealthy
Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
29d50728 2 lhr run running (leader) 3 total, 3 passing 0 2m5s ago
f5a70632 2 lhr run running (replica) 3 total, 3 passing 0 3m47s ago
Ah. Yes, it still shows failed as the status. It would be good if that were the current status, not the prior one, i.e. if it updated on a new restart/deploy.
Did you do flyctl restart?
Is that what I should do? Ideally that would happen automatically if your system could detect the error. I guess it can’t be based on (status === ‘failed’) if that status doesn’t reflect the current state, so it would need some other monitor.
I stopped and restarted each VM. Our shared Consul service vanished from DNS earlier (a different bug), which likely knocked your cluster out. We’ve disabled a lot of auto-restart features on the Postgres VMs because normally you don’t want them! But that makes things brittle.
We’re actively working on this. The stolon + Consul setup we have is not making us very happy, so we’re trying to figure out more reliable options. Stolon errs on the side of “being really, really safe with data”, which is good, but its Consul implementation isn’t very resilient.
I’m willing you on. A reliable and cheap hosted database remains elusive. It seems like you can have one or the other; if you could be both, that would be neat.
Hmm. Well, I’m not sure how practical this is, given the numbers, but maybe you could have an automated health check that tries to connect to a database app and run a query, even something top-level like listing tables or databases (so it needs barely any CPU/RAM and should respond instantly). If the query fails, restart the db instance, like you did (stop/start, whatever); something along the lines of the sketch after this message.
As that fixed it.
That would save human intervention (inevitably at like 2am).
Otherwise, yes, it’s a case of the customer (or someone at your end) happening to notice an error and manually fixing it, which isn’t ideal.
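Very roughly, something like this. It’s only a sketch: the DSN, credentials, and app name (“pgdb-example”) are placeholders, and the “fix” just shells out to flyctl restart the same way the manual stop/start did, rather than being anything Fly actually runs.

package main

import (
	"context"
	"database/sql"
	"log"
	"os/exec"
	"time"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Placeholder connection string for the database app being checked.
	dsn := "postgres://healthcheck:secret@pgdb-example.internal:5432/postgres?sslmode=disable"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A near-free query: if the server is up and accepting connections,
	// this should answer instantly.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var one int
	err = db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
	if err == nil {
		log.Println("database healthy")
		return
	}
	log.Printf("health query failed: %v; restarting", err)

	// Same remedy as the manual fix in this thread: restart the database app.
	out, err := exec.Command("flyctl", "restart", "pgdb-example").CombinedOutput()
	if err != nil {
		log.Fatalf("restart failed: %v\n%s", err, out)
	}
	log.Printf("restart issued:\n%s", out)
}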
That’s how scripted health checks and restarts work for “normal” Fly apps. We just have them disabled for pg apps to prevent flapping restarts from breaking failover. In this case, restarting still failed until Consul came back up. That’s why either fixing or replacing Consul for stolon is a high priority.