Database down again

Hello,

I see my database has stopped working again (e.g. https://community.fly.io/t/database-down-again-lhr).

Any thoughts? It’s the pgdb com one if you want to restart it at your end.

Logs are full of errors: 502s, and now … 429s? Hmm. The database is not under load and has barely any connections, only one or two app instances connecting to it. I can’t connect to it from home either; it says connection refused.

It’s in LHR:

2021-06-23T21:10:54.158364109Z app[9fff7785] lhr [info] keeper            | 2021-06-23T21:10:54.154Z	ERROR	cmd/keeper.go:1010	error retrieving cluster data	{"error": "Unexpected response code: 502"}
2021-06-23T21:11:27.163145352Z app[9fff7785] lhr [info] keeper            | 2021-06-23T21:11:27.158Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:11:39.636908431Z app[0a9a0048] lhr [info] keeper            | 2021-06-23T21:11:39.632Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:11:56.937007617Z app[9fff7785] lhr [info] keeper            | 2021-06-23T21:11:56.933Z	ERROR	cmd/keeper.go:1010	error retrieving cluster data	{"error": "Unexpected response code: 502"}
2021-06-23T21:12:16.994153123Z app[0a9a0048] lhr [info] keeper            | 2021-06-23T21:12:16.989Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:20.331299106Z app[9fff7785] lhr [info] keeper            | 2021-06-23T21:12:20.326Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:22.745991358Z app[0a9a0048] lhr [info] keeper            | 2021-06-23T21:12:22.741Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:26.073285497Z app[9fff7785] lhr [info] keeper            | 2021-06-23T21:12:26.069Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "cannot set or renew session for ttl, unable to operate on sessions"}
2021-06-23T21:12:38.444730807Z app[0a9a0048] lhr [info] keeper            | 2021-06-23T21:12:38.439Z	ERROR	cmd/keeper.go:839	failed to update keeper info	{"error": "Unexpected response code: 429 (Your IP is issuing too many concurrent connections, please rate limit your calls\n)"}
2021-06-23T21:17:22.884537834Z app[0a9a0048] lhr [info] keeper            | 2021-06-23T21:17:22.880Z	ERROR	cmd/keeper.go:1010	error retrieving cluster data	{"error": "Unexpected response code: 429"}

flyctl status shows it as failed, though it also shows two instances as running.

Status      = failed                                       
  Description = Failed due to unhealthy allocations

Looking now

Thanks :slight_smile:

Not sure if you just restarted it?

Now showing two instances as running and health checks passing. But … not according to the status line above them. Weird. That still shows as failed.

  Status      = failed                                       
  Description = Failed due to unhealthy allocations          
  Instances   = 2 desired, 2 placed, 1 healthy, 1 unhealthy  

Instances
ID       VERSION REGION DESIRED STATUS            HEALTH CHECKS      RESTARTS CREATED   
29d50728 2       lhr    run     running (leader)  3 total, 3 passing 0        2m5s ago  
f5a70632 2       lhr    run     running (replica) 3 total, 3 passing 0        3m47s ago

That status message is for the previous deployment, which appears to have failed, but the VMs worked themselves out afterwards.

Ah. Yes, it still shows failed as the status. It would be good if that were the current status, not the prior one, so that it updated on a new restart/deploy.

Did you do flyctl restart?

Is that what I should do? Ideally that would happen automatically if your system could detect the error. I guess it can’t be based on (status === ‘failed’) if that status doesn’t reflect the current state, so it would need to be some other monitor.

I stopped and restarted each VM. Our shared Consul service vanished from DNS earlier (a different bug), which likely knocked your cluster out. We’ve disabled a lot of auto-restart features on the Postgres VMs because normally you don’t want them! But it’s brittle.

We’re actively working on this; the stolon + Consul setup we have is not making us very happy, so we’re trying to figure out more reliable options. Stolon errs on the side of “being really, really safe with data”, which is good, but its Consul implementation isn’t very resilient.

Ah, that would explain it. Thanks. Makes sense.

I’m willing you on :slight_smile: A reliable and cheap hosted database remains elusive. Seems like you can have one or the other. If you could be both that would be neat.

It should be doable; the expensive part is having people respond to alerts.

If you run into an issue like this again, you can try fly vm stop <id> on one of the VMs and see if it helps.
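
For example (the app name below is just a placeholder, and the instance ID is whichever one the status output shows misbehaving):

  fly status -a <your-pg-app>
  fly vm stop <id> -a <your-pg-app>

Then run fly status again after a minute or two to check that the instance has come back and its health checks are passing.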


Hmm. Well, not sure how practical this is given the numbers, but maybe you could have an automated health test that tries to connect to a database app and run a query, even something trivial like listing tables or databases (so it needs barely any CPU/RAM and should respond instantly). If the query fails, restart the db instance, like you did (stop/start, whatever). Something like the sketch at the end of this post.

As that fixed it.

That would save human intervention (inevitably at like 2am).

Otherwise, yes, it’s a case of the customer (or someone at your end) happening to notice an error and manually fixing it, which isn’t ideal.
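
To make that concrete, here’s a rough sketch of the kind of check I mean, in Python. These are my assumptions, not how your system actually works: psycopg2 for the connection, flyctl on the PATH for the restart, placeholder values for the DSN and app name, and the VM ID just taken from the logs above.

# Rough sketch only: a tiny monitor that runs a trivial query and, if it
# fails, bounces the VM the same way a human would. The DSN, app name and
# VM ID below are placeholders / taken from this thread, not real config.
import subprocess
import psycopg2

DSN = "postgres://monitor:password@my-pg-app.internal:5432/postgres"  # placeholder
APP = "my-pg-app"   # placeholder app name
VM_ID = "9fff7785"  # instance ID, e.g. from the logs above

def db_is_healthy() -> bool:
    try:
        # Trivial query: needs no real CPU/RAM and should return instantly.
        conn = psycopg2.connect(DSN, connect_timeout=5)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
        finally:
            conn.close()
    except psycopg2.Error:
        return False

if not db_is_healthy():
    # Same fix as the manual one: stop the VM and let it come back up.
    subprocess.run(["flyctl", "vm", "stop", VM_ID, "--app", APP], check=True)

Cron that every minute or two and it would presumably have caught this before anyone had to be woken up.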

That’s how script health checks and restarts work for “normal” Fly apps. We just have it disabled for pg apps to prevent flapping restarts from breaking failover. In this case, restarting still failed until Consul came back up. That’s why either fixing or replacing Consul for stolon is a high priority.


Ok. Yep, I agree :slight_smile:
