Database down again (lhr)

greg · April 11, 2021, 3:35pm

Hello,

I noticed some requests to a cron job failing despite nothing having changed. I tried connecting locally to the database it uses, and couldn’t. Hmm. I tried both ports 5432 and 5433, since last time this happened the proxy was down but the VM was working (see: Possible issue with database).
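For anyone who wants to repeat the "try both ports" check described above, here is a minimal sketch in Python. It only tests whether a TCP connection can be opened, not whether Postgres accepts a login. The function names are made up for illustration, and HOSTNAME in the usage note is a placeholder for your own database host.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_postgres_ports(host: str, ports=(5432, 5433)) -> dict:
    """Check each port and return a {port: reachable} mapping."""
    return {port: port_open(host, port) for port in ports}
```

Calling `check_postgres_ports("HOSTNAME")` returns a dict with one entry per port. If both come back False, as they did here, the problem is probably not limited to the proxy.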

But neither connected.

So … I looked at the status and there appear to be no VMs listed … ?!

flyctl --app NAME status
App
Name = NAME
Owner = NAME
Version = 1
Status = pending
Hostname = NAME

Deployment Status
ID = 6dd0077a-e3ab-a5b5-97a4-25966bb1d394
Version = v1
Status = successful
Description = Deployment completed successfully
Instances = 2 desired, 2 placed, 2 healthy, 0 unhealthy

Instances
ID VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED

It seems to have moved to pending and got stuck, with no VMs listed. Perhaps it tried to restart itself? I know you had ams issues yesterday, but was lhr down too? Strange.

I checked the logs for that pgdb app (the com one) and sure enough there are pages of sentinel errors. Any thoughts? I’m guessing it has been down for at least 5 hours, but it may be longer. It was working 24 hours ago.

Lots of this stuff:

2021-04-11T14:38:03.635Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:03.632Z INFO cmd/sentinel.go:995 master db is failed {"db": "8b5a7116", "keeper": "fdaa05d9a7ba9a0fa22"}
2021-04-11T14:38:03.637Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:03.635Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2021-04-11T14:38:03.639Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:03.637Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (--can-be-master=false) {"db": "cbef15e3", "keeper": "fdaa05d9a7ba980fa32"}
2021-04-11T14:38:03.641Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:03.640Z ERROR cmd/sentinel.go:1009 no eligible masters
2021-04-11T14:38:09.219Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:09.216Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8b5a7116", "keeper": "fdaa05d9a7ba9a0fa22"}
2021-04-11T14:38:09.224Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:09.222Z INFO cmd/sentinel.go:995 master db is failed {"db": "8b5a7116", "keeper": "fdaa05d9a7ba9a0fa22"}
2021-04-11T14:38:09.226Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:09.224Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2021-04-11T14:38:09.230Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:09.228Z INFO cmd/sentinel.go:741 ignoring keeper since it cannot be master (--can-be-master=false) {"db": "cbef15e3", "keeper": "fdaa05d9a7ba980fa32"}
2021-04-11T14:38:09.232Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:09.230Z ERROR cmd/sentinel.go:1009 no eligible masters
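With pages of output like this, it can help to collapse the repeats into counts. The sketch below is one way to do that in Python. It assumes the line layout seen in the excerpt above (outer timestamp, instance ID, region, then `sentinel | <timestamp> <LEVEL> <file:line> <message>`), and the regex and function name are my own, not part of flyctl or the sentinel.

```python
import re
from collections import Counter

# Matches the inner sentinel part of each line, as seen in the excerpt.
LINE_RE = re.compile(
    r"sentinel \| (?P<ts>\S+) (?P<level>INFO|WARN|ERROR) "
    r"(?P<src>cmd/sentinel\.go:\d+) (?P<msg>.*)$"
)

# Three lines copied from the log excerpt.
SAMPLE = [
    "2021-04-11T14:38:03.637Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:03.635Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master",
    "2021-04-11T14:38:03.641Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:03.640Z ERROR cmd/sentinel.go:1009 no eligible masters",
    "2021-04-11T14:38:09.232Z 65e6a368 lhr [info] sentinel | 2021-04-11T14:38:09.230Z ERROR cmd/sentinel.go:1009 no eligible masters",
]


def summarize(lines):
    """Count sentinel messages by (level, message), ignoring non-matching lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m:
            continue
        # Drop any trailing {...} payload so repeated messages group together.
        msg = m.group("msg").split(" {", 1)[0]
        counts[(m.group("level"), msg)] += 1
    return counts


if __name__ == "__main__":
    for (level, msg), n in summarize(SAMPLE).most_common():
        print(f"{n:4d}  {level:5s}  {msg}")
```

Fed the full log instead of `SAMPLE`, the count for `("ERROR", "no eligible masters")` gives a rough idea of how many failed election attempts there were.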

I haven’t restarted it (which is how you fixed it before) so that you can check its status at your end. Feel free to restart it, though.

kurt · April 11, 2021, 3:38pm

Was this app previously in a different region? That message means the specific postgres instance is configured as a read-only replica. Our script does this when the DB is out of its primary region.

I’ll have a look and get your DB back.

kurt · April 11, 2021, 3:47pm

Somehow this one was set to Chicago for its primary region. If you started there and then migrated it to LHR, that makes sense. If you did not, please let me know because we need to figure out what caused it.

To make things worse, the health check failures (can’t find leader) were causing the instances to restart multiple times, then ultimately fail.

You should be back online now. We’re working on preventing those restarts when things go unhealthy, which should keep your DB from dying entirely when there are strange issues.

greg · April 11, 2021, 3:51pm

No. It’s always been LHR. That’s where it was started and I have not moved it or deployed it or changed it since.

I looked at flyctl regions list out of interest, and the only options there are lhr, with ams and cdg as the backup pool. Chicago?! Nope.

kurt · April 11, 2021, 3:53pm

Well would you look at that: postgres-ha/fly.toml at main · fly-apps/postgres-ha · GitHub

I think your cluster might’ve been created before we were setting that properly (and before we introduced the readonly role). When it died, it pulled the newer postgres-ha image and suddenly started enforcing that.
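To make the failure mode concrete, here is a toy model in Python of the rule described in this thread: an instance outside the primary region is treated as a read-only replica and cannot become master. This is not the postgres-ha code. The class, function names, and keeper IDs are hypothetical, and `ord` is assumed to be the region code for Chicago.

```python
from dataclasses import dataclass


@dataclass
class Keeper:
    keeper_id: str  # hypothetical ID, for illustration only
    region: str     # region the instance actually runs in


def can_be_master(keeper: Keeper, primary_region: str) -> bool:
    """An instance may lead only if it runs in the configured primary region."""
    return keeper.region == primary_region


def eligible_masters(keepers, primary_region: str):
    """Return the keepers that are allowed to be promoted."""
    return [k for k in keepers if can_be_master(k, primary_region)]


if __name__ == "__main__":
    keepers = [Keeper("keeper-a", "lhr"), Keeper("keeper-b", "lhr")]
    # Primary region set to Chicago while every instance runs in London:
    print(eligible_masters(keepers, "ord"))       # []
    # Primary region matching where the instances run:
    print(len(eligible_masters(keepers, "lhr")))  # 2
```

With every instance in lhr and the primary region pointing elsewhere, the eligible list is empty, which lines up with the "no eligible masters" errors in the logs.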

greg · April 11, 2021, 5:57pm

Ah, well spotted.

It’s not ideal from a data protection point of view if a database can spontaneously move continents. Some applications need personal data stored in the EU or US. So that’s an important catch.

This got me thinking … I don’t know how many applications you host, whether it’s a thousand or a million, so this may not be practical. But maybe it would be good to add an internal check on whether an app has 0 VMs? Given that you don’t support scaling-to-zero yet, no database will ever have 0 VMs if it’s working, and I guess apps won’t either (though maybe suspended ones … hmm, you’d have to consider that as part of the test).

So if an app is not suspended and has 0 VMs, your system can catch that case and jump in and start some, which would avoid human intervention possibly hours later. In this case, no automated system picked up the issue.

Now in this case, it would not have been auto-fixable (given the Chicago issue), so no VMs would have started, but at least some alert would have gone off at your HQ that something was amiss.

Combine that with the prior healthcheck on the proxy and that should avoid two issues.
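The check suggested above could look something like the following Python sketch. Everything here is hypothetical: the `AppState` fields are invented stand-ins for whatever the platform tracks internally. The one detail taken from this thread is the status output, which showed 2 instances desired while none were listed, so the sketch compares desired against running rather than looking at the running count alone.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    name: str
    suspended: bool
    desired_instances: int  # how many VMs the app is supposed to have
    running_instances: int  # how many VMs are actually listed


def needs_attention(app: AppState) -> bool:
    """Flag apps that should be running but have no VMs at all."""
    if app.suspended:
        return False
    return app.desired_instances > 0 and app.running_instances == 0


def apps_to_alert(apps):
    """Return the names of apps that should trigger an alert."""
    return [a.name for a in apps if needs_attention(a)]


if __name__ == "__main__":
    apps = [
        AppState("db-app", suspended=False, desired_instances=2, running_instances=0),
        AppState("web-app", suspended=False, desired_instances=2, running_instances=2),
        AppState("old-app", suspended=True, desired_instances=1, running_instances=0),
    ]
    print(apps_to_alert(apps))  # ['db-app']
```

Comparing against the desired count means an app that is meant to have no instances would not raise a false alarm.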

kurt · April 11, 2021, 9:03pm

You actually can scale to 0; it’s just manual. We do want to catch when apps fail, though. It’s just harder than we expected to filter through the noise. Most of the time apps crash for reasons we can’t fix. Postgres is special because we “wrote” the Fly app for it.

We have a fix going out for one of the bugs that hit you. The health checks were defined in such a way that they’d take a read only replica down if the leader vanished, which is not actually what anyone wants.
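As a rough illustration of the difference, here is a small Python sketch contrasting the two behaviors. It is only a model of what is described in this thread, not the actual health check definitions. The function names, the `role` strings, and the boolean inputs are all assumptions made for the example.

```python
def strict_check(role: str, postgres_up: bool, leader_reachable: bool) -> bool:
    """Model of the problematic behavior: any instance fails if the leader is gone."""
    return postgres_up and leader_reachable


def lenient_check(role: str, postgres_up: bool, leader_reachable: bool) -> bool:
    """Model of the intended behavior: a replica stays healthy while its own
    postgres is up, even if the leader has vanished."""
    if role == "replica":
        return postgres_up
    return postgres_up and leader_reachable


if __name__ == "__main__":
    # A read-only replica whose postgres is up, but with no leader available.
    case = dict(role="replica", postgres_up=True, leader_reachable=False)
    print(strict_check(**case))   # False: the replica would be restarted
    print(lenient_check(**case))  # True: the replica is left running
```

Under the strict version, losing the leader makes every replica fail its check as well, which matches the restart loop seen earlier in the thread.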