Cluster leader failing health checks waiting for CPU

shema · September 14, 2021, 12:17pm

Our database is a cluster made up of 1 leader and 2 replicas. Our leader stopped responding to health checks, so we tried to restart it. After the restart, another VM takes its place and responds for a short time, then it fails health checks too.

Fly checks provided this message:

Running fly status --app database

Running flyctl vm status --app database dc733d59

Running flyctl logs --app database

shaun · September 14, 2021, 12:28pm

@shema taking a look!

shaun · September 14, 2021, 1:24pm

I noticed that it can sometimes take a little while for health checks to be updated. I’ll make a note to look deeper into that.

With regard to the failing health check: the VM checks that have the format <metric>: seconds waiting over the last <interval> are pressure checks, and they actually report a percentage of time rather than a number of seconds.

The failing check you’re seeing should really say:
The system spent 5.2% of the last 10 seconds waiting for CPU, which translates to roughly half a second vs. 5.2 seconds…

This is a known bug that should be resolved very soon.
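
As a quick illustration of the arithmetic above, here is a minimal sketch that converts a pressure percentage into seconds spent waiting. The function name and parameters are hypothetical and are not part of flyctl or the health check itself; the only inputs taken from this thread are the 5.2% figure and the 10 second window.

```python
def pressure_to_seconds(percent: float, window_seconds: float) -> float:
    """Convert a pressure percentage over a window into seconds spent waiting.

    Hypothetical helper for illustration only; not a flyctl API.
    """
    return window_seconds * percent / 100.0


# Values from the failing check discussed above: 5.2% over a 10 second window.
waited = pressure_to_seconds(5.2, 10)
print(f"{waited:.2f} seconds waiting for CPU")
```

Read this way, the check reports about half a second of CPU wait in the window, not 5.2 seconds.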

shema · September 15, 2021, 10:01am

Now the rpc errors are back. We had a replica that had not responded to health checks for hours, so we stopped it. But the instance that replaced it is now dead:

>flyctl status --app database
App
  Name     = database
  Owner    = paypack
  Version  = 16
  Status   = running
  Hostname = database.fly.dev

Instances
ID       TASK VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
7851a79c app  16      lhr    run     running (rpc error: c) 3 total, 3 critical 0        14m36s ago
cebd72e9 app  16      lhr    run     running (rpc error: c) 3 total, 3 critical 0        22m25s ago
801716ef app  16      lhr    run     running (leader)       3 total, 3 passing  0        22h33m ago

Looking at one of the instances:

>flyctl vm status --app database 7851a79c
Instance
  ID            = 7851a79c
  Task          =
  Version       = 16
  Region        = lhr
  Desired       = run
  Status        = running (rpc error: c)
  Health Checks = 3 total, 3 critical
  Restarts      = 0
  Created       = 17m54s ago

Recent Events
TIMESTAMP            TYPE       MESSAGE
2021-09-15T09:42:54Z Received   Task received by client
2021-09-15T09:42:54Z Task Setup Building Task Directory
2021-09-15T09:42:57Z Started    Task started by client

Checks
ID   SERVICE STATE    OUTPUT
vm   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
role app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
pg   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF

shaun · September 15, 2021, 2:50pm

@shema This is an unfortunate bug related to our older script checks. I went ahead and upgraded your cluster to use our latest image, which addresses this issue. Sorry for the inconvenience!

bekit · September 15, 2021, 5:57pm

I’m seeing the same thing on one of my clusters. How does one go about doing the upgrade, and is there an easy way to see what version is currently running?

pawlarius · August 15, 2023, 3:45pm

Hi @shaun, I tried to upgrade my psql image with flyctl image update --app [my-psql-app-name], and the command said it succeeded, like this:

But when I checked the machine logs, it was still returning a bunch of errors, such as no keeper info available and failed to update keeper info {"error": "Unexpected response code: 500 (No cluster leader)"}, like this:

Could you help advise what we should do?
Since I’ve tried upgrading the image and the errors won’t go away…
