Cluster leader failing health checks waiting for CPU

Our database is a cluster made up of 1 leader and 2 replicas. The leader stopped responding to health checks, so we tried restarting it. After the restart, another VM takes its place and responds for a short time, and then it fails health checks too.

Fly checks provided this message:

Running fly status --app database

Running flyctl vm status --app database dc733d59
Running flyctl logs --app database

@shemafred3 taking a look!

I noticed that it can sometimes take a little while for health checks to be updated. I’ll make a note to look deeper into that.

Regarding the failing health check: VM checks with the format <metric>: seconds waiting over the last <interval> are pressure checks, and they actually report a percentage of time rather than seconds.

The failing check you’re seeing should really say:
The system spent 5.2% of the last 10 seconds waiting for CPU, which translates to roughly half a second vs. 5.2 seconds…

This is a known bug that should be resolved very soon.
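To make the arithmetic concrete, here is a minimal sketch of the conversion. The function name `stalled_seconds` is made up for illustration and is not part of flyctl or the check scripts; it only assumes the reported number is a percentage of the interval, as described above.

```python
def stalled_seconds(pressure_percent: float, interval_seconds: float) -> float:
    """Convert a pressure percentage over an interval into seconds spent waiting.

    Hypothetical helper for illustration only; assumes the check value is a
    percentage of the interval rather than a count of seconds.
    """
    return pressure_percent / 100.0 * interval_seconds


# The check from this thread: 5.2% of the last 10 seconds waiting for CPU.
print(f"{stalled_seconds(5.2, 10):.2f} seconds")
```

So a check that reads "5.2 seconds waiting over the last 10 seconds" is really describing about half a second of waiting.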

Now the rpc errors are back. We had a replica that had not been responding to health checks for hours, so we stopped it. But the instance that replaced it is now dead too:

>flyctl status --app database
App
  Name     = database
  Owner    = paypack
  Version  = 16
  Status   = running
  Hostname = database.fly.dev

Instances
ID       TASK VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
7851a79c app  16      lhr    run     running (rpc error: c) 3 total, 3 critical 0        14m36s ago
cebd72e9 app  16      lhr    run     running (rpc error: c) 3 total, 3 critical 0        22m25s ago
801716ef app  16      lhr    run     running (leader)       3 total, 3 passing  0        22h33m ago

Looking at one of the instances:

>flyctl vm status --app database 7851a79c
Instance
  ID            = 7851a79c
  Task          =
  Version       = 16
  Region        = lhr
  Desired       = run
  Status        = running (rpc error: c)
  Health Checks = 3 total, 3 critical
  Restarts      = 0
  Created       = 17m54s ago

Recent Events
TIMESTAMP            TYPE       MESSAGE
2021-09-15T09:42:54Z Received   Task received by client
2021-09-15T09:42:54Z Task Setup Building Task Directory
2021-09-15T09:42:57Z Started    Task started by client

Checks
ID   SERVICE STATE    OUTPUT
vm   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
role app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
pg   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF

@shemafred3 This is an unfortunate bug related to our older script checks. I went ahead and upgraded your cluster to use our latest image, which addresses this issue. Sorry for the inconvenience!

I’m seeing the same thing on one of my clusters. How does one go about doing the upgrade, and is there an easy way to see what version is currently running?