Our database is a cluster made of 1 leader and 2 replicas. Our leader stopped responding to health checks, so we tried to restart it. After the restart, another VM takes its place and responds for a short time, and then it fails health checks too.
I noticed that it can sometimes take a little while for health checks to update; I’ll make a note to look deeper into that.
With regard to the failing health check: the VM checks with the format <metric>: seconds waiting over the last <interval> are pressure checks that actually report a percentage of time rather than seconds.
The failing check you’re seeing should really say: The system spent 5.2% of the last 10 seconds waiting for CPU, which translates to roughly half a second vs. 5.2 seconds…
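To make the arithmetic concrete, here is a minimal sketch of the conversion, assuming the check value is a percentage of the interval as described above. The function names `pressure_to_seconds` and `describe_pressure` are made up for illustration and are not part of flyctl or the health check scripts.

```python
def pressure_to_seconds(percent: float, interval_seconds: float) -> float:
    """Convert a pressure percentage over an interval into seconds spent waiting."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return interval_seconds * percent / 100.0


def describe_pressure(metric: str, percent: float, interval_seconds: float) -> str:
    """Build the message the check arguably should print (hypothetical wording)."""
    waited = pressure_to_seconds(percent, interval_seconds)
    return (
        f"The system spent {percent}% of the last {interval_seconds:g} seconds "
        f"waiting for {metric} (about {waited:.2f} seconds)"
    )


# The example from this thread: 5.2% of a 10 second window.
print(describe_pressure("CPU", 5.2, 10))
```

So a reading of 5.2 over a 10 second window means roughly half a second of waiting, not 5.2 seconds.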
This is a known bug that should be resolved very soon.
Now the rpc errors are back. We had a replica that had not been responding to health checks for hours, so we stopped it. But the instance that came up in its place is now dead:
>flyctl status --app database
App
Name = database
Owner = paypack
Version = 16
Status = running
Hostname = database.fly.dev
Instances
ID       TASK VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
7851a79c app  16      lhr    run     running (rpc error: c) 3 total, 3 critical 0        14m36s ago
cebd72e9 app  16      lhr    run     running (rpc error: c) 3 total, 3 critical 0        22m25s ago
801716ef app  16      lhr    run     running (leader)       3 total, 3 passing  0        22h33m ago
Looking at one of the instances:
>flyctl vm status --app database 7851a79c
Instance
ID = 7851a79c
Task =
Version = 16
Region = lhr
Desired = run
Status = running (rpc error: c)
Health Checks = 3 total, 3 critical
Restarts = 0
Created = 17m54s ago
Recent Events
TIMESTAMP            TYPE       MESSAGE
2021-09-15T09:42:54Z Received   Task received by client
2021-09-15T09:42:54Z Task Setup Building Task Directory
2021-09-15T09:42:57Z Started    Task started by client
Checks
ID   SERVICE STATE    OUTPUT
vm   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
role app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
pg   app     critical rpc error: code = Unknown desc = Post "http://unix/v1/exec": EOF
@shema This is an unfortunate bug related to our older script checks. I went ahead and upgraded your cluster to use our latest image, which addresses this issue. Sorry for the inconvenience!
I’m seeing the same thing on one of my clusters. How does one go about doing the upgrade, and is there an easy way to see what version is currently running?
But when I checked the machine logs, it was still returning a bunch of errors such as no keeper info available and failed to update keeper info {"error": "Unexpected response code: 500 (No cluster leader)"}, like this: