Hi @Biggles,
Thanks for sharing this issue with the Grafana dashboards! I see the same thing on our Grafana dashboard, and have a couple of answers for you:
- The CPU Time panel queries do have a bug: it looks like the unit for this metric is actually not seconds but kernel ticks (which are 1/100 of a second), so removing the `* 100` will give more accurate results. The documentation states this incorrectly. Thanks for catching this @Wayrunner, I’ll make sure this gets updated soon.
  I’ve also found that using `rate` instead of `irate` gives smoother results, which may be more useful (see this StackOverflow answer for more detail/discussion).
- We just launched a new set of dashboards on a managed Grafana instance at fly-metrics.net, which includes an updated CPU Utilization panel that should work correctly. These new dashboards will soon be added to the public repo and published to Grafana’s portal so you can easily import them into your own Grafana instance.
This dashboard uses the following query, which I would recommend:
(sum(rate(fly_instance_cpu{instance="$instance",mode!="idle"}[60s]))by(mode)>0) / count(fly_instance_cpu{instance="$instance",mode="idle"})
Dividing the `sum{mode!="idle"}` by the `count{mode="idle"}` gives you a 0-100% (non-idle) utilization across all CPUs on the instance. The ‘per-cpu’ utilization query gives you 0-100% utilization for each individual CPU:
sum(rate(fly_instance_cpu{instance="$instance",mode!="idle"}[60s]))by(cpu_id)
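If it helps to see the arithmetic behind the two queries, here is a rough Python sketch. It is not how Prometheus evaluates them; it just mimics the math on made-up counter samples (the values, the two-CPU layout, and the function names are all assumptions for illustration). Because the counters are in ticks of 1/100 s, a fully busy CPU advances by about 100 ticks per second, so the per-second rate already reads as a percentage.

```python
# Hypothetical counter samples in kernel ticks (1 tick = 1/100 s), taken 60 s apart.
# Keys are (cpu_id, mode); all values are invented for illustration.
WINDOW_SECONDS = 60

before = {
    (0, "user"): 1000, (0, "system"): 500, (0, "idle"): 8000,
    (1, "user"): 2000, (1, "system"): 300, (1, "idle"): 9000,
}
after = {
    (0, "user"): 4000, (0, "system"): 1100, (0, "idle"): 10400,
    (1, "user"): 2600, (1, "system"): 600, (1, "idle"): 14100,
}


def per_second_rates(before, after, window_seconds):
    """Ticks per second for each series; a simplified stand-in for rate()."""
    return {key: (after[key] - before[key]) / window_seconds for key in after}


def utilization_by_mode(rates):
    """Sum non-idle rates per mode, then divide by the number of CPUs.

    The CPU count is taken as the number of idle series, one per CPU.
    """
    cpu_count = sum(1 for (_, mode) in rates if mode == "idle")
    totals = {}
    for (_, mode), value in rates.items():
        if mode != "idle":
            totals[mode] = totals.get(mode, 0.0) + value
    # Drop modes with no activity, like the "> 0" filter in the query.
    return {mode: total / cpu_count for mode, total in totals.items() if total > 0}


def utilization_by_cpu(rates):
    """Sum non-idle rates per CPU, giving 0-100 for each individual CPU."""
    totals = {}
    for (cpu_id, mode), value in rates.items():
        if mode != "idle":
            totals[cpu_id] = totals.get(cpu_id, 0.0) + value
    return totals


rates = per_second_rates(before, after, WINDOW_SECONDS)
by_mode = utilization_by_mode(rates)
by_cpu = utilization_by_cpu(rates)
print(by_mode)  # {'user': 30.0, 'system': 7.5}
print(by_cpu)   # {0: 60.0, 1: 15.0}
```

In this made-up case CPU 0 is 60% busy and CPU 1 is 15% busy, which averages out to 37.5% non-idle across the instance (30% user plus 7.5% system).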
I find that both views together are helpful to get a more complete picture of what’s going on CPU-wise with an instance.
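On the `rate` vs `irate` point from earlier, a small sketch may make the difference clearer. This is a simplification under stated assumptions: the real `rate()` also handles counter resets and extrapolates to the window boundaries, which is ignored here, and the sample values and function names are invented. The idea is only that `rate` averages over the whole window while `irate` looks at the last two samples.

```python
# Hypothetical (timestamp_seconds, counter_value) samples scraped every 15 s.
# The counter is quiet for 45 s, then jumps in the final interval.
samples = [(0, 0), (15, 150), (30, 300), (45, 450), (60, 1950)]


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


print(simple_rate(samples))   # 32.5
print(simple_irate(samples))  # 100.0
```

The short burst dominates the `irate`-style value but is averaged down in the `rate`-style one, which is why the `rate` graph tends to look smoother.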
Hope this is helpful!