Idle CPU Usage on Grafana seems extremely high: 70,000%+

Hi!

I recently hooked up my fly.io app (a multiplayer turn-based game server that operates over websockets) to the standard Grafana dashboard you guys provide & it seems to work great, although I noticed that even when I have no active connections, the “Idle” CPU usage seems consistently extremely high, e.g. at 73,059% in the screenshot below:

The query for the “Idle” CPU time is:

sum(irate(fly_instance_cpu{mode!='idle',mode!='user',mode!='system',mode!='iowait',mode!='irq',mode!='softirq',app=~"^$app$",region=~"^$region$",host=~"^$host$"}[$__rate_interval])) * 100

When I filter out the “Idle” CPU time, usage still seems pretty high, with a mysterious “other” category frequently being over 100%:

I’m assuming that this is just a quirk of how these numbers are calculated / presented. I’ve heard that this number often doesn’t take multi-core architectures into account, for example.
It certainly doesn’t look like we’re using a crazy amount of CPU time:

We’re currently on the free tier while testing out fly.io, but I’m keen to migrate the game server over permanently. Since our app is quite likely to need work to improve performance / efficiency in the future, and I’m relatively new to the languages and frameworks involved, I’m keen to have access to accurate performance metrics to ensure:

  1. We’re not doing anything really stupid in terms of performance that will result in high usage costs when we go live
  2. We can accurately measure the effectiveness of future changes which aim to improve performance / efficiency etc.

Any insight into why these numbers appear so high or any changes I should make to ensure Grafana displays more accurate / actionable info would be much appreciated!

Many thanks!

We have the same problem. I think that fly_instance_cpu is actually the time the CPU spends in each mode (e.g. being idle), not a percentage.

From the docs:

cpu is derived from /proc/stat, and counts the number of seconds each CPU (cpu_id) has spent performing different kinds of work (mode, which may be one of user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

Even given that, I still don’t understand how the numbers end up looking like this. Some more documentation around this would be greatly appreciated.

Hi @Biggles,

Thanks for sharing this issue with the Grafana dashboards! I see the same thing on our Grafana dashboard, and have a couple of answers for you:

  1. The CPU Time panel queries do have a bug: it looks like the unit for this metric is actually not seconds but kernel ticks (which are 1/100 of a second), so removing the * 100 will give more accurate results. The documentation states this incorrectly. Thanks for catching this @Wayrunner, I’ll make sure this gets updated soon.
    I’ve also found that using rate instead of irate gives smoother results, which may be more useful (see this StackOverflow answer for more detail/discussion).

  2. We just launched a new set of dashboards on a managed Grafana at fly-metrics.net, which include an updated CPU Utilization panel that should work correctly. These new dashboards will soon be added to the public repo and published to Grafana’s portal so you can easily import them into your own Grafana instance.

    This dashboard uses the following query which I would recommend:

    (sum(rate(fly_instance_cpu{instance="$instance",mode!="idle"}[60s]))by(mode)>0) / count(fly_instance_cpu{instance="$instance",mode="idle"})

    Dividing the sum{mode!="idle"} by count{mode="idle"} gives you a 0-100% (non-idle) utilization across all CPUs on the instance.

    The ‘per-cpu’ utilization query gives you 0-100% utilization for each individual CPU:

    sum(rate(fly_instance_cpu{instance="$instance",mode!="idle"}[60s]))by(cpu_id)

    I find that both views together are helpful to get a more complete picture of what’s going on CPU-wise with an instance.
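
To make the arithmetic behind these queries concrete, here is a minimal Python sketch of what they compute. It is only an illustration, not the dashboard's implementation: the counter samples are made up, the function names are hypothetical, and it assumes (as described above) that the counter advances in ticks of 1/100 of a second, so one fully busy CPU accumulates about 100 ticks per second.

```python
from collections import defaultdict

# Made-up counter samples taken 60 seconds apart, keyed by (cpu_id, mode).
# Values are cumulative ticks, assumed to be 1/100 of a second each.
ELAPSED_SECONDS = 60
before = {
    (0, "idle"): 100_000, (0, "user"): 5_000, (0, "system"): 1_000,
    (1, "idle"): 104_000, (1, "user"): 1_500, (1, "system"): 500,
}
after = {
    (0, "idle"): 105_400, (0, "user"): 5_480, (0, "system"): 1_120,
    (1, "idle"): 110_000, (1, "user"): 1_500, (1, "system"): 500,
}


def per_second_rates(before, after, elapsed_seconds):
    """Rough equivalent of rate(): counter increase per second for each series."""
    return {key: (after[key] - before[key]) / elapsed_seconds for key in after}


def legacy_panel_value(rates, mode):
    """What the old panel showed: sum of the rates for one mode, times 100."""
    return sum(rate for (_, m), rate in rates.items() if m == mode) * 100


def utilization_by_mode(rates):
    """Sum non-idle rates by mode, then divide by the number of CPUs."""
    cpu_count = len({cpu_id for (cpu_id, mode) in rates if mode == "idle"})
    totals = defaultdict(float)
    for (_, mode), rate in rates.items():
        if mode != "idle":
            totals[mode] += rate
    # Mirrors the "> 0" filter: modes with no activity are dropped.
    return {mode: total / cpu_count for mode, total in totals.items() if total > 0}


def utilization_by_cpu(rates):
    """Sum non-idle rates for each CPU separately."""
    totals = defaultdict(float)
    for (cpu_id, mode), rate in rates.items():
        if mode != "idle":
            totals[cpu_id] += rate
    return dict(totals)


rates = per_second_rates(before, after, ELAPSED_SECONDS)
print(legacy_panel_value(rates, "idle"))  # 19000.0, shown as "19,000%"
print(utilization_by_mode(rates))         # {'user': 4.0, 'system': 1.0}
print(utilization_by_cpu(rates))          # {0: 10.0, 1: 0.0}
```

With these sample numbers, two almost idle CPUs already produce an "idle" reading of 19,000 once the extra * 100 is applied. Summing over many instances (for example with the app filter set to All) would push the figure higher still.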

Hope this is helpful!

Brilliant, thanks for the fast response, that’s cleared up a whole bunch of stuff! I also realised I still had the app field set to All, but after implementing some of your suggested changes, I’m now getting far more sensible results for our playtest server:

This represents a 3-player internal test game from 16:00 - 16:30, followed by what appears to be a couple of games in progress after the announcement on our Discord around 18:00. If this is accurate, it looks promising!

Now I’ve just got to work out how to get custom metrics into Grafana, but I think I’m pretty close on that front.

Thanks again for the help!