High CPU usage all of a sudden

Hello, we are experiencing something weird with our Fly deployment. All of a sudden the CPU utilization spikes and then never goes down. There is no apparent use on the server that should cause it. This is how it looks in Grafana. I’m not sure what “steal” is, but that seems to be the culprit. Is there anything we can do about it?

Worth mentioning is that it’s the same problem on both our Postgres machine and the app (the Remix server).

Are you using swap? We see such spikes when kswapd frantically tries to keep the processes running despite overwhelming memory pressure (probably caused by a memory leak we haven’t yet tracked down); a tonne of madvise syscalls and what not push up CPU utilization (try using strace and friends the next time you catch it happening).
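
As a rough way to check the swap theory above without attaching strace, one could compare two snapshots of `/proc/vmstat`. This is only a sketch: it assumes a Linux guest whose kernel exposes the `pswpin`, `pswpout` and `pgscan_kswapd` counters (missing ones are treated as 0), and the function names are made up for illustration.

```python
import time

# Cumulative counters (since boot) in /proc/vmstat.
# pswpin/pswpout: pages swapped in/out; pgscan_kswapd: pages scanned by kswapd.
# Assumption: the running kernel exposes these names; absent ones count as 0.
COUNTERS = ("pswpin", "pswpout", "pgscan_kswapd")


def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def swap_activity(before, after):
    """Return how much each counter grew between two parsed samples."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in COUNTERS}


def sample_swap_activity(interval=5.0, path="/proc/vmstat"):
    """Read the file twice, `interval` seconds apart, and return the deltas."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_activity(before, after)
```

Calling `sample_swap_activity()` on the machine while the spike is happening should give deltas close to zero if swap is not involved; steadily growing `pswpout` or `pgscan_kswapd` values would point toward memory pressure.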

On our Remix app we use swap, so I will look into it, though the memory utilization is pretty low. On our Postgres machine we don’t use swap, so I don’t think that is the problem for that server. The weird thing is that it happened at the exact same time on both machines, but we didn’t see a spike in users or anything.

I got the same issue yesterday (Sat. 20 January) from 14:00 to 23:00 approx.

I still don’t know why but it stopped.

Did you do anything to fix it? It’s still a problem for us.

Nothing at all. I restarted the VM multiple times but it didn’t help at all.
And suddenly at 23:00, it stopped.

Still no clue about it.

I see, I have also tried restarting them without any change. Very annoying.

I can confirm that our problem solved itself around 23:00-23:20. I have a feeling that it is a problem with the infrastructure and not something we can do anything about. We would like to talk to Fly about it, but we are only on the hobby plan for now. If the problem keeps coming back, we will have to do something.

Today the CPU usage went up again about an hour ago. Will have to keep looking into it.

“steal” can generally be ignored here. We allocate partial cores for CPUs, and you should be able to burst up to the full CPU core if it’s not in use. If it is in use, your machine’s kernel reports that time as “steal”. It doesn’t mean your application is using more CPU; it just means the idle time for your CPU was used elsewhere.

Generally it should be near 0 (we run out of allocatable memory on a host long before we run out of allocatable CPU). If enough machines burst at the same time on the same host, you’ll see steal reported.
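
To see the steal figure described above from inside the machine, independently of the dashboard, one option is to sample the aggregate `cpu` line of `/proc/stat` twice and look at how much of the elapsed time was booked as steal. The sketch below assumes a Linux guest and the field order documented in proc(5); the helper names are hypothetical and this is not how the Grafana panel itself is computed.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat, as listed in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(text):
    """Return the aggregate 'cpu' line of /proc/stat as a dict of tick counts."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, (int(v) for v in parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Share of elapsed CPU time reported as steal between two samples."""
    # guest and guest_nice are already counted inside user and nice, so only
    # the first eight fields are summed for the total.
    counted = FIELDS[:8]
    total = sum(after.get(k, 0) - before.get(k, 0) for k in counted)
    if total <= 0:
        return 0.0
    steal = after.get("steal", 0) - before.get("steal", 0)
    return 100.0 * steal / total


def sample_steal_percent(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return steal %."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return steal_percent(before, after)
```

A high value from `sample_steal_percent()` alongside low `user` and `system` deltas would be consistent with the explanation that other machines on the host were using the CPU, rather than the app or database doing more work.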

I see, but what I don’t understand is why we have been getting these spikes over the last two weeks or so. The only indication that something is happening is that our website loads many times slower than it does 90% of the time, and it is not due to our own traffic to the page. The only thing we can see changing (on two different machines, our Postgres and our Remix app, at the exact same time) is our CPU utilization. This image is from today:

We did not have a spike in visitors, and our memory and network didn’t change at all. Only CPU changed, and this is what it looks like if we inspect it more closely:

I’m not an expert in this kind of stuff, so I’m just trying to figure out what is happening, but it seems to me that the cause is something on the server other than our Remix app / Postgres database, since we didn’t see an increase in users at all.

Do you have a tip for us on how to figure out what is going on? Could it be someone else on the shared server doing something heavy?

Edit: The “spikes” around 18:15 are our increase in users and when we were testing stuff, so that is as it should be. But as you can see, around 12:00 it goes up WAY more than we can make it do ourselves even if we try.
