Postgres app claiming instance is hitting resource limits

Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help. [✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (30.91µs)

I’m seeing this showing up in my logs after noticing a performance decrease in my app. I’m not sure what the issue is, as the volume is less than 50% used.
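For a sense of scale, the figure in that health check can be turned into a share of the sampling window. This is only a minimal sketch: it assumes the message keeps the exact wording quoted above and that both numbers are in seconds, and `cpu_wait_fraction` is a hypothetical helper name, not part of any Fly.io tooling.

```python
import re


def cpu_wait_fraction(message: str) -> float:
    """Return the fraction of the sampling window spent waiting on CPU.

    Assumes the wording of the health check message quoted in this thread.
    """
    match = re.search(
        r"system spent ([\d.]+)s of the last ([\d.]+) seconds waiting on cpu",
        message,
    )
    if match is None:
        raise ValueError("unrecognized health check message")
    waited, window = float(match.group(1)), float(match.group(2))
    return waited / window


msg = "[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (30.91µs)"
print(f"{cpu_wait_fraction(msg):.1%}")  # share of the 10-second window spent waiting
```

With the numbers from the message, that works out to roughly 11% of the window spent waiting for CPU time.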

Instance has been restarted, but the issue persists.

Can someone assist?


Same problem here: https://community.fly.io/t/high-steal-cpu-usage-2/21176

Thanks – that time period looks about the same as my instance.

Sorry, actually shorter, but still, out of the blue.

Identified - We have identified a bad commit that has disrupted some of our platform operations and are working to roll back quickly.
Aug 07, 2024 - 19:11 UTC

I assume this is it…

Hi @miker,

Can you please share a machine ID from your app so I can look it up and diagnose further? Ideally, the machine from which you got that load avg graph you showed earlier.

FWIW I don’t think it’s related to the incident you mentioned :slight_smile:

- Daniel

Hi Daniel – machine ID is 3d8d3e5bee6d18. Managed to restart it once, but not again - stuck waiting on lease. Logs now say the database is malfunctioning.

Hi @miker,

Unfortunately, the restart was probably affected by the incident. If possible, try restarting the machine again now that the incident is resolved.

If you care about data availability, it’s usually a good idea to have three Postgres machines in a cluster. That way, if one goes down, the remaining two can still provide service.

I think I know what’s going on, stand by, I’ll report back in a few minutes.

My machine is now spamming this message into the logs (about 8 times per second), and its 5-minute load average nearly peaked at 100%:

sentinel | 2024-08-07T19:31:21.629Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "9da7df65", "keeper": "41d10d41dd2"}

Machine ID: 9080292a666078
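In case it helps anyone measure the rate of that sentinel warning rather than eyeballing it, here is a rough sketch. It assumes the fields after the `sentinel |` prefix are tab-separated (timestamp, level, source, message, JSON payload), as they appear in the line above; `parse_sentinel_line` and `warnings_per_second` are hypothetical helper names.

```python
import json
from collections import Counter


def parse_sentinel_line(line: str) -> dict:
    """Split one 'sentinel | ...' log line into its parts (assumed tab-separated)."""
    _, _, rest = line.partition("|")
    ts, level, source, message, payload = rest.strip().split("\t", 4)
    return {
        "ts": ts,
        "level": level,
        "source": source,
        "message": message,
        "fields": json.loads(payload),
    }


def warnings_per_second(lines) -> Counter:
    """Count WARN lines per one-second bucket (timestamp truncated to seconds)."""
    counts = Counter()
    for line in lines:
        entry = parse_sentinel_line(line)
        if entry["level"] == "WARN":
            counts[entry["ts"][:19]] += 1
    return counts


sample = (
    "sentinel | 2024-08-07T19:31:21.629Z\tWARN\tcmd/sentinel.go:276\t"
    'no keeper info available\t{"db": "9da7df65", "keeper": "41d10d41dd2"}'
)
entry = parse_sentinel_line(sample)
print(entry["message"], entry["fields"]["keeper"])
print(warnings_per_second([sample] * 8))
```

Feeding it the real log lines instead of the repeated sample would show whether the warning rate drops after a restart.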

Thanks Daniel – the restart worked this time. I can certainly add machines in time, when the service I’m running starts bringing some money in. Right now, it’s free.

A single-machine database is free at that scale, but how much is your data worth? :wink:

On a more serious note: we spotted some suspicious activity on the host where your machine lives. We’ve cleared that out, so your database should now respond more quickly and the CPU health check should also clear up.

Regards,

Well, there is the volume with snapshots, but you’re not wrong at all :grin:

Thanks! I’ll monitor for the next few mins and see what happens.

Looks like it’s slowly recovering (CPU-wise) and responses appear to be quicker as well. Thanks for the quick assist!
