Load issue with Postgres

For context, I am on a performance-8x machine with 16 GB of RAM.

I have been chasing a performance issue with my database for a few days now, and I think I finally have a small clue. For reasons still unknown to me, CPU usage stays low for most of the day, under 30%. The issue is that the load spikes to more than 100% throughout the day, and it does so in a very noticeable pattern: the spikes arrive at 90-100 minute intervals. What I have not been able to identify is what causes them.

I have looked into scheduled jobs, autovacuum/autoanalyze operations, and replication. I have toyed with worker settings, shared_buffers, connection limits, you name it. My traffic volume is fairly constant, so if this were application-load related I would expect constant load issues and perhaps higher CPU usage. The spikes on such a marked, specific cycle make me think something else is going on.
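One thing that has helped me narrow things down is snapshotting `pg_stat_activity` the moment a spike starts, to see whether backends are busy running queries or stuck waiting on I/O. This is just a generic sketch using standard catalog columns, nothing specific to my app:

```sql
-- What is every non-idle backend doing right now? Run during a spike.
SELECT pid,
       state,
       wait_event_type,            -- e.g. IO, Lock, LWLock
       wait_event,                 -- e.g. DataFileRead, WALWrite
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;
```

A pile of backends sitting in `wait_event_type = 'IO'` during the spike would point at storage pressure rather than CPU-bound queries.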

The latest clue I have is that it has to do with checkpoints. Searching through the logs, I came across the messages below. While checkpoints happen more often than the load spikes, I have seen one at the start of every spike. The actual checkpoint process seems to last about 5 minutes, but each spike period lasts around 30-40 minutes.

2024-11-22T06:33:05.816 app[1781959f4de2d8] iad [info] postgres | 2024-11-22 06:33:05.815 UTC [477] LOG: checkpoint complete: wrote 32949 buffers (6.3%); 0 WAL file(s) added, 0 removed, 6 recycled; write=269.959 s, sync=0.743 s, total=270.705 s; sync files=44, longest=0.434 s, average=0.017 s; distance=94509 kB, estimate=94509 kB; lsn=DE/4547FE20, redo lsn=DE/1712E1B8

2024-11-22T06:33:27.206 proxy[1781959f4de2d8] iad [error] [PC05] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)

2024-11-22T06:33:27.223 proxy[1781959f4de2d8] iad [error] [PC05] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)

Now, the timeouts right after the checkpoint are kind of telling. My theory so far is this:

  • Initial checkpoint causes heavy I/O
  • I/O pressure causes connection timeouts
  • Failed connections cause retries
  • System triggers another checkpoint due to load
  • Creates a cascading effect
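One way to probe the "system triggers another checkpoint due to load" step is the cumulative checkpoint counters. On Postgres 16 and earlier they live in `pg_stat_bgwriter` (Postgres 17 moved the checkpoint columns into `pg_stat_checkpointer`); a sketch:

```sql
-- Timed checkpoints fire on checkpoint_timeout; requested ones fire
-- because max_wal_size was reached (or a manual CHECKPOINT ran).
SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,       -- written by the background writer
       buffers_backend      -- written directly by backends (bad sign if high)
FROM pg_stat_bgwriter;
```

If `checkpoints_req` climbs during the spikes, it is WAL volume, not the timer, forcing the extra checkpoints.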

For more context, I even upgraded to a 16x machine for a couple of days to see if that would help, but the spikes continued even then. A 16x machine is obviously overkill for what I am doing; even the 8x one is over-provisioned, but I can work with that for a few more days.

Has anyone experienced similar issues? I am not a Postgres expert by any means, and I am not very familiar with things like replication, WAL settings, and similar. Then again, I am running this on a single machine. (I know I don't have redundancy, but my use case is very specific and I can live without it.)

Here are some charts that show the spikes in a very regular pattern.

An update after another day of trying to find what is causing those periodic spikes:

I tried adjusting the checkpoint settings. After going through a ton of documentation from Percona and Postgres, I applied the changes below. Although checkpoints now start every 15 minutes instead of every 5, the spikes still arrive about 100 minutes apart, so I think it is safe to say this is not checkpoint related.

-- Increase I/O thresholds (in 8 kB units)
ALTER SYSTEM SET checkpoint_flush_after = '128';     -- 128 x 8 kB = 1 MB
ALTER SYSTEM SET bgwriter_flush_after = '128';       -- 1 MB
-- Prevent frequent checkpoints
ALTER SYSTEM SET checkpoint_timeout = '900s';        -- 15 minutes
-- More aggressive background writing
ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;
ALTER SYSTEM SET bgwriter_delay = '50ms';
-- All of these are reloadable; apply them without a restart
SELECT pg_reload_conf();

My next theory for the culprit is repmgr. I don't have solid data, but it is the only thing that comes to mind that might be running background tasks on a schedule.

Running this query gives the result below, which is not indicative of an issue:

SELECT pid, query
FROM pg_stat_activity
WHERE application_name = 'repmgr';

721,SELECT pg_catalog.pg_is_in_recovery()

I am going to keep exploring that, and perhaps look into the write-ahead log after it.
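To put a number on WAL volume, `pg_current_wal_lsn()` plus `pg_wal_lsn_diff()` gives bytes written between two samples; take one at the start and one at the end of a spike window. (The LSN literal here is only an illustrative placeholder copied from the checkpoint log line earlier in this thread.)

```sql
-- First sample: note the current WAL position.
SELECT pg_current_wal_lsn();

-- Later sample: how many MB of WAL were written since that position?
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), 'DE/1712E1B8'::pg_lsn)
       / 1024.0 / 1024.0 AS wal_mb_since_first_sample;
```

If the MB/minute rate jumps on the same 90-100 minute cycle, whatever is writing that WAL is the thing to hunt down.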

I know Postgres here is not managed, but if I had a support plan on my account, would this be something that could be looked at via a support ticket?

Hi… As more of a small, generic troubleshooting tip, rather than an answer, per se…

Overall, I’m not really a big fan of “load average” as a diagnostic tool; the blog post that Fly.io links to in their official metrics documentation is largely (although not wholly) in agreement, actually:

Instead of trying to debug load averages, I usually switch to other metrics.

The one time in the past that I’ve seen behavior similar to this, an interactive desktop system that periodically was slowing down to an absolute crawl, :turtle:, it was ultimately traced back to the (then-)new Linux transparent_hugepage/defrag setting. The kernel was just deciding that it was a good time to spend several minutes defragmenting! That’s not necessarily what’s causing your own timeouts, but you do need a much more fine-grained view of things to make progress, I’d say…
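For reference, the THP state can be checked (and, on most modern kernels, changed at runtime) through sysfs. This is an ops sketch, not a recommendation specific to your machine, and the `echo` lines need root:

```shell
# Current settings; the bracketed value is the active one.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Disable THP and synchronous defrag until the next reboot (run as root).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

Many Postgres tuning guides suggest `never` for THP on database hosts, since the defrag stalls can hit shared_buffers-heavy workloads hard.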

Yeah, it's not so much about the load numbers as the behavior that comes with them. Right now my RPS is around 95, and while the spikes slow things down, they don't cause any errors. The problem is when the RPS increases during the heavier periods.

I have even thought about moving my database to a different host, thinking that something could be wrong with this one. The behavior is so weird, and so remarkably on schedule, that it keeps telling me it is something at a deeper level than I can debug.
