For context, I am running this on a performance-8x machine with 16 GB of RAM.
I have been chasing a performance issue with my database for a few days now, and I think I finally have a small clue. For reasons still unknown to me, CPU usage stays low for most of the day, under 30%. The issue is that load spikes to more than 100% throughout the day, and it does so in a very noticeable pattern: the spikes arrive at 90-100 minute intervals. What I have not been able to identify is what causes them. I have looked into scheduled jobs, autovacuum/autoanalyze operations, and replication. I have toyed with worker settings, shared_buffers, connection limits, you name it. My traffic is fairly constant, so if this were application load I would expect constant load issues and perhaps higher overall CPU usage; the fact that the spikes follow such a marked cycle makes me think something else is going on.
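One thing I still want to do during the next spike is check pg_stat_activity to see whether the active backends are stuck waiting on I/O. Something like this should do it (standard catalog view, nothing app-specific; the 60-character query truncation is just to keep the output readable):

```sql
-- Snapshot of what every non-idle backend is doing; run while a spike is in progress.
-- Many rows with wait_event_type = 'IO' would support an I/O-pressure theory.
SELECT pid,
       backend_type,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start AS xact_age,
       left(query, 60)   AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_age DESC NULLS LAST;
```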
The latest clue I have is that it has to do with checkpoints. Searching through the logs I have come across these messages, and while checkpoints happen more often than the load spikes, I have seen one at the start of every spike. The checkpoint itself seems to last about 5 minutes, but each spike period lasts a fairly constant 30-40 minutes.
2024-11-22T06:33:05.816 app[1781959f4de2d8] iad [info] postgres | 2024-11-22 06:33:05.815 UTC [477] LOG: checkpoint complete: wrote 32949 buffers (6.3%); 0 WAL file(s) added, 0 removed, 6 recycled; write=269.959 s, sync=0.743 s, total=270.705 s; sync files=44, longest=0.434 s, average=0.017 s; distance=94509 kB, estimate=94509 kB; lsn=DE/4547FE20, redo lsn=DE/1712E1B8
2024-11-22T06:33:27.206 proxy[1781959f4de2d8] iad [error] [PC05] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
2024-11-22T06:33:27.223 proxy[1781959f4de2d8] iad [error] [PC05] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
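To dig into this, I am planning to look at the checkpoint settings and at whether the checkpoints are time-driven or requested (e.g. by WAL volume). As far as I understand, on Postgres 16 and earlier the counters live in pg_stat_bgwriter, and they moved to pg_stat_checkpointer in 17:

```sql
-- Current checkpoint-related settings.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout',
               'checkpoint_completion_target',
               'max_wal_size',
               'shared_buffers');

-- Were recent checkpoints time-driven or requested (e.g. WAL volume)?
-- Postgres 16 and earlier:
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;
-- (On Postgres 17 the equivalent counters are num_timed / num_requested
--  in pg_stat_checkpointer.)
```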
Now, the timeouts right after the checkpoint are kind of telling. My theory so far is this:
- Initial checkpoint causes heavy I/O
- I/O pressure causes connection timeouts
- Failed connections cause retries
- System triggers another checkpoint due to load
- This creates a cascading effect (a rough tuning sketch I am considering is below)
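Based on that theory, the change I am considering (but have not applied yet, and the values are just placeholders) is to spread checkpoint I/O over a longer window so each checkpoint has fewer dirty buffers to flush per second:

```sql
-- Rough idea only, values are guesses: make checkpoints less frequent and more
-- spread out, and raise max_wal_size so WAL volume does not force extra checkpoints.
ALTER SYSTEM SET checkpoint_timeout = '15min';        -- default is 5min
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- default since v14
ALTER SYSTEM SET max_wal_size = '4GB';                -- default is 1GB
SELECT pg_reload_conf();  -- as far as I know these are reloadable without a restart
```

I would appreciate a sanity check on whether that is even the right direction before I touch anything.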
For more context, I even upgraded to a 16x machine for a couple of days to see if that would help, but the spikes continued, and a 16x machine is obviously overkill for what I am doing. Even the 8x one is over-provisioned, but I can work with that for a few more days.
Has anyone experienced similar issues? I am not a Postgres expert by any means, and I am not very familiar with things like replication, WAL settings, and the like. In any case, I am running this on a single machine. (I know I don't have redundancy, but my use case is very specific and I can live without it.)
Here are some charts that show the spikes recurring in a very regular pattern.