New CPU scheduling is causing big problems for my lightly used postgres-flex cluster

I have a small postgres-flex cluster of 3 shared-cpu-2x:1024MB machines running Postgres 15, which backs a tiny two-member Mastodon server that doesn't get much traffic. The database is small, about 5 GB.

I paid close attention to the recent CPU scheduling announcements to make sure the changes wouldn't be a problem for me, and after reviewing my metrics I wasn't worried: most of the time these machines sit well below 6.25% utilization with their quota balance maxed out. But since quota enforcement began, I've run into several unexpected problems that have made this cluster very difficult to manage:

  • A full WAL archive backup (which was happening every 24 hours, the default set by fly pg backup enable) drives CPU usage so high that the primary typically starts failing its io health checks, and sometimes becomes unresponsive for long enough that queries time out and replicas drop their connections and have to reconnect. This happens even when the machine has plenty of CPU quota available and, as far as I understand, shouldn't be getting throttled.

  • On a few occasions (admittedly before I scaled the machines up from shared-cpu-1x to shared-cpu-2x to try to mitigate the problem), a full WAL archive backup seems to have triggered a runaway scenario: the primary's CPU was pegged for so long that it exhausted its quota, and then so many operations timed out and triggered reconnects that the connection limit was exhausted, leaving the cluster in a broken state. I had to recover manually by forcing the promotion of a new primary (I'm not sure whether there was a better way; I'm admittedly not a Postgres expert).

  • I decided to try disabling WAL archiving altogether, only to discover that there's no fly pg backup disable command, and I'm not sure how to disable it manually without causing other problems (the closest I've come is the rough sketch after this list).

  • Next I created a new Postgres 16 cluster to see whether it exhibited the same issue, and used fly pg import to pull the data over from the old cluster. The import drove CPU utilization on the old primary high enough to exhaust its quota, and queries began timing out; it then exhausted the quota on the new primary as well, with the same result.

  • Eventually, after a long wait, the import finished and appeared to have succeeded. But when I looked at the database on the new Postgres 16 cluster, I found it was only partially populated: some tables were complete, while others were empty. My guess is that even though the fly pg import command itself reported no problems, some inserts must have failed and never been retried. This is a pretty scary failure mode, since I could easily have overlooked the data loss (a per-table row-count check like the one sketched after this list makes the gaps easy to spot).
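For reference, the closest thing I've found to a manual off switch for archiving is neutralizing the archive command itself. This is only a sketch under my own assumptions: the app name is a placeholder, and I'm assuming the flypg backup tooling tolerates a no-op archive_command, which isn't something the docs promise.

```
# Sketch only, not a supported procedure: make archive_command a no-op so WAL
# archiving stops doing real work, without the restart that changing
# archive_mode = off would require. "my-db-cluster" is a placeholder app name.
fly postgres connect -a my-db-cluster

# ...then, at the psql prompt on the primary:
#   ALTER SYSTEM SET archive_command = '/bin/true';  -- each WAL segment "archives" instantly
#   SELECT pg_reload_conf();                         -- archive_command reloads without a restart
```

I haven't actually tried this yet, which is why the bullet above says I'm not sure it's safe; if it would conflict with the backup machinery flypg runs, I'd love to know.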
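On the import side, here's the kind of quick per-table sanity check I mean. It's just a sketch with placeholder app and database names, and n_live_tup is only an estimate, so run ANALYZE first on the freshly imported cluster.

```
# Rough completeness check after an import: compare per-table row estimates
# between the old and new clusters and eyeball the differences.
# App and database names below are placeholders.
fly postgres connect -a my-new-db-cluster --database mastodon_production

# ...then, at the psql prompt (repeat on the old cluster and compare):
#   ANALYZE;                                   -- refresh statistics on the fresh cluster
#   SELECT relname, n_live_tup AS approx_rows
#   FROM pg_stat_user_tables
#   ORDER BY relname;
```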

I’m not sure what to do. 99% of the time the shared CPU quota isn’t a problem, and the price jump between shared CPUs and performance CPUs is prohibitive for my use case, so it doesn’t make sense to scale up just so backups won’t wreak havoc on my cluster. Plus I’ve already bought lots of shared CPU machine reservations.

For now I’ve used fly pg backup config update to reduce the full backup frequency to once a week to mitigate the pain, since I can’t disable backups and can’t trust fly pg import to migrate to a new cluster.

I’m not a very experienced Postgres user and I realize that this isn’t managed Postgres so some manual effort and occasional headaches are to be expected, but I didn’t expect the headaches to come from using fly pg backup and fly pg import to do exactly what they were designed to do. :laughing:

I hope this feedback helps inform future decisions about these features. If anyone has advice for me, I’m all ears!
