I have an application which has at most 20 concurrent users. I have spun up 3 machine clusters, all on V2 of apps.
The app will quite often hang, failing to load anything from the database for up to 2 minutes at a time. I don’t see anything obvious in the logs or the metrics that would suggest common failures like resource exhaustion except for this log in the Postgres app:
2023-07-10T00:12:34.399 health[xxx] ewr [error] Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help. [✗] cpu: system spent 1.22s of the last 10 seconds waiting on cpu (18.64µs)
I have found Postgres Configuration Tuning · Fly Docs but what I cannot figure out is how to get the data I will need to figure out what is actually going wrong so that I change the right setting. I would rather not scale something that is unrelated. The overview page says it has an average memory usage of 1/3 the available memory, and average load of 0.75, and the metrics graphs don’t show any meaningful spikes that would explain why resources are exhausted.
Anyone have any ideas? The almost always fails at the critical moment when someone is trying to change something (it is used for scheduling at a small music camp) so this is a pretty serious flaw that I need to address as soon as possible.
to be clear: this is serious enough, I may be forced to find another provider if there are no good solutions. I can’t keep getting text messages from faculty members telling me the app won’t let them schedule stuff.
It’s always tricky to debug issues like this as there are so many variables. Some thoughts …
The best place to start would be with that line from the log. If the database is indeed having to wait for the CPU, that would explain large delays. Since it can’t do much without that. You say you don’t want to scale anything unrelated (which makes sense - you don’t want to pay for more resources than you actually need) however at least when it comes to debugging, sometimes that is exactly what you need to do to see what impact it has. With Fly you pay per second/minute or whatever, and so you could just upgrade the CPU for a day and see if your issues continue. If those errors go away, well that was highly likely the issue. Solved. If not, well logically it can’t have been that. You could increase the RAM. And so on.
If the budget doesn’t allow even a short increase in cost, the other (better) approach is to dig into your code and/or queries to see what exactly is using up the resources to avoid you needing to pay more. Look at what requests are slow. Depending on your app’s language (PHP, Nodejs …) and ORM (Eloquent, Prisma …) if you are using one, there should be some kind of debug option to see what queries it is doing.
You mention it failing when someone is trying to change something . That suggests an UPDATE query. Is that UPDATE looking at many rows? Is it using an index? The biggest performance boost is going to be from adding an index on whatever column you have in the WHERE. Else it will need to look at all the rows to find a match, which is slow and inefficient. Like … if you have WHERE user_id = ‘abcde’ … is there an index on that user_id column? If there isn’t, and there are lots of rows, that will be much slower than if there is.
If you have a copy of your query, try running it manually on your local machine. Stick EXPLAIN in front of it to see what Postgre’s plan is. Again, showing whether it will use an index or not.
average load of 0.75 is quite high for postgres (considering you have just a few concurrent users). It indicates that something is not performing well, such as missing indexes or a large amount of data being processed / aggregated.