Frequent slow requests in our Rails app deployed to Fly

Hi :wave:
We have a Rails app running on Fly.io, with a Better Stack monitor that alerts us when the service is unreachable. For some reason we cannot diagnose, we see 60-second spikes in request times in Fly.io’s Grafana dashboard, which leads to timeouts on some requests and to Better Stack opening incidents because of the slow responses.

At first, we thought it had something to do with the auto-scaling mechanism, which we disabled, but the problem continues. AppSignal provides performance metrics, and none of them seems to indicate a performance issue in our app, so we are running out of ideas for debugging this. We wonder if it has something to do with the Puma server our Docker image uses, or with some Fly configuration or issue. Do you have any hints on how to debug this?

Hey there,

The problem appears to be misconfigured concurrency settings: they’re set to soft: 5000 and hard: 6000. Those are very high values, especially for a Rails app on a 1x shared vCPU machine.

Auto-scaling won’t kick in unless the soft limit is reached on all instances in a region, and I don’t think your app can reach 5000 concurrent connections. I also noticed Puma is set to use 5 threads. All 5 are fighting over a single vCPU; I expect using a 2x shared vCPU machine might help.

To summarize: I would try lowering the concurrency limits drastically and probably use a bigger instance.
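For illustration, drastically lower limits could look like this in fly.toml, assuming the [http_service] layout (the port and the exact numbers are placeholders to tune under load, not recommendations):

```toml
[http_service]
  internal_port = 8080          # wherever Puma listens inside the machine

  [http_service.concurrency]
    type       = "requests"     # count in-flight requests rather than TCP connections
    soft_limit = 25             # Fly's proxy prefers other machines past this point
    hard_limit = 50             # requests beyond this are queued or rejected
```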

Thanks a lot @jerome for your response.

> The problem appears to be misconfigured concurrency settings: they’re set to soft: 5000 and hard: 6000. Those are very high values, especially for a Rails app on a 1x shared vCPU machine.

Good callout. I reduced the hard limit to 1000 and updated the spec to 2 cores instead of one. Do you think those are sensible numbers for a Rails setup?

> Auto-scaling won’t kick in unless the soft limit is reached on all instances in a region, and I don’t think your app can reach 5000 concurrent connections. I also noticed Puma is set to use 5 threads. All 5 are fighting over a single vCPU; I expect using a 2x shared vCPU machine might help.

Thanks a lot! I followed your suggestion. I’ll report back with some results after leaving it running for a couple of days.

Do you think the issue could have been all those threads contending with each other for the CPU?

I’ve been running the new machines for over half an hour, and we still notice peaks of slow requests. Here’s a screenshot from Better Stack.

We are also getting 502s, which are not coming from the Rails app itself, so I wonder whether they happen at the Puma level or in the Fly infra. The service can receive sudden bursts of requests: for example, 100 in 1 second, all waiting for IO operations to complete. This is something we are working on optimizing, but I was a bit surprised that the stack could not handle that gracefully.

I just “tuned” a PHP app. I don’t know if my approach was right or wrong, but maybe it’s helpful.

Ultimately, what you’re trying to figure out is how much concurrency the app can handle without blowing out any key resources — CPU, disk, memory, network.

I set the hard_limit low, and put the app under sustained load — say, at least two to three minutes.

In my case, disk, memory and network clearly weren’t bottlenecks, so I focused on CPU.

I noted the max CPU utilization under load, and if it was under 100% I made a guess about how much more work the app might be able to do. I increased the hard_limit and ran the test again.

Ultimately, what I wanted was CPU utilization as high as possible without hitting 100%, because once I hit 100% the app would obviously start to fall further and further behind.

Indeed, the “sustained” part of “sustained load” is very important. What I kept seeing is low CPU utilization initially, followed some time later by a ramp up as the app started to fall behind. If the concurrency wasn’t set too high, it would eventually stabilize again. If it was set too high… Destruction. Carnage. Chaos.

On a 2 core shared CPU, my soft_limit is 10 and my hard_limit is 20. That gives me ~185 RPS under sustained loads of up to 30 simultaneous connections. With 30 connections the average response time is ~170ms, with a large standard deviation.

I should point out that the URL I was testing here purposely avoided any caching; I was trying to get a feel for something like “worst case.” Hard to say how much caching would improve the results. But it’s still a PHP app, not Go or Rust; there’s a limit to how fast it can get!
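If it helps, here’s a minimal Ruby sketch of that kind of sustained-load run (a hypothetical script, not the tool used above; the target URL, concurrency, and duration are placeholders):

```ruby
# load_test.rb: drive a fixed number of concurrent requests for a fixed time,
# then report throughput and average latency.
require "net/http"
require "uri"

target      = URI(ENV.fetch("TARGET_URL", "http://localhost:8080/"))
concurrency = 30
duration    = 180 # seconds; short bursts hide the ramp-up described above

deadline = Time.now + duration
timings  = Queue.new
errors   = Queue.new

workers = concurrency.times.map do
  Thread.new do
    while Time.now < deadline
      started = Time.now
      begin
        Net::HTTP.get_response(target)
        timings << (Time.now - started)
      rescue StandardError
        errors << 1 # timeouts and refused connections count as failures
      end
    end
  end
end
workers.each(&:join)

samples = []
samples << timings.pop until timings.empty?
abort "no successful requests" if samples.empty?
puts format("%d ok, %d failed, %.1f req/s, avg %.0f ms",
            samples.size, errors.size, samples.size / duration.to_f,
            samples.sum / samples.size * 1000)
```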


I’m going to start by assuming that you are using PostgreSQL. If not, let’s talk through that.

Next, you are undoubtedly using Puma (that’s effectively the default these days). If not, let’s talk.

Puma defaults to threads, and Ruby has a Global Interpreter Lock (GIL, also called the GVL) which limits how far you can push threads. For your load, processes are a better match (trading off a little bit of latency for throughput). Look up Puma cluster mode.
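For example, a minimal config/puma.rb sketch of cluster mode (the worker and thread counts are illustrative starting points, not tuned values):

```ruby
# config/puma.rb: cluster mode forks several worker processes, each with its
# own GVL, so the app can actually use more than one core.
workers ENV.fetch("WEB_CONCURRENCY", 2).to_i # a common starting point: one worker per vCPU

threads_count = ENV.fetch("RAILS_MAX_THREADS", 5).to_i
threads threads_count, threads_count         # min, max threads per worker

preload_app! # load the app once before forking; copy-on-write keeps memory down

port ENV.fetch("PORT", 8080)
```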

Finally, what you want to do is find out how many requests per second a single machine can handle, aim slightly below that, and define a pool of machines that are only spun up when needed. Look into the [http_service.concurrency] settings and auto_stop_machines in fly.toml.
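In fly.toml that could look something like this (values are assumptions, not tuned for your app):

```toml
# fly.toml: stop idle machines and let Fly's proxy wake them on demand.
[http_service]
  internal_port        = 8080
  auto_stop_machines   = true # newer flyctl versions also accept "stop" / "suspend" / "off"
  auto_start_machines  = true
  min_machines_running = 1    # keep one machine warm to avoid cold starts
```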


Thank you, folks :pray: Besides your recommendations, I did some further reading on Ruby’s GVL and the Puma HTTP server, and I now have a better understanding of the situation:

  • At the Puma layer, many requests were waiting in the queue; some took too long to process, and others were rejected outright with 5xx errors. Those requests are mostly IO-bound, so they benefit significantly from more threads.
  • I increased the number of threads to 15 and the number of CPUs to 4, and reduced the soft and hard limits in the Fly configuration so that Fly scales the service before requests start piling up in the queue.
  • I’m going to optimize the API design. Right now, a client can send 100 requests when it could send just 1, with many threads waiting on IO responses. I still need to check whether it’s OK to have 100 threads spawned by a single request and join the results (see the sketch below); my biggest concern is that we end up with GVL contention.
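For what it’s worth, a minimal fan-out sketch with a bounded pool (the fetch_all helper and the pool size are made up for illustration; IO-bound threads release the GVL while they wait, which is what keeps contention manageable):

```ruby
require "net/http"

# fetch_all: hypothetical helper that fans a list of URIs out over a bounded
# pool of threads and joins the responses. Capping the pool avoids spawning
# 100 threads per request while still overlapping the IO waits.
def fetch_all(uris, max_threads: 20)
  work = Queue.new
  uris.each { |u| work << u }
  results = Queue.new

  pool = [max_threads, uris.size].min.times.map do
    Thread.new do
      loop do
        uri = begin
          work.pop(true) # non-blocking pop; raises ThreadError once drained
        rescue ThreadError
          break
        end
        results << [uri, Net::HTTP.get_response(uri)]
      end
    end
  end
  pool.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

# Usage (hypothetical endpoint):
# responses = fetch_all(ids.map { |id| URI("https://api.example.com/items/#{id}") })
```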

Once again, thanks to everyone for helping. I’ll update this thread with my findings in case anyone finds them useful.

Excellent!

I’m kinda curious as to why you went with threads instead of processes. Processes should make better use of multiple CPUs.

By the way, if it is you doing the read directly (vs calling some gem), you can avoid threads using IO.select (see the Ruby core docs).
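A minimal IO.select sketch, assuming raw sockets (real HTTP needs a client that exposes its socket, so treat this as illustrative only):

```ruby
require "socket"

# Multiplex several sockets from one thread: IO.select blocks until at least
# one is readable, so there is no need for a thread per connection.
sockets = 3.times.map do
  TCPSocket.new("example.com", 80).tap do |sock|
    sock.write("HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
  end
end

until sockets.empty?
  readable, = IO.select(sockets, nil, nil, 5) # wait up to 5 s for activity
  break if readable.nil?                      # timed out

  readable.each do |sock|
    chunk = sock.read_nonblock(4096, exception: false)
    case chunk
    when :wait_readable
      next                  # not actually ready; try again on the next select
    when nil                # EOF: the server closed the connection
      sockets.delete(sock)
      sock.close
    else
      puts "read #{chunk.bytesize} bytes" # a real client would buffer and parse
    end
  end
end
```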
