HTTP response times all over the place

I switched over the weekend from a k8s cluster to fly.io after spending a week experimenting with fly.

Since the switch, HTTP response times are all over the place. Sometimes they are where they should be (70 ms or less), and other times they run into the seconds or error out entirely. I’ll attach some pictures of the problem.


My API app has had known performance characteristics for the 2 years it ran before this move to fly.io. It is a Rust actix-web REST API connected to a PostgreSQL database. It’s pretty simple in terms of architecture. I’ve changed so many settings trying to figure out what is going on, and I can’t figure it out. Some users have complained about things just spinning or taking forever to load as well.
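For context, the shape of the app is roughly this (a minimal sketch, not the actual code; sqlx and the DATABASE_URL env var are stand-ins for whatever driver and config are really in use):

```rust
use actix_web::{web, App, HttpResponse, HttpServer};
use sqlx::postgres::PgPoolOptions;

// Returns 200 if a database connection can be established.
async fn db_health(pool: web::Data<sqlx::PgPool>) -> HttpResponse {
    match sqlx::query("SELECT 1").execute(pool.get_ref()).await {
        Ok(_) => HttpResponse::Ok().finish(),
        Err(_) => HttpResponse::ServiceUnavailable().finish(),
    }
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Shared PostgreSQL connection pool (sqlx is an assumption here).
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect(&std::env::var("DATABASE_URL").expect("DATABASE_URL not set"))
        .await
        .expect("failed to connect to postgres");

    HttpServer::new(move || {
        App::new()
            .app_data(web::Data::new(pool.clone()))
            .route("/db-health", web::get().to(db_health))
    })
    .bind(("0.0.0.0", 8080))?
    .run()
    .await
}
```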

My app is deployed in dfw and was also in iad before I pulled it out of iad to see whether that was the problem. I’m using dedicated CPUs for it, so I’m not experiencing the randomness of a shared CPU.

I have no idea what is going on, but if it continues I’m going to have to go back to my previous provider. Is there some sort of networking issue going on at fly.io? I really need help getting this resolved.


Is this connecting to a database somewhere? And are you sure it’s only running in DFW? fly status will show you where specific instances are running.

We sometimes see performance issues like this when requests get routed to a VM that’s not close to a database.

It is connected to the leader, which is also running in dfw.

I’ve checked fly status for both the api and the database. The api is running only in dfw (scale of 2), and the database is running in dfw (leader), dfw (replica), and iad (replica).

Try removing the IAD replica and see if anything improves? Moving the app servers to one region was a good start; given the way the Postgres connections work, though, you could theoretically still be connecting through IAD from DFW.

We’ll have a look and see if anything pops out at us, too, but there aren’t any network/platform issues that we’re aware of. A “simple” Rust app that talks to Postgres should perform as well as it did on your other infrastructure though, assuming similar DB resources.

I’ve removed the iad replica. I’ll monitor to see if response times improve.

Overall response times are still higher than expected. Most of the api endpoints should return in less than 70 ms round trip time; many climb to 300-400 ms RTT or even higher.

I’m still seeing high tls handshake times, handshake errors, and multi second http response times.

With HTTP/2, TLS errors can happen when the actual request is slow. It’s confusing, but I think the root problem here is slow HTTP responses.

Can you narrow down what’s actually slow? Our proxy is “seeing” slow responses from the app. If it’s something in our environment, it seems most likely it’ll be between the app and the db instances though. Do you have any tracing available within the app to see if that’s the case?

What kind of hardware was this running on before? Is it possible you need more than one CPU for concurrency?

Looking at some charts, it seems like CPU spikes quite a bit. This might indicate contention. dedicated-cpu-1x is still just 1 CPU.

Can you try (as a test) dedicated-cpu-2x?

I’m changing to a dedicated-cpu-2x right now. I’ll see if that improves things. I wish y’all offered a plan with more CPU cores and less memory. Being a Rust app, it barely uses any memory at all.

I was previously running at minimum two instances of the api on a 3-worker k8s cluster with 2 cores per worker. Each instance was guaranteed a minimum of 0.3 of a CPU core. The workers were on dedicated CPU cores.

The database we were previously running off of was only a single core with 2 GB of memory, so there isn’t a difference here when it comes to the database.

I did some tracing stuff last night to determine whether the problem was the time to get a connection to the database, but those times were all less than 3 milliseconds.
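Roughly what that check looked like (a sketch; sqlx’s pool API is assumed here, the actual driver may differ):

```rust
use std::time::Instant;

// Time only the pool checkout, separate from whatever the query
// does with the connection afterwards.
async fn time_pool_checkout(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    let start = Instant::now();
    let _conn = pool.acquire().await?;
    println!("pool checkout: {:?}", start.elapsed()); // consistently under 3 ms
    Ok(())
}
```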

As @jerome said, it may just be CPU contention that is the problem here. I didn’t scale up to 2x CPU initially because 4 GB of memory is a waste for my app. I’ve never seen memory usage go above about 300-400 MB, so 4 GB is just overkill. Any plans for offering more CPU cores with less memory?

I think the connection itself should be fast, but possibly some queries?

Looks like it’s still slow even with 2 CPUs.

Do you know how many threads each process had in your previous setup? 2 CPUs == 2 dedicated threads.

Yes, this is coming in the next few months.

Some queries are slower than others, that’s for sure. And those queries can’t really be optimized more than they already are. But most endpoints that should be FAST are often not fast. Sometimes they are, sometimes they aren’t.

I think it was just spawning in 1 thread previously.
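For what it’s worth, actix-web defaults to one worker thread per logical CPU, so on a 1x machine that’s a single worker. It can be pinned explicitly (a sketch; the port and empty app are placeholders):

```rust
use actix_web::{App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new())
        // Defaults to the number of logical CPUs if not set.
        .workers(2)
        .bind(("0.0.0.0", 8080))?
        .run()
        .await
}
```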

That’s a good sign! The TLS handshake times you’re showing are our metrics, right? What does request time look like from inside one of the workers? (fly ssh console -a your-app -s, then run curl in there)

Yes, those are fly.io Prometheus metrics put into Grafana.

curl request times from inside a worker (connected via ssh) look almost identical to running locally for equivalent endpoints and data.

The tracing work you did earlier showed that connections to Postgres were fast. Were the queries themselves slow, or was it something else in the response time (some pre/post-processing)?

Do the DB schemas look exactly the same on k8s and fly? (indexes?) I know very little about your app so just throwing ideas out there.

I was thinking that the connections to Postgres were somehow slow, causing the terrible performance. That didn’t turn out to be the case, as it was taking anywhere from 0-3 ms to establish a connection.

I then checked timings all throughout a particular route and everything looked normal.

The actix-web middleware logger displays the overall time to process each request. Yesterday, while tracing, I was seeing process times of about 3 ms that equated to about 300-600 ms from the user’s perspective in the browser. Then I’d also see normal times of 60-70 ms when nothing had changed and there was no external traffic. Then performance would tank again, with nothing having changed. That’s why I was wondering about some networking issue.
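That logging setup is roughly this (a sketch; the format string is an example, not my exact config):

```rust
use actix_web::{middleware::Logger, App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    env_logger::init(); // Logger emits through the `log` crate

    HttpServer::new(|| {
        App::new()
            // %r = request line, %s = status code, %D = time to serve (ms)
            .wrap(Logger::new("%r %s %Dms"))
    })
    .bind(("0.0.0.0", 8080))?
    .run()
    .await
}
```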

@kurt Response times are still abysmal regardless of CPU count for either the database or the api.

I just had an endpoint take 1.23 seconds RTT whereas the api only took 0.000207 seconds to process the request.

I ran some apachebench load tests last night and can’t get beyond about 140 requests per second, even for routes that don’t hit the database. During the runs I had top open in the container console, and CPU usage was only around 3%.


Where did you find this 0.000207 number? Are you logging response times?

As far as our metrics go, the response times from your app match the response times at the edge, meaning it took pretty much the same amount of time to get a response from your app as it took to respond fully from our edge.

Often when users measure their response times, they don’t include various parts of the timing:

  • Reading the request
  • Processing the request
  • Writing the response headers

This is why I’m asking how you’re measuring this 🙂

Our own timing does not include writing the whole body, just the headers. However, if I look at the body timings, it seems like it’s taking a long time to transmit the whole body.
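One way to see that headers-vs-body split from the client side (a sketch using reqwest; the URL is a placeholder):

```rust
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let start = Instant::now();
    // `get` resolves as soon as the response headers arrive...
    let resp = reqwest::get("https://your-app.fly.dev/states").await?;
    let headers_at = start.elapsed();
    // ...while `bytes` waits for the whole body to be transmitted.
    let body = resp.bytes().await?;
    println!(
        "headers: {:?}, full body: {:?} ({} bytes)",
        headers_at,
        start.elapsed(),
        body.len()
    );
    Ok(())
}
```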

Now I’m wondering if this isn’t a buffer issue of some kind, with buffers being either too big or too small.

Do you have an example request for us to test with? No need to include the app name or any sensitive information, we should be able to figure that out (unless we need to auth to your app).

The 0.000207 came from the logging middleware inside my api. It reports the time it took to get a response from the endpoint. The 1.23 seconds number came from the web browser developer tools networking section. I can also reproduce the numbers inside Paw.

Some endpoints are intentionally slow, as they have to generate PDFs or perform network requests to places like Stripe. I’m not concerned about those routes being slow; the problem is the routes that perform very simple requests and are still slow.

The issue is most easily reproduced when multiple requests come in at the same time from the web browser. Individual requests, once a connection has been established, yield correct times.

I have a couple public endpoints that can be used for testing.

  • /version → returns a string compiled into the binary
  • /db-health → connects to the database and returns 200 if a connection can be established
  • /states → connects to database and returns the names and abbreviations of all 50 states

I can provide information via private message to log in and experience the issue if you have trouble reproducing.
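For reference, the simplest of those handlers look roughly like this (a reconstruction; the handler names, table, and query are illustrative, not the actual code):

```rust
use actix_web::{get, web, HttpResponse, Responder};

#[get("/version")]
async fn version() -> impl Responder {
    // A string baked into the binary at compile time; no I/O at all.
    env!("CARGO_PKG_VERSION")
}

#[get("/states")]
async fn states(pool: web::Data<sqlx::PgPool>) -> HttpResponse {
    // One small read-only query returning all 50 states.
    let rows: Result<Vec<(String, String)>, sqlx::Error> =
        sqlx::query_as("SELECT name, abbreviation FROM states ORDER BY name")
            .fetch_all(pool.get_ref())
            .await;
    match rows {
        Ok(states) => HttpResponse::Ok().json(states),
        Err(_) => HttpResponse::InternalServerError().finish(),
    }
}
```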

Does your browser show a breakdown of where that time was spent? Something like this?

This is the slowest I got, from CDG (Paris):

This is slow enough that I’d want to look at it, but I’m curious to look at your numbers too, and which region you’re doing the request from.