Some load testing results and a few questions

Hi! I’m doing some load testing on a simple static web server with a basic setup: standard 1-10 autoscaling, a micro-1x VM, FRA region.
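
For reference, the setup was configured roughly like this (flyctl syntax from around that time, written from memory, so treat it as approximate):

```sh
# approximate flyctl commands for the setup described above
flyctl regions set fra         # run in the Frankfurt region
flyctl scale vm micro-1x       # smallest VM size
flyctl scale set min=1 max=10  # standard 1-10 autoscaling
```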

Starting with 50 RPS, everything seems fine.

Increasing to 550 RPS, the results interestingly look even better.

I guess that’s because of internal caching; at the time of testing, flyctl logs was not working for me at all. At this stage the app was not scaling yet, still a single micro-1x.
So then I set the rate to 1550 RPS and the server was still keeping up; however, the number of concurrent requests increased to 750.

At this point I had to wait a few minutes for the app to scale up. During this period connections started to drop and I got some error responses too :slightly_frowning_face:

But then it scaled up to 10 VMs and became responsive again, though maybe not at the level I’d have expected.

I then tried to increase the rate above 2000 RPS, but surprisingly the throughput kept fluctuating around 1900 RPS. Maybe fly.io rate-limited my requests? I tried to increase the number of concurrent (“in-flight”) requests, but the results became even worse, so ~1900 RPS seems to be the maximum throughput. I also observed that auto-scaling lagged a bit behind the actual demand.

Any comments or ideas on how to improve the throughput are welcome. Thanks.

I almost forgot: I used my ego network visualization experience for the load testing,
available here: https://github.com/jveres/ego-ui
(currently served from https://ego.jveres.me)

Hey there!

These are interesting results. Would you mind sharing your app name? I can look at what happened with a bit more precision. For now, I’m assuming this is the “egoweb” app which seems to have had a traffic spike recently.

I suspect your application was “queuing”, which happens as soon as your hard limit is reached. We also have a queue limit, at which point we drop connections. Your hard limit is set to 25; if your app can handle more than 25 concurrent connections (1 request == 1 connection), then you can bump that up significantly. We should make the default higher, since most apps can handle more than that. It looks like your app is just static pages served by a Go server, so I’d try much higher limits for that kind of app.
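
For example, the concurrency block in fly.toml could look something like this with higher limits (the numbers here are only illustrative, pick what your app can actually handle):

```toml
# illustrative fly.toml concurrency settings for a static Go server
[services.concurrency]
  type = "connections"
  soft_limit = 150   # above this, traffic prefers other instances and scaling is considered
  hard_limit = 200   # above this, connections queue (and eventually drop)
```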

Scaling is definitely not instantaneous. It usually takes a few seconds, or a few minutes in the worst-case scenarios, depending on your image size and whether the cache is warm on the targeted servers. Scaling horizontally happens automatically when a threshold based on your concurrency limits is hit.

During your test, your hard limit was reached thousands of times per second :slight_smile:.

As far as I can tell, your micro-1x didn’t work too hard, though it’s difficult to judge with such low concurrency limits.

We’re working on exposing more of these metrics.


Hi @jerome!

Yes, it is “egoweb”, a static Go server. The hard limit during load testing was set to 50.
I’m curious about your findings.

Thanks.

I’m not seeing that here. Is it possible your deploy failed? I see the last 6 versions (they’re from the last 3-4 hours) all use a concurrency setting of 20,25.

You’re right, I’ve deployed with 50 again. There’s a Deno backing service which creates the actual JSON result, and that’s where I had already set the hard limit to 50. I measured that separately and it tops out at the same ~2000 RPS.

I don’t really see much traffic at all in the past hour for your app.

This could be limited by your test machine. Are you running this locally or from a VM somewhere?

I’m running the tests locally from my MBP. I was also thinking that maybe I’m limited by my ISP, but then other load tests would top out similarly, which is not the case as far as I know. I’m going to double-check.

I don’t think it’s my test machine’s limitations; I’m able to load test http://example.com at 5000 RPS and beyond.

I tested that endpoint earlier, roughly between 2020-09-25 11:17 and 11:40.

What’s the actual command you’re running to test?

Testing this stuff is tricky, as you’ve found. I would recommend manually scaling your app to do load testing like this just to keep things as simple as possible. Here are some things you can try:

  1. Your local machine could be bottlenecking on HTTPS (vs plain HTTP on example.com).
  2. Connection pooling (especially with HTTPS) makes a big difference. If you’re trying to test SSL performance, you’ll want to tune pooling differently than if you’re trying to test HTTP performance.

One thing to know about our infrastructure is that each request creates a new connection to your actual process from the local host hardware. Some servers get slow trying to handle that many TCP connections. We have an experimental concurrency mode that does HTTP connection pooling between our proxy and your app. If you want to try that, add type = "requests" to the concurrency block in fly.toml. This will break autoscaling but should perform better for a blitz of tests.
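
In fly.toml that would look roughly like this (limits again just illustrative):

```toml
# experimental requests-based concurrency: the proxy pools HTTP connections to your app
[services.concurrency]
  type = "requests"
  soft_limit = 150
  hard_limit = 200   # note: autoscaling doesn't trigger on this mode yet
```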

Is there a request rate you’re aiming for out of curiosity?

I’m using slapper.
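
Something like this (exact flags may differ between slapper versions):

```sh
# targets.txt contains one request per line, e.g.:
#   GET https://ego.jveres.me/
slapper -targets targets.txt -rate 1550 -workers 8
```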

Good catch! Testing https://example.com gives me only 2000-3000 RPS.

I just tried type = "requests", but with that, at around 1500 RPS all requests get dropped.

I’m just exploring fly.io and preparing for an internal presentation for our devs.

This is probably hitting the hard limit on one instance, since the requests-based concurrency doesn’t trigger autoscaling yet. ~1500 RPS works out to about 50 concurrent requests that finish in 30 ms each. If you run flyctl scale set min=10 and then try again, you might see a different result.
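
(Roughly: throughput ≈ concurrent requests / request duration, so 50 / 0.03 s ≈ 1,700 requests per second, which lines up with the ceiling you’re seeing.)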

That 2000-3000 RPS result you’re getting from example.com is probably the most you can expect from your laptop. Going beyond that will mean running tests concurrently from multiple hosts on multiple networks.

You are most probably right, thanks for looking into it.

:+1: Thanks for the notes. And slapper looks really neat.

Indeed. Btw what would you recommend for distributed load testing?

We just run burn (an equivalent of slapper) from a bunch of VMs in multiple regions. I know of a few people who’ve used Locust against their apps.

I just checked again with https://example.com and now it easily went up to 6k RPS, which shows that local bandwidth shouldn’t be the bottleneck.
Against the static server:

  1. a single micro-1x: ~1.5k RPS
  2. 10x micro-1x: ~3k RPS
  3. 10x micro-1x with type = "requests" and flyctl scale set min=10 as advised: ~3.3k RPS

In other words, it’s a 10x increase in VM cost for a little more than double the throughput. Any other ideas for increasing the performance, @kurt? Thanks!