Am I holding it wrong? Laravel performance on fly.io

I ran some tests this morning, using a Laravel-based test app I’ve tested on other providers.

I followed the fly.io docs for Laravel; I’m using fly.io’s Docker image, etc.

Also: I’m using an SQLite database. For these tests I’ve just left the database on the machine’s ephemeral storage rather than a persistent volume. (These are read-only tests.)

My tests have involved tweaking these parameters in fly.toml:

[env]
  ...
  PHP_PM_MAX_CHILDREN = '10'
  PHP_PM_START_SERVERS = '5'
  PHP_PM_MIN_SPARE_SERVERS = '5'
  PHP_PM_MAX_SPARE_SERVERS = '5'
  ...

[http_service]
  ...
  [http_service.concurrency]
    ...
    soft_limit = 5
    hard_limit = 10
...
[[vm]]
  memory = '512mb'
  cpu_kind = 'shared'
  cpus = 1

My tweaks have mostly involved making sure that resources aren’t under-utilized — e.g., that CPU isn’t at 50%.
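For anyone unfamiliar with these knobs: I believe the PHP_PM_* variables get templated into the php-fpm pool config by the image’s entrypoint (that mapping is my assumption; worth confirming against the image you’re using). The pool directives they’d correspond to look like this:

```ini
; php-fpm pool (www.conf) directives — the env-var mapping is assumed,
; not confirmed; check the image's entrypoint/templates.
pm = dynamic
pm.max_children = 10      ; hard cap on worker processes
pm.start_servers = 5      ; workers forked at startup
pm.min_spare_servers = 5  ; keep at least this many idle workers
pm.max_spare_servers = 5  ; reap idle workers above this count
```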

I’ve run the tests from my Mac, using variations of this command line:

wrk -c 10 -d 180s -t 5 --latency 'https://xxx.fly.dev/'

That’s not a heavy load, just a sustained one.

The results I’m getting aren’t good. I think I must be overlooking something?

For shared-cpu-1x@512mb I’m getting ~3.5 RPS.

For shared-cpu-2x@512mb I’m getting ~5.5 RPS.

For shared-cpu-4x@1gb I’m getting ~11.5 RPS.

Most surprisingly, for performance-cpu-1x@2gb I’m getting ~3.3 RPS.

Laravel is not light, not fast; and shared CPUs are shared CPUs. But at this point I’m questioning my testing more than the results.

Generally speaking, I’m using fairly low hard_limits, because there’s no point adding more load to a CPU that’s already fully loaded.

And I’m setting php-fpm max children roughly in line with the hard_limits.
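For what it’s worth, the usual memory-based ceiling for max_children can be sketched like this; the reserved and per-worker numbers below are illustrative assumptions, not measurements from my app:

```python
# Rough php-fpm worker sizing: fit workers into the RAM left over after
# the OS and other services. All of these numbers are assumptions.
vm_memory_mb = 512    # matches the [[vm]] memory in fly.toml
reserved_mb = 128     # OS, nginx, misc overhead (assumed)
per_worker_mb = 38    # typical Laravel worker RSS (assumed)

max_children = (vm_memory_mb - reserved_mb) // per_worker_mb
print(max_children)  # → 10
```

So a max_children of 10 on a 512 MB machine is at least in the right ballpark; the binding constraint in these tests is CPU, not memory.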

But, really, changes to those parameters haven’t made much difference. (Again, that’s once I get to 100% CPU utilization.)

Thoughts?

As you say, shared CPUs are never going to be that impressive. As you work up through their 1/2/4 ladder, the RPS improves accordingly, which would suggest the CPU is the issue. This post says that you can have up to 16 other users sharing the cores, so you could expect some throttling.

However, that wouldn’t explain your performance-cpu test. You’d expect that to be much better :thinking:. Maybe SSH into the machine, run htop (or equivalent) to see the processes, and then run a slightly longer test to see if you get the expected number of child workers. Not sure.

More info re: performance-cpu-1x@2gb

When under load, I’m showing that it’s ~17% user and ~83% system.

When under load, network I/O maxes out at ~235 kbit/s (~30 KB/s). That happens when CPU utilization hits 100%, so it seems like I’m CPU-bound rather than network-bound?

When under load, I get no more than ~3.5 RPS.

When not under load, my web monitor shows response times of ~300ms — which is right in line with ~3.3 RPS.

So: consistent. But slower than I would have expected for a “performance CPU”?
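That consistency is what you’d expect if requests are effectively being handled one at a time. A quick sanity check of the arithmetic (Python, purely illustrative):

```python
# A single serial stream of 300ms requests yields 1/0.3 ≈ 3.3 RPS.
# Ten wrk connections should multiply that if requests ran in parallel;
# the fact that they don't suggests something is serializing them.
latency_s = 0.300
rps = 1 / latency_s
print(f"{rps:.1f} RPS")  # → 3.3 RPS
```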

Not sure where else to look?

But still hoping it’s a config problem!

P.S. I’ve checked that I’m not in debug mode, not overrunning any of the OPcaches, etc. And, again, my starting point is fly.io’s official Laravel image; I’m just turning the most obvious knobs.

I haven’t used PHP in about 20 years, but even back then it could do much better than 3 qps.

I’m testing against shared-cpu-1x:512MB with an Elixir/Phoenix application.

On a very heavy page it ends up being CPU-bound, but still does 43 qps at 100% CPU.
On a lighter page which serves markdown, it does 150 qps at 20% CPU, so it can probably do 500+ qps.

I would expect php to be slower, but not 10x slower. Maybe 2-3x.

CPU utilization (screenshot omitted)

Heavy page

$ wrk -c 10 -d 180s -t 5 --latency 'https://abc.fly.dev/'
Running 3m test...
  5 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   228.39ms   33.94ms 545.91ms   80.99%
    Req/Sec     9.21      2.43    20.00     79.16%
  Latency Distribution
     50%  226.24ms
     75%  240.74ms
     90%  259.69ms
     99%  359.49ms
  7881 requests in 3.00m, 2.53GB read
Requests/sec:     43.76
Transfer/sec:     14.39MB

markdown page

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    68.57ms   14.95ms 285.53ms   96.66%
    Req/Sec    29.47      8.08    40.00     37.44%
  Latency Distribution
     50%   66.10ms
     75%   66.70ms
     90%   68.17ms
     99%  135.25ms
  8831 requests in 1.00m, 445.65MB read
Requests/sec:    146.99
Transfer/sec:      7.42MB

Well, we agree on that: these numbers don’t make sense.

I just don’t know where else to look.

php-fpm has an optional /ping page, which is served directly by php-fpm itself. I enabled it, and this is what I got:

$ wrk -c 25 -d 120s -t 2 --latency 'https://....fly.dev/ping'
Running 2m test @ https://....fly.dev/ping
  2 threads and 25 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.37ms   65.86ms   1.57s    96.54%
    Req/Sec     1.53k   250.22     2.23k    76.84%
  Latency Distribution
     50%    5.66ms
     75%    8.35ms
     90%   51.95ms
     99%  278.21ms
  350347 requests in 2.00m, 165.37MB read
  Socket errors: connect 0, read 0, write 0, timeout 25
Requests/sec:   2917.94
Transfer/sec:      1.38MB
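(For anyone wanting to reproduce this: the ping endpoint is enabled in the php-fpm pool config, roughly as below. The nginx side also needs a matching location block that fastcgi_passes /ping to the pool; exact file paths depend on the image.)

```ini
; php-fpm pool config (e.g. www.conf) — built-in health endpoint,
; answered by php-fpm directly without touching any application code.
ping.path = /ping
ping.response = pong
```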

So clearly there’s the potential for much better performance. But every time I cross into PHP — well, more specifically, into Laravel — I can’t do better than ~3.5 RPS.

The only time I’ve seen numbers this bad is when OPcache was mis-configured. But I’ve checked; OPcache is installed, and working. None of the caches has been overrun, and the cache hit rate is very, very high. I even set opcache.validate_timestamps to 0, to prevent OPcache from checking for file changes…

I’ve tried many other tweaks too — e.g., disabling the access logs — but nothing has made any difference.

This exact same Laravel code, btw, can perform > 100 RPS on a shared host that I use.

Anyway, as I said, I just don’t know where else to look.

Progress, of sorts.

First, I created a simple .php page that didn’t involve Laravel at all:

$ wrk -c 10 -d 10s -t 1 --latency 'https://....fly.dev/hello.php'
Running 10s test @ https://....fly.dev/hello.php
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.88ms   14.31ms 176.84ms   97.09%
    Req/Sec     2.04k   349.48     2.53k    90.00%
  Latency Distribution
     50%    4.38ms
     75%    5.24ms
     90%    6.86ms
     99%   91.05ms
  20289 requests in 10.01s, 7.62MB read
Requests/sec:   2027.67
Transfer/sec:    780.08KB

Holy cow.

So PHP isn’t the problem.

After that, I created a series of Laravel routes that incrementally added more complexity:

Route::get('/__hello__', function () {
    return 'Hello!';
})->name('hello');

Route::get('/__hello0__', function () {
    return response(
        'Hello 0!'
    );
})->name('hello0');

Route::get('/__hello1__', function () {
    return response(
        'Hello 1!'
    )->header('Content-Type', 'text/plain');
})->name('hello1');

Route::get('/__hello2__', function () {
    return response(
        view('hello2', [
        ])
    )->header('Content-Type', 'text/plain');
})->name('hello2');

Route::get('/__hello3__', function () {
    return response(
        view('hello3', [
            'pages' => Page::wherePublished()->orderByAToZ()->get(),
            'posts' => Post::wherePublished()->orderByNewestToOldest()->simplePaginate(25),
        ])
    )->header('Content-Type', 'text/plain');
})->name('hello3');

Route::get('/__hello4__', function () {
    return response(
        view('hello4', [
            'pages' => Page::wherePublished()->orderByAToZ()->get(),
            'posts' => Post::wherePublished()->orderByNewestToOldest()->simplePaginate(25),
        ])
    )->header('Content-Type', 'text/plain');
})->name('hello4');

(Both hello3 and hello4 execute the same database queries; the difference is in the views: hello3’s view doesn’t use the results, while hello4’s does.)

The first route was ok-ish:

wrk -c 10 -d 10s -t 1 --latency 'https://....fly.dev/__hello__'
Running 10s test @ https://....fly.dev/__hello__
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.16ms   12.44ms 122.58ms   70.57%
    Req/Sec   138.08     22.38   181.00     72.00%
  Latency Distribution
     50%   71.12ms
     75%   80.12ms
     90%   88.06ms
     99%  105.32ms
  1383 requests in 10.05s, 578.00KB read
Requests/sec:    137.57
Transfer/sec:     57.49KB

But the second was enough to drop me to ~3.5 RPS:

wrk -c 10 -d 10s -t 1 --latency 'https://....fly.dev/__hello0__'
Running 10s test @ https://....fly.dev/__hello0__
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.74s     7.80ms   1.75s    83.33%
    Req/Sec     7.36     10.65    40.00     85.71%
  Latency Distribution
     50%    1.74s 
     75%    1.74s 
     90%    1.75s 
     99%    1.75s 
  33 requests in 10.07s, 13.86KB read
  Socket errors: connect 0, read 0, write 0, timeout 27
Requests/sec:      3.28
Transfer/sec:      1.38KB

Why? Why does simply creating a response object destroy performance? Why why why?

Not sure.

Not even sure where to start looking.

I did a fly ssh console and ran php artisan about to verify the general app state:

  Environment ...........................................................  
  Application Name .............................................. Laravel  
  Laravel Version ............................................... 10.45.0  
  PHP Version ...................... 8.3.3-1+ubuntu22.04.1+deb.sury.org+1  
  Composer Version ................................................ 2.7.1  
  Environment ................................................ production  
  Debug Mode ........................................................ OFF  
  URL ....................................................... ....fly.dev  
  Maintenance Mode .................................................. OFF  

  Cache .................................................................  
  Config ......................................................... CACHED  
  Events ..................................................... NOT CACHED  
  Routes ......................................................... CACHED  
  Views .......................................................... CACHED  

  Drivers ...............................................................  
  Broadcasting ..................................................... null  
  Cache ............................................................ file  
  Database ....................................................... sqlite  
  Logs ........................................................... stderr  
  Mail ............................................................. smtp  
  Queue ............................................................ sync  
  Session ........................................................ cookie  

  Filament ..............................................................  
  Packages .............. filament, forms, notifications, support, tables  
  Version ....................................................... v3.2.35  
  Views ................................................... NOT PUBLISHED  

  Livewire ..............................................................  
  Livewire ....................................................... v3.4.6  

Not in debug mode, config is cached, routes are cached, views are cached…

Hmmm.

I don’t know how much more time I’m going to sink into this.

The difference between hello3 and hello4 is Laravel overhead. Would be curious to know what that is, but it seems like your database query is the problem.

  1. How many rows are you looking at?
  2. Do you have any indexes?
  3. Add other common database questions.

Don’t give up now. You’re close.

Clearly the bottleneck is the DB.

I’m hosting a laravel app and I don’t have your speed issue at all.

Maybe you should monitor your db performance, or try to retrieve your rows by chunks.

I had another look at this as I was curious!

I would disagree with the DB being the bottleneck. That would certainly explain slowness on a request like __hello3__, but it wouldn’t explain the slowness with __hello0__, since that request doesn’t query the database at all.

To try to replicate what you are seeing, I just created a brand new Laravel app (10.46) with all of the same settings you have.

Locally I hit both routes using wrk. Same RPS for both endpoints, so Laravel itself can’t be the issue; otherwise the numbers would differ locally too.

I deployed to Fly to recreate that, using the same fly launch, which uses the same PHP 8.3 version, etc. To eliminate any load balancing/scaling, I dropped it down to just one machine.

And hit that using wrk.

Again, I couldn’t recreate what you are seeing.

These were my results on a shared CPU 1x, 256 MB RAM (the cheapest of all the machines):

wrk -c 10 -d 10s -t 1 --latency 'https://...fly.dev/__hello__'
Running 10s test @ https://...fly.dev/__hello__
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   184.92ms   14.61ms 269.96ms   94.31%
    Req/Sec    53.83     23.45    99.00     55.79%
  Latency Distribution
     50%  183.38ms
     75%  187.89ms
     90%  192.47ms
     99%  259.61ms
  527 requests in 10.07s, 1.11MB read
Requests/sec:     52.34
Transfer/sec:    112.60KB

And your other endpoint, using the response helper (which shouldn’t make any difference, per the Laravel 10.x HTTP Responses docs) …

wrk -c 10 -d 10s -t 1 --latency 'https://....dev/__hello0__'
Running 10s test @ https://...fly.dev/__hello0__
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   192.20ms   17.47ms 301.79ms   86.51%
    Req/Sec    51.52     22.86    99.00     70.53%
  Latency Distribution
     50%  190.76ms
     75%  196.69ms
     90%  202.96ms
     99%  267.79ms
  504 requests in 10.07s, 1.06MB read
Requests/sec:     50.06
Transfer/sec:    107.79KB

All good.

As a bonus, I wondered about throwing more at it so upped to 20 concurrent …

wrk -c 20 -d 15s -t 1 --latency 'https://...fly.dev/__hello__'
Running 15s test @ https://...fly.dev/__hello__
  1 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   194.51ms   17.77ms 287.96ms   80.08%
    Req/Sec   103.09     32.63   171.00     60.27%
  Latency Distribution
     50%  192.88ms
     75%  200.45ms
     90%  208.77ms
     99%  269.46ms
  1516 requests in 15.08s, 3.19MB read
Requests/sec:    100.56
Transfer/sec:    216.49KB

Double the RPS. Machine still handled it with ease.

I SSH-ed in to the machine, ran htop, and while the CPU did get up to 50%, no flashing red lights there either.

And this is despite Fly seemingly routing my requests via IAD of all places (at least based on the request ID it returns), even though the app is in LHR. So I get even worse latency.

Fly does have an option (on by default) to automatically stop idle machines. That causes a terrible initial latency, which would skew the timings; I found it was 4-5s to begin with. So if you are running tests for only 10s, that could be a factor, as for half the test it would be serving 0 RPS. But other than that … not sure.
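To put rough numbers on how much an auto-stop cold start would skew a short run (illustrative figures only):

```python
# If the machine spends the first ~5s of a 10s test waking up and
# serving ~0 RPS, the measured average is roughly halved.
test_s = 10
cold_start_s = 5   # assumed wake-up time
true_rps = 100     # assumed steady-state throughput
measured_rps = true_rps * (test_s - cold_start_s) / test_s
print(measured_rps)  # → 50.0
```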

:thinking:

@tj1 @AsymetricalData Definitely not the database! I hit the slowdown in the second test, and the second test doesn’t touch the database at all.

@greg Thanks very much for that, Greg! I’ll try to reproduce what you did, and take it from there…

Sorry, forgot to reply to that. I turned off auto-start/stop before doing any testing at all. :slight_smile:

@michaell Ah, ok, so it’s not that the machines are stopped to begin with, then. Weird!

All I did was what I assumed you did :slight_smile:

composer create-project laravel/laravel example-name
cd example-name
(added your two routes to web.php to replicate them)
npm install
npm run build (not strictly needed, as your responses don’t use the built assets, but might as well)
php artisan serve (ok, all good, works locally, so time to deploy it)
fly launch (adds the Dockerfile and friends to tell Fly how to build the image)
fly scale count 1 (since by default it makes 2)

Your fly launch output will show the machine size. I didn’t use a “performance” machine, but given you reported RPS issues even with a shared CPU, I figured it can’t be that.

I didn’t edit anything like opcache, logs, php.ini etc. My php artisan about looked the same as yours.

Something that may be worth trying, even if it merely rules something out. If it isn’t the database, and isn’t Laravel, perhaps it is the network?

Create a file named Dockerfile.wrk

FROM ubuntu
RUN apt-get update && apt-get install -y wrk

Launch it using:

fly console --dockerfile Dockerfile.wrk -C bash

And then run wrk from within the fly.io network.

Wow, I had completely misread your post.

Can you share your URL? I want to run wrk myself, as I’m wondering if there’s something off with your benchmarking setup at this point.

@greg Thank you sooooo much!

I found the problem.

Before starting with a brand new Laravel app, I thought about what I have — which is really pretty simple, pretty straightforward.

It’s just a test app, after all.

But it does have two parts, a pure-Laravel front-end and a FilamentPHP-based back-end.

FilamentPHP is an amazing way to create back-ends very quickly. But it is, shall we say, rather resource-intensive.

FilamentPHP recommends that you run two artisan commands in production:

php artisan filament:cache-components
php artisan icons:cache

I hadn’t added them to the generated caches.sh script in the .fly directory, because all of my performance tests were focused on the front-end.

It turns out, though, that FilamentPHP affects front-end performance, even though there isn’t any FilamentPHP code in the front-end!
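For reference, the deploy-time cache script ends up looking roughly like this (a reconstruction, not the literal generated file; the first three commands are the standard Laravel ones):

```sh
#!/usr/bin/env sh
# Deploy-time cache warm-up (sketch).
php artisan config:cache
php artisan route:cache
php artisan view:cache

# The two FilamentPHP-specific caches that were missing:
php artisan filament:cache-components
php artisan icons:cache
```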

I added those two lines to the script, re-deployed — and that fixed the problem:

$ wrk -c 10 -d 10s -t 1 --latency 'https://....fly.dev/'
Running 10s test @ https://....fly.dev/
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   104.64ms   17.36ms 243.84ms   80.10%
    Req/Sec    95.25     12.41   120.00     66.00%
  Latency Distribution
     50%  102.89ms
     75%  112.28ms
     90%  123.37ms
     99%  166.72ms
  954 requests in 10.06s, 7.21MB read
Requests/sec:     94.82
Transfer/sec:    733.57KB

$ wrk -c 20 -d 10s -t 1 --latency 'https://....fly.dev/'
Running 10s test @ https://....fly.dev/
  1 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   135.20ms  171.51ms   1.66s    95.34%
    Req/Sec    95.71     20.33   131.00     72.73%
  Latency Distribution
     50%  102.45ms
     75%  113.45ms
     90%  138.75ms
     99%    1.23s 
  528 requests in 10.07s, 3.99MB read
  Socket errors: connect 0, read 0, write 0, timeout 3
Requests/sec:     52.43
Transfer/sec:    405.66KB

Those results are based on my current config — PHP_PM_MAX_CHILDREN, hard_limit, etc. I think I may be able to beat them once I re-tune with this critical fix in place.

Fantastic.

Thanks to all, and most especially to @greg!

@michaell Ah … FilamentPHP. Yep, that’ll do it.

You’re welcome. It’s satisfying to get the answer :rocket:

OK, to perhaps bring all of this to a close…

Those last results were for a performance CPU, and not tuned.

Now, with some tuning, I’m getting:

  • ~95 RPS on shared-cpu-1x@512MB; and
  • ~185 RPS on shared-cpu-2x@512MB.

Clearly the performance CPU could have done much better — but shared is more than good enough!

Are these results from the route that hits SQLite (like hello4), or from hello0?

Those are the results from hitting the home page of the site. The home page is roughly equivalent to hello4: two database queries and a Blade-based page generated from the query results.

Apologies for dropping in late here, but I just wanted to add a quick clarification about performance vCPUs (for anyone who might come across this) related to a question that came up earlier.

If a particular server has enough hardware CPUs to run all of the Machine vCPUs that are ready to run, then we’ll let all of them run at full speed. As a result, if the server that hosts your shared-vCPU Machine isn’t too busy at the moment, then your shared vCPU should run as fast as a performance vCPU.

However, when there are more Machine vCPUs ready to run than there are hardware CPUs, performance vCPUs get 4x the time to run, and consequently they’ll be much faster. Furthermore, we limit how many vCPUs can be allocated to a server, and a performance vCPU counts the same as 4 shared vCPUs, so the worst-case performance of a performance vCPU is 4x better than that of a shared vCPU.

To summarize: if there’s extra CPU time available, then we’ll let shared vCPUs run faster, but if you depend on that, you may want to consider a performance vCPU.

In any case, I’m glad that you’ve figured out what was going on!
