Laravel 10 App Strange Performance Drops - From 158ms to 10s

Hi All,

I’m looking at moving my Laravel 10 stack from a Hetzner virtual machine to Fly. I’ve deployed my app successfully, but I’m seeing strange performance issues at intervals when accessing my main page. The page usually responds and loads in under 200ms. However, sometimes it spikes inexplicably and can take over 10s to load a static page. I don’t get these issues on my Hetzner VM, so I can conclude it’s something in my new Fly setup.

I was troubleshooting using ApacheBench with the command ab -n 6336 -c 20 https://example.com/
DNS for example.com is managed by Cloudflare. I’m using a custom domain with dedicated IPv4 and IPv6 addresses on my app, with those addresses added as A and AAAA records in Cloudflare respectively.

These issues are also present when using the default *.fly.dev URL, so Cloudflare is not the issue.

I currently only have 1 machine provisioned on Fly which is shared-cpu-2x@1024MB deployed in LHR.

Some images to show the issue, all of these were taken during the ApacheBench stress test.

[Image: CPU Usage]
[Image: Firecracker Usage]
[Image: HTTP Response Times]
[Image: Fast, Normal App Response Times]
[Image: During Spikes]
[Image: HTTP Response Times 2]

These issues are really the only thing preventing me from making the switch to Fly. I don’t really know much about the backend networking, and any optimisations that are needed, but my current VM with the same specifications at Hetzner does not suffer from these issues.

For context, my Hetzner VM can handle ~20 RPS, whereas the same site on Fly can only manage a measly 9 RPS. The difference is significant and quite obvious when using my site.

I thought it may have been that PHP simply couldn’t handle that many requests on Fly; however, even a single user (myself) simply navigating around the site triggers the issue. There are no database requests on the main page, and my database is on PlanetScale, also in LHR.

Any thoughts? Happy to provide more details if needed.

Thanks

Hello @DropShift, and welcome to Fly.io!

It’s exciting that you’re looking into possibly moving your Laravel stack over to Fly, that’s really great to read!

Now, on to your post. There’s a specific configuration trait I suspect might be causing the strange increase in response duration of your Fly app.

See, by default a Fly application has auto_stop_machines=true set in its fly.toml config. This means that idle Fly Machines (those not handling any traffic) running your Fly App get auto-stopped by the Fly proxy. When the Fly app then receives a request, the Fly proxy needs to wake a Fly Machine up so that it can handle the request. This might be why the spike only happens at specific times.

Generally Fly Machines are supposed to wake fast, but even so, can you check if setting this auto_stop_machines config to false, and deploying, removes the spike in response time?

Hi Kathryn.

Thanks for the quick reply. This did occur to me, however I have already tried your suggestion and it did not solve the issue. The requests were being sent frequently enough that the server shouldn’t have spun down anyway. For clarity, here’s my full fly.toml config file:

# fly.toml app configuration file generated for example-app on 2024-02-17T20:58:48Z
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = 'example-app'
primary_region = 'lhr'
console_command = 'php /var/www/html/artisan tinker'
swap_size_mb = 512
auto_stop_machines = 'false'
auto_start_machines = true

[build]
[build.args]
NODE_VERSION = '20'
PHP_VERSION = '8.3'

[env]
APP_ENV = 'production'
...

[http_service]
internal_port = 8080
force_https = true
min_machines_running = 0
processes = ['app']

[[vm]]
memory = '1g'
cpu_kind = 'shared'
cpus = 2

[services.concurrency]
hard_limit = 500
soft_limit = 250
type = "requests"

Thanks again

Is this happening only when doing concurrent load testing, or also when a single user is clicking around your site? You may be running into the max FPM workers limit, which causes subsequent requests to queue until a worker is free.

Both. Although I have specifically seen FPM issues when using ab which I didn’t see when a single user is browsing. In either scenario, the issue still persists.

ab error:

[22-Feb-2024 18:26:17] WARNING: [pool www] server reached pm.max_children setting (10), consider raising it
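
As a rough way to reason about that warning, here is a back-of-the-envelope sketch in Python (not something from the actual setup). It assumes a simple closed-loop model: ab keeps a fixed number of requests in flight, each FPM worker handles one request at a time, and every request takes the same service time. The function name is hypothetical, the 158 ms figure is borrowed from the thread title, and the model ignores CPU contention on a shared-cpu machine.

```python
def closed_loop_estimate(clients, workers, service_time_s):
    """Estimate throughput and latency when `clients` requests are always
    in flight against `workers` single-request workers (simplified model)."""
    busy = min(clients, workers)          # workers actually doing work
    throughput = busy / service_time_s    # requests completed per second
    latency = clients / throughput        # Little's law: in-flight = throughput * latency
    queued = max(0, clients - workers)    # requests waiting for a free worker
    return throughput, latency, queued


# ab -c 20 against pm.max_children = 10, assuming ~158 ms per request
throughput, latency, queued = closed_loop_estimate(20, 10, 0.158)
print(f"{throughput:.1f} req/s, {latency * 1000:.0f} ms per request, {queued} queued")
```

Under these assumptions, half of the 20 in-flight requests are always waiting for a worker and the per-request latency roughly doubles. That is noticeable, but on its own it would not explain multi-second spikes.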

Hi!

I’m not totally clear on if you saw pm.max_children errors on Fly?

In any case, can you check the max_children setting on your Hetzner server to compare? The default is pretty low on a fresh install, so I wonder if it has been increased on your Hetzner instance.

The Dockerized version of PHP-FPM provided by fly launch will default to a max children of 10 (see here).

It’s actually controlled by environment variable, so one idea is to set PHP_PM_MAX_CHILDREN = 20 in your fly.toml’s environment section (use whatever value you want) and redeploy the app. That should get picked up from FPM and you can see if that improves matters.

Test that it works by ssh’ing in (fly ssh console) and running php-fpm8.3 -i | grep MAX.

It’s still possible the performance never matches Hetzner. To my knowledge, Hetzner is particularly good at CPU/bandwidth performance, but Fly is running on very different technology and does not optimize for the same things. That being said, do check out the performance CPUs and see if they change things (run fly platform vm-sizes).

Thanks for the info. I was seeing pm.max_children errors on Fly when hitting my app with ab.
I don’t see the error when a single user is browsing the site.

I implemented the update you suggested from 10 to 20, and confirmed it was deployed on the machine via SSH. However, the issue still persists:
[image]

The most frustrating thing is that it can load things quickly, and even outperforms Hetzner when it wants to:
[image]
It’s just not consistent, and this is the biggest issue. I can be browsing the site normally, loading pages in less than 500ms, then the next time I refresh it can take 10 seconds to load that very same page.

Thanks

Thanks! I don’t see anything obvious in the setup there. One fairly random thing to try is setting concurrency within the [http_service] section instead of services.concurrency.

My other suggestion is to try out the Sentry integration and see whether Sentry’s profiling (APM) can find the cause of those delays. It sounds network-related to me, but it’s hard to say whether that’s fly-proxy or something else (DNS lookups to your Redis instance or PlanetScale, for example) without more information.
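
If a full APM setup feels like too much to start with, a cruder check is to time the name lookups directly. The following is a minimal Python sketch, assuming Python is available wherever you run it. The helper names are hypothetical, and port 3306 is only an assumed MySQL default. It times repeated lookups with socket.getaddrinfo and reports the fastest and slowest attempt.

```python
import socket
import time


def sample_dns(host, port=3306, samples=5, resolver=socket.getaddrinfo):
    """Time repeated name lookups for host:port and return seconds per lookup.

    `resolver` is injectable so the function can be exercised without a network.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            resolver(host, port)
        except OSError:
            pass  # a failed lookup still tells us how long the attempt took
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (fastest, slowest) lookup time in milliseconds."""
    return min(timings) * 1000, max(timings) * 1000


# Example (replace the placeholder with your own database hostname):
# fastest_ms, slowest_ms = summarize(sample_dns("your-db-hostname", samples=20))
```

A large gap between the fastest and slowest lookup would point towards DNS rather than the app itself, though local resolver caching can hide the problem, so treat the numbers as a hint rather than proof.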

I’ve never used Sentry before, so I’m not 100% sure what I’m looking for; it seems like quite a comprehensive tool. As for [http_service.concurrency]: I did change it in my toml and, weirdly, the performance of my app does seem to have improved slightly. I still get processing spikes where the homepage takes ~8s to load, but they seem less frequent now. Did you ask me to make the change because it may have a performance impact, or because it’s just best practice?

Edit: I do see some warnings at the end of my deployment about fly-proxy not being able to reach my app, but I generally disregarded them as it looked to me like they were about SSH.

WARNING The app is not listening on the expected address and will not be reachable by fly-proxy.
You can fix this by configuring your app to listen on the following addresses:
  - 0.0.0.0:8080
Found these processes inside the machine with open listening sockets:
  PROCESS        | ADDRESSES
-----------------*---------------------------------------
  /.fly/hallpass | [ipv6]:22

Hi DropShift,

Those are not about SSH! It’s saying “I expect to find a process listening on :8080 but all I found is this one listening on :22 (that’s the SSH daemon)”. When you see those warnings it means your app process is not listening on :8080. It could have died, or it might not be accepting connections because… reasons. But this is the proxy telling you there’s no server process for it to send requests to.
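
One rough way to check this by hand is to try opening a TCP connection to the port. The sketch below is Python and assumes Python is available where you run it; is_listening is a hypothetical helper, not part of flyctl.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


# Example, run from inside the Machine:
# print(is_listening("127.0.0.1", 8080))
```

Keep in mind that a probe against 127.0.0.1 only shows that something accepts connections locally. A server bound to localhost only would pass this check and still be unreachable by the proxy, which is why the warning asks for 0.0.0.0:8080.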

Cheers!

  • Daniel

So I think I may have solved this one, and it did come down to database DNS issues and high latency.
Even though my Fly app and my database are both in London, by switching temporarily to a MySQL server hosted on Fly, the issues went away and my performance drastically improved.
Not sure why I wasn’t experiencing this on Hetzner, given that the server was in Finland, but possibly a combination of running my own router and encrypted DNS server made the responses faster? Or perhaps Hetzner just has better peering? It’s quite strange overall.

Thanks for all the help!

@DropShift We’ve seen problems before connecting to PlanetScale using their ‘optimized’ connection string which tries to route you to the correct region. Sometimes it doesn’t route correctly. We’re discussing how to best solve this.

Meanwhile, you can use the ‘direct’ connection string, which you can find in the PlanetScale dashboard. In your case that would mean prepending eu-west-2 to your hostname, so it would look something like mydb.eu-west-2.psdb.cloud.

Thanks for the info - I am currently using the “optimized” connection string, so this could help. I’ll change to the direct connection and report back.
Thanks again