CPU Quotas Update

I will be moving away from Fly soon. Our service was hugely affected. We run a small Rails app that was performing well on barely 200 requests per day, and even that now has trouble working properly. Honestly disappointed!! The cloud business is becoming a loot business.

1 Like

+1
Fly was really nice before, but now it’s going downhill.
A very simple app crashes after it has to process more than 3 requests.

1 Like

An email would have been nice for such a critical change, instead of just posting it here on the forums. I had 2 production incidents over the past 2 days, and after this I don’t know why I’m even trying to stay on this platform. It’s really not production-ready for anything beyond hobby projects…

2 Likes

People are bitter about the change, which is understandable, but as a counterpoint I want to share a screenshot of the Apdex (0.5s) metric from one of our services:

As you can see, the Apdex metric is now stable at 0.95, whereas before Fly’s throttling caps there were days when it dropped to 0.89.

3 Likes

I think the majority aren’t mad about the change itself; it’s more the lack of communication. The OP posted the update 2 days ago and has been radio silent since.

1 Like

For anyone else with a Django app that goes down on deploy even though things were working fine before this: I was able to resolve it by running two web-facing instances and adding health checks with a rolling deploy strategy. The deployment will now start up each machine one at a time and run your health checks to verify things are good before continuing to the next machine. If your app has a really long startup time, the health checks might time out and fail the deploy, but a failing deploy is better than an outage. Hopefully this helps someone else.
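For reference, here is a minimal fly.toml sketch of that setup; the port, health-check path, and timings are my own assumptions, so adjust them to your app:

```toml
# Sketch only: internal_port, /healthz, and the timings below are assumptions.
[deploy]
  strategy = "rolling"          # replace machines one at a time

[http_service]
  internal_port = 8000          # whatever port your Django server listens on

  [[http_service.checks]]
    grace_period = "30s"        # time allowed for boot before checks count
    interval = "15s"
    timeout = "5s"
    method = "GET"
    path = "/healthz"           # a health-check endpoint your app exposes
```

With at least two machines and checks like these, a machine that fails its checks stops the rollout instead of taking the whole app down.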

2 Likes

The shared CPU cores are effectively useless for any app that does work across cores. The performance-core pricing is high enough that we will be moving off Fly to a more traditional VPS setup.

Death by a thousand cuts at Fly, this was the last one.

1 Like

Is this 5 seconds at 100% for each CPU? Just wanted to confirm if the initial balances scale based on the size of the machine.

According to the docs, quotas are shared:

Quotas are shared between a Machine’s vCPUs. For example, a shared-cpu-2x Machine is allowed to run for 10ms per 80ms period, regardless of which vCPU is using that time.

(CPU Performance · Fly Docs)
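In other words (my own back-of-the-envelope arithmetic, not from the docs): at a 6.25% baseline per vCPU, a shared-cpu-2x accrues 2 × 6.25% × 80 ms = 10 ms of CPU time per 80 ms period, and because the quota is pooled, a single-threaded process can spend that whole allowance on one core, i.e. sit at roughly 12.5% of a single vCPU without drawing down the balance.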

I originally misunderstood that as well, but I recently downgraded some single-threaded Node servers to shared machines and can confirm that even though one core is at 20% utilization, my quotas are OK.

1 Like

Yes, like the baseline quota and max balance, the initial balance also scales with the number of vCPUs, so you can run a shared machine full-throttle (all vCPUs at 100%) for X (=5) seconds on start, regardless of size. (Actually it can run for X/0.9375 seconds: since the shared machine continues to accrue quota at the 6.25%/vCPU baseline, it only draws the balance down at 93.75%…)

After some internal discussion earlier today and based on feedback in this thread, we shipped a small update to increase this initial balance from 5 to 50 seconds. This quick change should immediately make deploys a bit smoother for apps with a heavier CPU load during boot, while we work on a longer-term solution to preserve balances between machine versions (or some other way to provide and account for burst when needed).
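To put numbers on that with the same X/0.9375 arithmetic: at 100% on every vCPU the balance drains at a net 93.75% per second, so the new 50-second initial balance gives roughly 50 / 0.9375 ≈ 53 seconds of full-throttle boot time, versus about 5.3 seconds before this change.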

12 Likes

I voiced the same concern with support. For me, it was the balance reset when you trigger a memory scale that caused an outage. Apparently the balance resets to 1%, which of course gets blown through immediately at startup.
As far as I know this was changed to 10%, so that’s much better.

I do suggest figuring out how to delay certain startup tasks to flatten out the CPU usage. I still have some work to do myself to accomplish this.

This sounds great. I’ll reconfigure my dev app back down to the smaller size and see what happens!

Agreed; I hadn’t been following the forums all that closely and wasn’t aware this change was in the works. Luckily I’m not running in production yet, as we’re still finishing product development (which is also why we didn’t notice the change while it was being tested late last year). But had I been in production, the first deploy after the throttling changes went live would have caused an outage. An email or other form of communication would have given us a heads-up and let us temporarily scale up in anticipation of any potential problems.

1 Like

I was running my workloads on shared-cpu-8x machines, but after this change, they started experiencing severe throttling, to the point where the workload could no longer finish. The workload isn’t entirely CPU-bound; it’s more of a mix of CPU, network, and disk usage. However, it benefits significantly from multiple cores.

As a result, I had to switch to performance-4x machines. Now, I’m paying more for less performance because my workload takes longer to complete due to running at half the concurrency. As someone pointed out, performance machines are quite expensive to operate for anything that needs to be profitable. On the other hand, a 6.25% CPU quota isn’t sufficient for anything beyond simple tasks like querying a database and returning a JSON response.

I chose Fly because I wanted to run more diverse workloads. If all I needed was to host a typical full-stack app, where the backend simply validates requests, queries a database, and formats responses, I would have stayed with one of the popular serverless cloud providers.

Now, I’m already exploring alternatives to move my workloads because I’m stuck paying for costly performance machines that don’t fully utilize the CPUs I’m paying for. The leap from 6.25% to 100% CPU is significant, and the corresponding price increase is just as steep. In my opinion, there should be an intermediate option to bridge this gap.

Lastly, this change feels contradictory to Fly’s core selling point for me: “Run machines only when you need them, stop them when you don’t, and only pay for active compute.” The new credit system immediately throttles a machine if it exceeds its threshold for even a few seconds. Worse, once the workload finishes and the machine is stopped, you don’t regain any of the spent credits. To recover credit, you have to keep the machine running, even if it’s idle. This undermines what I saw as Fly’s key advantage.

5 Likes

Well, a lot of complaints were made about this on the original thread, and yet it went ahead anyway. And now here we are, with the jobs on our machines crumbling.

Ironically, this change now makes processor performance unpredictable.

And I definitely agree that this change undermines Fly’s key advantage, so I hope this system can be revisited, maybe:

  • Higher baseline in general
  • Higher baseline with more vCPUs (e.g. shared-cpu-8x gets a 50% baseline)
  • Quota at app level or account level
  • Quota accumulates even with no machines running
  • Quota starts at full

etc.

1 Like

My constantly-throttled app seems to be running fine, even if it never has a boost balance.

Is it fine to have an app that never has a balance and is always throttled?

It does seem to cause problems in many cases, but it’s not intended to be an electric fence that you have to avoid…

(That was a huge thread, so hopefully more of the tips and clarifications buried in there will find their way into the official documentation—eventually.)

I’ve been running a Minecraft server on fly.io since September 2024. It worked great this whole time, running mostly on a shared-cpu-2x w/ 3GB of memory. We used it on 2025-01-26 in the evening with a group of about 6 players and noticed a bit of lag.

We were doing another session tonight, so I scaled it up, adding an extra gig of RAM and doubling the CPU to 4x.

When we tried to play this evening, the server crashed. After troubleshooting a handful of things, I noticed this new throttling behavior in my Grafana metrics that wasn’t there before. Looking back at our session on the 26th, I can see that there was no throttling and the balance stayed at “5ms” the whole time, despite CPU usage hovering around 55%, which is well above the 6.25% baseline.

After the redeploy, the CPU is throttled and the server is unusable. I had to scale it up to a performance-2x to get it usable again, and it still isn’t back to its original performance level.

I never received a single email warning about this breaking change, which I think is the worst part. If my server was consuming more resources than it should have, it certainly makes sense to enforce limits. The problems here are that I was never warned, that the limits only took effect after a deploy, and that it was never expressed until now that my CPU must stay below an arbitrary fraction of the usage it had been allowed for the past 4 months.

If my CPU was using more than it should have in the past, that should have been expressed as a percentage over 100.

Time to look for hosting companies that don’t pull the rug out from under my feet without warning.

2 Likes

I was hit by this last night/this morning. My postgres instance is now throttled to death.