After the balance is depleted, any further CPU usage is heavily throttled and the worker can’t even complete its tasks, which usually take less than a minute. It gets into an unresponsive state: powered on, but unable to do anything, with the following logs:
Between this and the constant global incidents, I just can’t see how I can trust this platform to build anything more than just a regular web service…
Can you please give us an update on this matter? Is this really how it’s supposed to work? I don’t think anybody really understands it…
And are we all supposed to get on the $30 support plan to get these kinds of questions answered…? Come on guys, you have a great thing going here, don’t mess it up. I’d have preferred that you increased the price on the shared machines rather than come up with this weird throttling mechanism…
That chart can be a little misleading because utilization is only sampled at the time of metric collection, which I believe is every 15s. So if your utilization spikes and then drops between collections, the metric won’t show that spike.
Instead, take a look at the Load Average chart; it should give you a better picture of utilization.
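To make that concrete, here’s a rough simulation sketch (the 15s sampling interval and the 1-minute decay are my assumptions, not exact Fly internals): a short burst that the point samples miss entirely still shows up in a load-average-style rolling value.

```python
import numpy as np

dt = 1.0                                  # simulate in 1-second steps
t = np.arange(0, 120, dt)
usage = np.full_like(t, 0.05)             # 5% baseline utilization
usage[(t >= 20) & (t < 24)] = 1.0         # 4-second spike to 100%

# Point samples every 15s: what the utilization chart plots
samples = usage[::15]

# 1-minute exponentially decaying average, similar in spirit to load average
alpha = 1 - np.exp(-dt / 60.0)
ewma = np.zeros_like(usage)
ewma[0] = usage[0]
for i in range(1, len(usage)):
    ewma[i] = ewma[i - 1] + alpha * (usage[i] - ewma[i - 1])

print("max point sample: ", samples.max())                 # stays at the 5% baseline
print("max rolling value:", round(float(ewma.max()), 3))   # the spike is visible
```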
Likely this. People that aren’t impacted aren’t likely to comment at all. Even many of those that are impacted will just accept it and change their configuration to meet their workload needs.
For us, our workers are impacted, but the heavy spikes are far enough apart not to drain the credit balance, and if it ever does drain we’ll monitor for any major impact and see whether there are changes we need to make. Our monolithic API servers barely breach the threshold, so it’s not really a concern on the API side; then again, we avoid heavy processing on the same server, opting instead for background jobs that are handled by the workers.
Hm… I think I get where you’re coming from overall.
But the intent, I believe, is that it’s the pressure that’s commercial, not that no one ever gives any hint as to what was off-base.
(@Hypermind has given some of the best “what is off” feedback in this entire forum, in my opinion.)
The Fly.io that we all know and love exists only in equilibrium—a balance between what they want to (and can afford to!) provide, and what we care to use. Like the hot-air balloon that is its emblem, it stays in the air purely by the arts and continuous attention of Archimedean buoyancy.
Fisticuffs in the wicker basket suspended, swayingly, way up there would surely not improve this future…
My worker-type machines work fine for a while until they don’t, because they are being heavily throttled. My jobs start failing with timeouts when this happens. I asked a while back whether I’d be affected by this, because the workers are only created/started to perform work and then stopped/deleted. Do I need to keep them started for a while after they’ve performed their jobs to recover some balance?
I’m sorry, but I just don’t see the “predictable” part in all this. A machine doing the same kind of work will behave just fine for a few jobs until it doesn’t. I have no way to control this or even know beforehand. How am I supposed to tell the processes running on the machines “hey, you’re running on a Fly.io shared machine, please don’t use over 6.25% of your overall CPU”?
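For reference, the only OS-level knob I know of that expresses that kind of cap is a CFS quota via cgroup v2, which is roughly the mechanism this sort of throttling is usually built on. A hypothetical sketch, not anything Fly exposes (the cgroup path is made up, and it needs root plus a cgroup v2 hierarchy):

```python
import os

CGROUP = "/sys/fs/cgroup/worker-cap"  # hypothetical cgroup name
QUOTA_US = 6_250                      # 6.25% of one core...
PERIOD_US = 100_000                   # ...per 100 ms scheduling period

os.makedirs(CGROUP, exist_ok=True)

# cpu.max takes "<quota> <period>" in microseconds
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write(f"{QUOTA_US} {PERIOD_US}\n")

# Move the current process (and its future children) into the capped group
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```

Of course that only shifts the problem: the job still can’t use more CPU than the cap, it just gets stretched out instead of hitting the throttle later.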
I guess you need a larger machine for demanding worker jobs. I was able to get away with scaling our fleet to shared-2x and shared-4x instances. It’s also worth paying close attention to the kswapd process to avoid RAM thrashing from non-stop swapping to the swap file and back. That’s what was killing one of our apps: the CPU spent most of its time in kswapd, and it brought the whole performance balance way down. It’s so easy to miss when you’re on a 0–6.25% CPU scale.
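If it helps anyone, here’s a minimal sketch of one way to watch for that (Linux-only, reading /proc/vmstat; the 5-second window and the pages/sec threshold are arbitrary numbers I picked, not recommendations):

```python
import time

INTERVAL = 5  # seconds between readings

def swap_counters():
    # Read cumulative pages-swapped-in/out counters from /proc/vmstat
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

prev = swap_counters()
while True:
    time.sleep(INTERVAL)
    cur = swap_counters()
    swap_in = (cur["pswpin"] - prev["pswpin"]) / INTERVAL
    swap_out = (cur["pswpout"] - prev["pswpout"]) / INTERVAL
    if swap_in + swap_out > 100:  # pages per second; arbitrary threshold
        print(f"possible thrashing: in={swap_in:.0f}/s out={swap_out:.0f}/s")
    prev = cur
```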
That was my experience as well; the 0–6.25 scale is so weird and not future-proof. It’s a given that newer, better CPUs will come along to replace the existing AMD EPYCs, so that magical 6.25 value is under persistent natural pressure to change to something else, like 3.125 or lower.
P.S. Fly provides incredible value right now in terms of CPU power per price, pretty far ahead of its competitors. But it’s like free cheese in a mousetrap; it cannot last forever. CPU power has associated costs: electricity, amortization, etc.
I’m already running my workloads on shared-8x/8gb instances. Upgrading to performance-8x would be a huge cost increase I cannot take.
I do agree that Fly’s pricing for shared machines is extremely competitive, but this change makes it more complicated than it should be, in my opinion. As I said, I’d much rather see an increase in shared machine pricing than have to deal with this very complex throttling/balance logic.
shared-8x has 8 cores, shared, but 8 in the end.
performance-1x has only a single core.
My workload benefits from multiple cores, but it’s not 100% CPU-bound, so it’s not worth going for performance machines.
A few months ago, well before this was announced, I benchmarked all the machine sizes for my workloads, and the best performance and cost came from shared-8x. CPU usage was averaging around 40%, so I thought that was perfectly fine and going with performance machines wasn’t needed. Now we’re told 40% CPU usage is way above the allowed quota for shared machines.
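To put numbers on why 40% turns out to be way above quota, here’s a back-of-the-envelope sketch assuming the 6.25%-per-vCPU baseline discussed in this thread (the balance figure is a made-up placeholder, since I don’t know the real accrual numbers):

```python
vcpus = 8                        # shared-8x
baseline_frac = 0.0625           # assumed 6.25% baseline per shared vCPU
utilization_frac = 0.40          # observed average utilization
balance_core_seconds = 1_800.0   # hypothetical accrued balance, in core-seconds

baseline_cores = vcpus * baseline_frac   # 0.5 cores covered by the baseline
used_cores = vcpus * utilization_frac    # 3.2 cores actually used on average
burn_rate = used_cores - baseline_cores  # core-seconds spent per wall-clock second

if burn_rate > 0:
    minutes_left = balance_core_seconds / burn_rate / 60
    print(f"Balance drains in ~{minutes_left:.0f} minutes at this load")
else:
    print("At or below baseline: balance accrues instead of draining")
```

At 40% average the machine is using more than six times its baseline, so whatever balance it accrues while idle disappears quickly once a job starts.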
Maybe it doesn’t matter too much, because the OS task scheduler multiplexes the CPU cores among the running threads anyway. In theory, the only thing the number of CPU cores affects is response latency; in practice, though, it also affects corner cases of a particular app.
Taking into account the highly informative feedback from @empz, it seems Fly could consider providing additional CPU tiers that combine the traits of the shared and performance plans. For example:
flex-2x
flex-4x
flex-8x
For example, a flex tier could be the same thing as shared but with a 50% cap per core instead of 30% or 6.25%.
I’ve been away for the past several days, hence the silence. The rollout to 100% unfortunately happened on the same day as a few incidents that caused widespread problems. We’re currently discussing our plans to resume the rollout as well as any changes that we ought to make before doing so. We’ll update you here when we figure out our plan. For now, I can say that we’ll give at least a week’s notice before increasing quota enforcement again.