After the balance is depleted, any further CPU usage is heavily throttled and the worker can’t even complete its tasks, which usually take less than a minute. It gets into an unresponsive state: powered on, but unable to do anything, with the following logs:
Between this and the constant global incidents, I just can’t see how I can trust this platform to build anything more than just a regular web service…
Can you please give us an update on this matter? Is this really how it’s supposed to work? I don’t think anybody really understands it…
And are we all supposed to get on the $30 support plan to get these kinds of questions answered…? Come on guys, you have a great thing going here, don’t mess it up. I’d have preferred that you increased the price on the shared machines rather than come up with this weird throttling mechanism…
That chart can be a little misleading because utilization is only sampled at the time of metric collection, which I believe is every 15s. So if your utilization spikes and then drops between collections, the metric won’t show that spike.
Instead, take a look at the Load Average chart; it should give you a better picture of utilization.
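To make that concrete, here’s a rough simulation sketch (the 15s sampling interval and the 1-minute decay are my assumptions, not exact Fly internals): a short burst that the point samples miss entirely still shows up in a load-average-style rolling value.

```python
import numpy as np

dt = 1.0                                  # simulate in 1-second steps
t = np.arange(0, 120, dt)
usage = np.full_like(t, 0.05)             # 5% baseline utilization
usage[(t >= 20) & (t < 24)] = 1.0         # 4-second spike to 100%

# Point samples every 15s: what the utilization chart plots
samples = usage[::15]

# 1-minute exponentially decaying average, similar in spirit to load average
alpha = 1 - np.exp(-dt / 60.0)
ewma = np.zeros_like(usage)
ewma[0] = usage[0]
for i in range(1, len(usage)):
    ewma[i] = ewma[i - 1] + alpha * (usage[i] - ewma[i - 1])

print("max point sample: ", samples.max())                 # stays at the 5% baseline
print("max rolling value:", round(float(ewma.max()), 3))   # the spike is visible
```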
Likely this. People that aren’t impacted aren’t likely to comment at all. Even many of those that are impacted will just accept it and change their configuration to meet their workload needs.
For us, our workers are impacted, but the heavy spikes are far enough apart not to drain the credit balance, and if it ever does drain we’ll monitor for any major impact and see whether there are changes we need to make. Our monolithic API servers barely breach the threshold, so it’s not really a concern on the API side; then again, we avoid heavy processing on the same server, opting instead for background jobs that are handled by the workers.
Hm… I think I get where you’re coming from overall.
But the intent, I believe, is that it’s the pressure that’s commercial, not that no one ever gives any hint as to what was off-base.
(@Hypermind has given some of the best “what is off” feedback in this entire forum, in my opinion.)
The Fly.io that we all know and love exists only in equilibrium—a balance between what they want to (and can afford to!) provide, and what we care to use. Like the hot-air balloon that is its emblem, it stays in the air purely by the arts and continuous attention of Archimedean buoyancy.
Fisticuffs in the wicker basket suspended, swayingly, way up there would surely not improve this future…
My worker-type machines work fine for a while until they don’t, because they are being heavily throttled. My jobs start failing with timeouts when this happens. I asked a while back whether I’d be affected by this, because the workers are only created/started to perform work and then stopped/deleted. Do I need to keep them started for a while after they’ve performed their jobs to recover some balance?
I’m sorry, but I just don’t see the “predictable” part in all this. A machine doing the same kind of work will behave just fine for a few jobs until it doesn’t. I have no way to control this or even know beforehand. How am I supposed to tell the processes running on the machines “hey, you’re running on a Fly.io shared machine, please don’t use over 6.25% of your overall CPU”?
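For reference, the only OS-level knob I know of that expresses that kind of cap is a CFS quota via cgroup v2, which is roughly the mechanism this sort of throttling is usually built on. A hypothetical sketch, not anything Fly exposes (the cgroup path is made up, and it needs root plus a cgroup v2 hierarchy):

```python
import os

CGROUP = "/sys/fs/cgroup/worker-cap"  # hypothetical cgroup name
QUOTA_US = 6_250                      # 6.25% of one core...
PERIOD_US = 100_000                   # ...per 100 ms scheduling period

os.makedirs(CGROUP, exist_ok=True)

# cpu.max takes "<quota> <period>" in microseconds
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write(f"{QUOTA_US} {PERIOD_US}\n")

# Move the current process (and its future children) into the capped group
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```

Of course that only shifts the problem: the job still can’t use more CPU than the cap, it just gets stretched out instead of hitting the throttle later.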
I guess you need a larger machine for demanding worker jobs. I was able to get away with scaling our fleet to shared-2x and shared-4x instances. It’s also worth paying close attention to the kswapd process to avoid RAM thrashing from non-stop swapping to the swap file and back. That’s what was killing one of our apps: the CPU spent most of its time in kswapd, and it brought the whole performance balance way down. It’s so easy to miss when you’re on a 0–6.25% CPU scale.
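If it helps anyone, here’s a minimal sketch of one way to watch for that (Linux-only, reading /proc/vmstat; the 5-second window and the pages/sec threshold are arbitrary numbers I picked, not recommendations):

```python
import time

INTERVAL = 5  # seconds between readings

def swap_counters():
    # Read cumulative pages-swapped-in/out counters from /proc/vmstat
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

prev = swap_counters()
while True:
    time.sleep(INTERVAL)
    cur = swap_counters()
    swap_in = (cur["pswpin"] - prev["pswpin"]) / INTERVAL
    swap_out = (cur["pswpout"] - prev["pswpout"]) / INTERVAL
    if swap_in + swap_out > 100:  # pages per second; arbitrary threshold
        print(f"possible thrashing: in={swap_in:.0f}/s out={swap_out:.0f}/s")
    prev = cur
```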
That was my experience as well; the 0–6.25 scale is so weird and not future-proof. It’s a given that newer, better CPUs will come along to replace the existing AMD EPYCs, so that magical 6.25 value is under persistent natural pressure to change to something else, like 3.125 or lower.
P.S. Fly provides incredible value right now in terms of CPU power per price, pretty far ahead of its competitors. But it’s like free cheese in a mousetrap; it cannot last forever. CPU power has associated costs: electricity, amortization, etc.
I’m already running my workloads on shared-8x/8gb instances. Upgrading to performance-8x would be a huge cost increase I cannot take.
I do agree that Fly’s pricing for shared machines is extremely competitive, but this change makes it more complicated than it should be, in my opinion. As I said, I’d much rather see an increase in shared machine pricing than have to deal with this very complex throttling/balance logic.
shared-8x has 8 cores, shared, but 8 in the end.
performance-1x has only a single core.
My workload benefits from multiple cores, but it’s not 100% CPU-bound, so it’s not worth going for performance machines.
A few months ago, well before this was announced, I benchmarked all the machine sizes for my workloads, and the best performance and cost came from shared-8x. CPU usage was averaging around 40%, so I thought that was perfectly fine and going with performance machines wasn’t needed. Now we’re told 40% CPU usage is way above the allowed quota for shared machines.
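To put numbers on why 40% turns out to be way above quota, here’s a back-of-the-envelope sketch assuming the 6.25%-per-vCPU baseline discussed in this thread (the balance figure is a made-up placeholder, since I don’t know the real accrual numbers):

```python
vcpus = 8                        # shared-8x
baseline_frac = 0.0625           # assumed 6.25% baseline per shared vCPU
utilization_frac = 0.40          # observed average utilization
balance_core_seconds = 1_800.0   # hypothetical accrued balance, in core-seconds

baseline_cores = vcpus * baseline_frac   # 0.5 cores covered by the baseline
used_cores = vcpus * utilization_frac    # 3.2 cores actually used on average
burn_rate = used_cores - baseline_cores  # core-seconds spent per wall-clock second

if burn_rate > 0:
    minutes_left = balance_core_seconds / burn_rate / 60
    print(f"Balance drains in ~{minutes_left:.0f} minutes at this load")
else:
    print("At or below baseline: balance accrues instead of draining")
```

At 40% average the machine is using more than six times its baseline, so whatever balance it accrues while idle disappears quickly once a job starts.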
Maybe it doesn’t matter too much, because the OS task scheduler multiplexes the CPU cores among the running threads anyway. In theory, the only thing the number of CPU cores affects is response latency; in practice, though, it also affects corner cases of a particular app.
Taking into account the highly informative feedback from @empz, it seems Fly could consider providing additional CPU tiers that combine the traits of the shared and performance plans. For example:
flex-2x
flex-4x
flex-8x
For example, a flex tier could be the same thing as shared but with a 50% cap per core instead of 30% or 6.25%.
I’ve been away for the past several days, hence the silence. The rollout to 100% unfortunately happened on the same day as a few incidents that caused widespread problems. We’re currently discussing our plans to resume the rollout as well as any changes that we ought to make before doing so. We’ll update you here when we figure out our plan. For now, I can say that we’ll give at least a week’s notice before increasing quota enforcement again.