Predictable Processor Performance

Many people have left negative comments about this, but as a neutral party who benefits from it, I'd like to say thank you!


Quotas are now enabled at 75%


Running a bit late today (sorry). Quotas will be enabled at 100% for the next hour. I’ll update this thread when I move them back to 75%.

Update, 2:50 EST: quotas are back to 75%.

Good thing I saw this thread; I was panicking about the baseline changing unexpectedly in my dashboards. I have a question about these changes: in theory, at 100% quota enforcement the baseline should be 6.25%, but since quotas are shared between a Machine’s vCPUs, shouldn’t the baseline on a shared-cpu-8x be 50% (40ms per 80ms period)? The reason I ask is that I saw 6.25% across my app during the 100% period (which makes sense if it were a shared-cpu-1x), but even outside that period I have been seeing 29.7% as the baseline.

Apologies if I am understanding all of this wrong. I just migrated some production applications over the last week, and I am trying to optimize them to run properly here.

Hey Rodolfo,

Your understanding is correct: quotas are shared across vCPUs. For machines with multiple vCPUs, like a shared-cpu-8x, the utilization and baseline are displayed on the dashboard as an average: the total load divided by the total number of vCPUs in the instance. This is why it still shows a baseline of 6.25%.

There’s a per-CPU Utilization panel on the “Fly Instance” tab for a detailed breakdown of utilization across each individual vCPU. But because quotas are shared between a machine’s vCPUs, only the average matters for the enforced limits.
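If it helps to sanity-check the numbers, here’s a rough sketch of the arithmetic, assuming the 80ms accounting period and the 1/16-of-a-vCPU baseline discussed in this thread (treat the constants as illustrative rather than official):

```python
# Baseline arithmetic sketch. Assumptions from this thread: an 80ms
# accounting period and a baseline of 1/16 of a vCPU (6.25%).
PERIOD_MS = 80
BASELINE_FRACTION = 1 / 16

def machine_baseline(vcpus: int) -> dict:
    # Total CPU time the whole Machine earns per accounting period.
    budget_ms = vcpus * BASELINE_FRACTION * PERIOD_MS
    # The dashboard shows the average across vCPUs, so the displayed
    # baseline stays at 6.25% regardless of the vCPU count.
    return {"budget_ms_per_period": budget_ms,
            "displayed_baseline_pct": BASELINE_FRACTION * 100}

print(machine_baseline(1))  # {'budget_ms_per_period': 5.0, 'displayed_baseline_pct': 6.25}
print(machine_baseline(8))  # {'budget_ms_per_period': 40.0, 'displayed_baseline_pct': 6.25}
```

So a shared-cpu-8x does get the 40ms per 80ms period you calculated; it just shows up as 6.25% per vCPU on the dashboard rather than as 50% of a single vCPU.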

Okay. So if, during the 100% quota period, I saw the baseline at 6.25% and each of my vCPUs was above it, that indicates I am going to run into throttling issues once enforcement is fully in place, correct?

For reference, this is one of my instances

Again, I am trying to find the sweet spot and understand the numbers. I have been using Fly in production since last Friday, and I have gone from shared-cpu-2x to shared-cpu-4x, and then to shared-cpu-8x across 7 machines. If, once quotas are at 100%, the charts show that the setup won’t be enough, I might need to revisit it.

I love your platform and how easy it is to scale horizontally and vertically, but my performance expectations were a bit higher, coming from decent performance on a single DO droplet. I will keep looking for tweaks to make it work; I have to be missing something.

Looking at this chart, yes, you’re most likely going to be throttled. It looks like every vCPU is constantly above 6.5%.

Shared CPU machines really aren’t intended for constant load across all vCPUs; that’s more the territory of Performance CPU machines. Shared CPU machines are better suited to spiky or uneven workloads.

Yeah. It looks like this machine would eventually get throttled if it kept on using CPU at that level.

It’s hard to give concrete recommendations without knowing more about your app. Depending on the concurrency model of your app, you might be better served by adding a machine or switching to a smaller number of performance vCPUs.
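To put rough numbers on “eventually get throttled”: under the usual burstable-CPU model, the balance drains while you run above baseline and refills while you run below it. Here’s a back-of-the-envelope sketch; the constants are illustrative and the real accounting may differ:

```python
# Rough estimate of how long an accrued CPU balance lasts under a
# sustained per-vCPU load above baseline. Illustrative only.
def seconds_until_throttled(balance_s: float,
                            sustained_util_pct: float,
                            baseline_pct: float = 6.25) -> float:
    overage = (sustained_util_pct - baseline_pct) / 100.0
    if overage <= 0:
        return float("inf")  # at or below baseline: the balance never drains
    return balance_s / overage

# Example: 5 minutes of balance, running steadily at 10% per vCPU,
# lasts roughly 8000 seconds (about 2.2 hours) before throttling.
print(seconds_until_throttled(300, 10.0))
```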

Yeah, I totally understand that. I am going to post a new question with more specific details, to stop hijacking this thread, and perhaps you or the team will be able to suggest a better approach. I really want to find a good setup here, cost- and performance-wise, and then move the rest of the company’s infrastructure over.

Thanks.


We’ve decided to postpone the final bump to 100% enforcement of CPU quotas until Monday, November 25th.

Continuing with the rollout schedule, we’re moving to 100% enforcement of quotas today. I’m starting with machines in the CDG region and will be continuing with other regions in a bit.

Moved to 100% enforcement in all regions as of 14:02 EST.


Did you guys move back to 75% enforcement? Looking at the metrics gives me that impression, but I’m not sure, given what has been going on since yesterday with the outage.


We did. Sorry for not posting an update here. A number of apps are having issues today and we’re trying to isolate the cause.


Until today, this was the case for me too.

Now, my machines are being heavily throttled. I used to have minutes of balance; now I have just a few milliseconds. My jobs are taking much longer to complete because of this.

I just don’t understand all this. How am I supposed to tell my app “hey, you’re running on a shared machine, so please just use 6.25% of the CPU” to avoid being throttled?

It looks like deleting my machine and creating a new one gives me plenty of balance to run my jobs. Here’s the profile of 3 jobs where the machine is started to process them and then stopped. All my jobs use about the same amount of CPU, so I’m not sure how this balance is being consumed. And if that’s right, it looks like I can work around this by simply deleting machines and creating new ones.

Would like to understand this better…
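For what it’s worth, this is how I’ve been trying to watch for throttling from inside the machine. I’m assuming that when the host throttles a vCPU it shows up in the guest as steal time in /proc/stat; that’s my assumption about how the enforcement is visible, not something I’ve confirmed from the docs:

```python
import time

def cpu_times():
    # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:9]]
    return sum(values), values[7]  # (total jiffies, steal jiffies)

def steal_percent(interval_s: float = 5.0) -> float:
    total1, steal1 = cpu_times()
    time.sleep(interval_s)
    total2, steal2 = cpu_times()
    delta = total2 - total1
    return 100.0 * (steal2 - steal1) / delta if delta else 0.0

# A persistently high steal percentage while a job runs suggests the
# machine is being throttled rather than just busy.
print(f"steal: {steal_percent():.1f}%")
```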

Maybe you could provide a flag in the [[experimental]] section of fly.toml so that apps can opt in to being properly throttled. I say “properly” because throttling is never pleasant, but it needs to be done for fairness, and in the future it will be applied one way or another. So it would be great to have a flag that turns on that future mode explicitly, so we could prepare our apps.

I have spent half of my day catching bugs, slowdowns, and timeouts caused by throttling. I even caught a few race conditions I would never have found without throttling applied. This is why a separate opt-in flag would be much appreciated: we could debug our apps more easily beforehand, one by one, while getting a better understanding of where each problem comes from. That would help us avoid the doomsday effect we experienced today.

I also compared Fly throttling to Digital Ocean throttling. In terms of raw performance, they are almost the same, but Digital Ocean’s is easier to grasp because you can see everything just by running the standard top command in a shell. For example, a Digital Ocean droplet’s CPU load gauge shows correct measurements on a 0-100% scale, while Fly shows implementation-related fractions on a 0-6.25% scale. This makes the overall situation harder to understand, because a 0-100% range is what I expect from a CPU load reading and is much easier to reason about; anything else creates serious perceptual pitfalls.

One such pitfall is the kswapd process. By observing the CPU load it creates, it is easy to estimate how short on RAM the machine is (and how badly swap is thrashing). But because it shows up as just a few percent on a Fly app, it falls off the radar. This is how I ended up with a RAM-starved instance whose CPU was being thrashed by kswapd without paying enough attention to it, because it looked like “just a few % of CPU, that can’t hurt too much”.


I think this is a good idea; it would make it easier to prepare. As much as they made a good effort to let customers know and to stage the quota enforcement, there are simply too many customers who were not prepared and probably still don’t understand how the quota system works.

I completely agree that a CPU usage scale of 0-6.25% is very hard to plan for, or even comprehend. Most providers give you the usual 0-100% scale, which does not mean you have the whole CPU allocated; in Fly.io’s case it would mean that if you are using 6.25% you are shown as 100%, if you are using 3.125% you are shown as 50%, and so on.
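As a stopgap, the conversion is easy enough to do yourself; here’s a tiny sketch assuming the 6.25% per-vCPU baseline discussed above:

```python
# Convert the raw per-vCPU utilization shown on the dashboard into
# "percent of baseline", assuming a 6.25% per-vCPU baseline.
BASELINE_PCT = 6.25

def percent_of_baseline(raw_util_pct: float) -> float:
    return raw_util_pct / BASELINE_PCT * 100.0

print(percent_of_baseline(6.25))   # 100.0 -> using exactly your baseline
print(percent_of_baseline(3.125))  # 50.0
```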

I agree that being able to opt in to 100% quota enforcement for testing purposes would really help.

We found it difficult to prepare because none of the 25% / 50% / 75% limits impacted us. I was hoping the 1 hour at 100% would help, but it wasn’t long enough for us to actually run out of CPU credits, so we never hit the limit in that window, and it didn’t provide any useful data about what would happen to application performance once we actually got throttled.

Our application suffered a big outage yesterday, but I’m unclear to what extent that was because our instances were being throttled and to what extent it was related to the “Degraded API Performance” platform outage that happened around the same time. It was difficult to know which symptoms were the result of which cause.

It would also be really useful to get some clear communication about the timescale for going back to 100% enforcement, so we are not caught by surprise when it returns.


On Digital Ocean, if the CPU still shows 100% when you’re being throttled, how do you tell the difference between your app actually using 100% of the CPU and your app being throttled?

I believe (based on public docs) that Digital Ocean’s shared CPU is likely implemented through cgroups CPU Shares, which is exactly how shared Fly Machines were originally implemented. A droplet can utilize up to 100% of the CPU as long as the host is lightly loaded, but risks unpredictable future performance if/when a host fills up / experiences higher load and neighboring droplets compete for limited CPU cycles.

As @charsleysa mentioned earlier, the new CPU-quota system is more directly comparable to ‘burstable VM’ products like GCP Shared Core and AWS T3 instances (to this I would add Azure B-Series VMs). Looking at these comparable products, I don’t believe any of them normalize baseline CPU-utilization performance to 100% in the way you describe.
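For anyone curious about the mechanical difference between the two approaches, here’s a rough sketch using the generic cgroup v2 interfaces. This is general Linux behaviour with example paths, not a claim about how Fly’s hosts are actually configured: cpu.weight is a relative share that only bites when the host is contended, while cpu.max is a hard per-period quota that throttles regardless of how idle the host is.

```python
# Generic cgroup v2 illustration (requires root and an existing cgroup;
# the path is an example, not Fly's actual layout).
CGROUP = "/sys/fs/cgroup/example-machine"

def set_relative_share(weight: int = 100) -> None:
    # cpu.weight (1-10000, default 100): a proportional share that only
    # limits you when the host is busy; on an idle host you can still
    # burst to 100% of a CPU.
    with open(f"{CGROUP}/cpu.weight", "w") as f:
        f.write(str(weight))

def set_hard_quota(quota_ms: int = 40, period_ms: int = 80) -> None:
    # cpu.max ("<quota> <period>" in microseconds): 40ms per 80ms period
    # caps the group at 50% of one CPU even when the host is idle.
    with open(f"{CGROUP}/cpu.max", "w") as f:
        f.write(f"{quota_ms * 1000} {period_ms * 1000}")
```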