Predictable Processor Performance

Many people have left negative comments about this, but as a neutral party who benefits from it, I'd like to say thank you!


Quotas are now enabled at 75%


Running a bit late today (sorry). Quotas will be enabled at 100% for the next hour. I’ll update this thread when I move them back to 75%.

Update, 2:50 EST: quotas are back to 75%.

Good thing I saw this thread; I was panicking about the baseline changing unexpectedly in my dashboards. I have a question about these changes: in theory, at 100% quota enforcement the baseline should be 6.25%, but since quotas are shared between a Machine’s vCPUs, shouldn’t the baseline on a shared-cpu-8x be 50% (40ms per 80ms period)? The reason I ask is that I saw 6.25% across my app during the 100% period (which makes sense if it were a shared-cpu-1x), but even outside that period I have been seeing 29.7% as the baseline.

Apologies if I am understanding all of this wrong. I just migrated some production applications over the last week, and I am trying to optimize them to run properly here.

Hey Rodolfo,

Your understanding is correct: quotas are shared across vCPUs. For machines with multiple vCPUs, like a shared-cpu-8x, the utilization and baseline are displayed on the dashboard as an average: the total load divided by the total number of vCPUs in the instance. This is why it still shows a baseline of 6.25%.

There’s a per-CPU Utilization panel on the “Fly Instance” tab for a detailed breakdown of utilization across each individual vCPU. But because quotas are shared between a machine’s vCPUs, only the average matters for the enforced limits.
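If it helps to sanity-check the numbers, here’s a rough sketch of the arithmetic, assuming the 80ms accounting period and the 1/16-of-a-vCPU baseline discussed in this thread (treat the constants as illustrative rather than official):

```python
# Baseline arithmetic sketch. Assumptions from this thread: an 80ms
# accounting period and a baseline of 1/16 of a vCPU (6.25%).
PERIOD_MS = 80
BASELINE_FRACTION = 1 / 16

def machine_baseline(vcpus: int) -> dict:
    # Total CPU time the whole Machine earns per accounting period.
    budget_ms = vcpus * BASELINE_FRACTION * PERIOD_MS
    # The dashboard shows the average across vCPUs, so the displayed
    # baseline stays at 6.25% regardless of the vCPU count.
    return {"budget_ms_per_period": budget_ms,
            "displayed_baseline_pct": BASELINE_FRACTION * 100}

print(machine_baseline(1))  # {'budget_ms_per_period': 5.0, 'displayed_baseline_pct': 6.25}
print(machine_baseline(8))  # {'budget_ms_per_period': 40.0, 'displayed_baseline_pct': 6.25}
```

So a shared-cpu-8x does get the 40ms per 80ms period you calculated; it just shows up as 6.25% per vCPU on the dashboard rather than as 50% of a single vCPU.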

Okay. So if, during the 100% quota period, I saw the baseline at 6.25% and each of my vCPUs was above it, that indicates I am going to run into throttling issues once enforcement is fully in place, correct?

For reference, this is one of my instances

Again, I am trying to find the sweet spot and understand the numbers. I have been using Fly in production since last Friday, and I have gone from shared-cpu-2x to shared-cpu-4x, and then to shared-cpu-8x across 7 machines. If, once quotas are at 100%, the charts show that the setup won’t be enough, I might need to revisit it.

I love your platform and how easy it is to scale horizontally and vertically, but my performance expectations were a bit higher, coming from decent performance on a single DO droplet. I will keep looking for tweaks to make it work; I have to be missing something.

Looking at this chart, yes, you’re most likely going to be throttled. It looks like every vCPU is constantly above 6.5%.

Shared CPU machines really aren’t intended for constant load across all vCPUs; that’s more the territory of Performance CPU machines. Shared CPU machines are better suited to spiky or uneven workloads.

Yeah. It looks like this machine would eventually get throttled if it kept on using CPU at that level.

It’s hard to give concrete recommendations without knowing more about your app. Depending on the concurrency model of your app, you might be better served by adding a machine or switching to a smaller number of performance vCPUs.
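To put rough numbers on “eventually get throttled”: under the usual burstable-CPU model, the balance drains while you run above baseline and refills while you run below it. Here’s a back-of-the-envelope sketch; the constants are illustrative and the real accounting may differ:

```python
# Rough estimate of how long an accrued CPU balance lasts under a
# sustained per-vCPU load above baseline. Illustrative only.
def seconds_until_throttled(balance_s: float,
                            sustained_util_pct: float,
                            baseline_pct: float = 6.25) -> float:
    overage = (sustained_util_pct - baseline_pct) / 100.0
    if overage <= 0:
        return float("inf")  # at or below baseline: the balance never drains
    return balance_s / overage

# Example: 5 minutes of balance, running steadily at 10% per vCPU,
# lasts roughly 8000 seconds (about 2.2 hours) before throttling.
print(seconds_until_throttled(300, 10.0))
```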

Yeah, I totally understand that. I am going to post a new question with more specific details, to stop hijacking this thread, and perhaps you or the team will be able to suggest a better approach. I really want to find a good setup here, cost- and performance-wise, and then move the rest of the company’s infrastructure over.

Thanks.


We’ve decided to postpone the final bump to 100% enforcement of CPU quotas until Monday, November 25th.

Continuing with the rollout schedule, we’re moving to 100% enforcement of quotas today. I’m starting with machines in the CDG region and will be continuing with other regions in a bit.

Moved to 100% enforcement in all regions as of 14:02 EST.


Did you guys move back to 75% enforcement? Looking at the metrics gives me that impression, but I’m not sure, given what has been going on since yesterday with the outage.


We did. Sorry for not posting an update here. A number of apps are having issues today and we’re trying to isolate the cause.


Until today, this was the case for me too.

Now, my machines are being heavily throttled. I used to have minutes of balance; now I have just a few milliseconds. My jobs are taking much longer to complete because of this.

I just don’t understand all this. How am I supposed to tell my app “hey, you’re running on a shared machine, so please just use 6.25% of the CPU” to avoid being throttled?

It looks like deleting my machine and creating a new one gives me plenty of balance to run my jobs. Here’s the profile of 3 jobs where the machine is started to process them and then stopped. All my jobs use about the same amount of CPU, so I’m not sure how this balance is being consumed. And if that’s right, it looks like I can work around this by simply deleting machines and creating new ones.

Would like to understand this better…
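For what it’s worth, this is how I’ve been trying to watch for throttling from inside the machine. I’m assuming that when the host throttles a vCPU it shows up in the guest as steal time in /proc/stat; that’s my assumption about how the enforcement is visible, not something I’ve confirmed from the docs:

```python
import time

def cpu_times():
    # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:9]]
    return sum(values), values[7]  # (total jiffies, steal jiffies)

def steal_percent(interval_s: float = 5.0) -> float:
    total1, steal1 = cpu_times()
    time.sleep(interval_s)
    total2, steal2 = cpu_times()
    delta = total2 - total1
    return 100.0 * (steal2 - steal1) / delta if delta else 0.0

# A persistently high steal percentage while a job runs suggests the
# machine is being throttled rather than just busy.
print(f"steal: {steal_percent():.1f}%")
```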

Maybe you could provide a flag in the [[experimental]] section of fly.toml so that apps can opt in to being properly throttled. I say “properly” because throttling is never pleasant, but it needs to be done for fairness, and in the future it will be applied one way or another. So it would be great to have a flag that turns on that future mode explicitly, so we could prepare our apps.

I have spent half of my day catching bugs, slowdowns, and timeouts caused by throttling. I even caught a few race conditions I would never have found without throttling applied. This is why a separate opt-in flag would be much appreciated: we could debug our apps more easily beforehand, one by one, while getting a better understanding of where each problem comes from. That would help us avoid the doomsday effect we experienced today.

I also compared Fly throttling to Digital Ocean throttling. In terms of raw performance, they are almost the same, but Digital Ocean’s is easier to grasp because you can see everything just by running the standard top command in a shell. For example, a Digital Ocean droplet’s CPU load gauge shows correct measurements on a 0-100% scale, while Fly shows implementation-related fractions on a 0-6.25% scale. This makes the overall situation harder to understand, because a 0-100% range is what I expect from a CPU load reading and is much easier to reason about; anything else creates serious perceptual pitfalls.

One such pitfall is the kswapd process. By observing the CPU load it creates, it is easy to estimate how short on RAM the machine is (and how badly swap is thrashing). But because it shows up as just a few percent on a Fly app, it falls off the radar. This is how I ended up with a RAM-starved instance whose CPU was being thrashed by kswapd without paying enough attention to it, because it looked like “just a few % of CPU, that can’t hurt too much”.


I think this is a good idea; it would make it easier to prepare. As much as they made a good effort to let customers know and to stage the quota enforcement, there are simply too many customers who were not prepared and probably still don’t understand how the quota system works.

I completely agree that a CPU usage scale of 0-6.25% is very hard to plan for, or even comprehend. Most providers give you the usual 0-100% scale, which does not mean you have the whole CPU allocated; in Fly.io’s case it would mean that if you are using 6.25% you are shown as 100%, if you are using 3.125% you are shown as 50%, and so on.
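As a stopgap, the conversion is easy enough to do yourself; here’s a tiny sketch assuming the 6.25% per-vCPU baseline discussed above:

```python
# Convert the raw per-vCPU utilization shown on the dashboard into
# "percent of baseline", assuming a 6.25% per-vCPU baseline.
BASELINE_PCT = 6.25

def percent_of_baseline(raw_util_pct: float) -> float:
    return raw_util_pct / BASELINE_PCT * 100.0

print(percent_of_baseline(6.25))   # 100.0 -> using exactly your baseline
print(percent_of_baseline(3.125))  # 50.0
```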

I agree that being able to opt in to 100% quota enforcement for testing purposes would really help.

We found it difficult to prepare because none of the 25% / 50% / 75% limits impacted us. I was hoping the 1 hour at 100% would help, but it wasn’t long enough for us to actually run out of CPU credits, so we never hit the limit in that window, and it didn’t provide any useful data about what would happen to application performance once we actually got throttled.

Our application suffered a big outage yesterday, but I’m unclear to what extent that was because our instances were being throttled and to what extent it was related to the “Degraded API Performance” platform outage that happened around the same time. It was difficult to know which symptoms were the result of which cause.

It would also be really useful to get some clear communication about the timescale for going back to 100% enforcement, so we are not caught by surprise when it returns.


On Digital Ocean, if the CPU still shows 100% when you’re being throttled, how do you tell the difference between your app actually using 100% of the CPU and your app being throttled?

I believe (based on public docs) that Digital Ocean’s shared CPU is likely implemented through cgroups CPU Shares, which is exactly how shared Fly Machines were originally implemented. A droplet can utilize up to 100% of the CPU as long as the host is lightly loaded, but risks unpredictable future performance if/when a host fills up / experiences higher load and neighboring droplets compete for limited CPU cycles.

As @charsleysa mentioned earlier, the new CPU-quota system is more directly comparable to ‘burstable VM’ products like GCP Shared Core and AWS T3 instances (to this I would add Azure B-Series VMs). Looking at these comparable products, I don’t believe any of them normalize baseline CPU-utilization performance to 100% in the way you describe.
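For anyone curious about the mechanical difference between the two approaches, here’s a rough sketch using the generic cgroup v2 interfaces. This is general Linux behaviour with example paths, not a claim about how Fly’s hosts are actually configured: cpu.weight is a relative share that only bites when the host is contended, while cpu.max is a hard per-period quota that throttles regardless of how idle the host is.

```python
# Generic cgroup v2 illustration (requires root and an existing cgroup;
# the path is an example, not Fly's actual layout).
CGROUP = "/sys/fs/cgroup/example-machine"

def set_relative_share(weight: int = 100) -> None:
    # cpu.weight (1-10000, default 100): a proportional share that only
    # limits you when the host is busy; on an idle host you can still
    # burst to 100% of a CPU.
    with open(f"{CGROUP}/cpu.weight", "w") as f:
        f.write(str(weight))

def set_hard_quota(quota_ms: int = 40, period_ms: int = 80) -> None:
    # cpu.max ("<quota> <period>" in microseconds): 40ms per 80ms period
    # caps the group at 50% of one CPU even when the host is idle.
    with open(f"{CGROUP}/cpu.max", "w") as f:
        f.write(f"{quota_ms * 1000} {period_ms * 1000}")
```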