Predictable Processor Performance

We’re rolling out changes that will make CPU performance more predictable on Fly Machines. For almost all our customers, these changes strictly improve the experience of running apps here.

Back in 2020, we introduced the notion of shared and performance CPU classes. Web applications have very bursty CPU utilization, and Linux is pretty good about scheduling; shared CPUs take advantage of that to oversubscribe CPUs, which drastically reduces the cost of Fly Machines that use them.

Under the hood, we’ve implemented this by allocating 1/16th of a core to each shared CPU and 10/16ths of a core to each performance CPU. We did this using cgroups and cpu.shares.
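
For the curious, here’s roughly what that looks like at the cgroup level. A minimal sketch, assuming cgroup v1 and the common convention that 1024 shares stands in for one core’s worth of weight; the paths and values are illustrative, not our production setup:

```python
# Illustrative only: assign relative CPU weights with cgroup v1 cpu.shares.
# Assumes the cpu controller is mounted at /sys/fs/cgroup/cpu and that
# 1024 shares stands in for "one core's worth" of weight.
import os

SHARES_PER_CORE = 1024

def set_cpu_shares(cgroup: str, cores: float) -> None:
    """Give a Machine's cgroup a relative CPU weight (only matters under contention)."""
    path = os.path.join("/sys/fs/cgroup/cpu", cgroup)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpu.shares"), "w") as f:
        f.write(str(int(cores * SHARES_PER_CORE)))

set_cpu_shares("machine-shared-1x", 1 / 16)         # 64 shares
set_cpu_shares("machine-performance-1x", 10 / 16)   # 640 shares
```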

cpu.shares is a relatively high-level control; it just advises the scheduler what the priority of different processes should be under contention. But when there isn’t contention (a not-uncommon situation), cpu.shares doesn’t have much to say about how much CPU your Fly Machine can consume. On a lightly loaded physical host, a shared Fly Machine can eat up a whole core.

This doesn’t bother us in principle! But, over the last several months, it has created a bad experience for users. That’s because lax scheduling works fine until actual contention emerges.

Here’s what happens: your shared-cpu-2x Fly Machine is chugging along, allocated two 1/16th slices of physical cores but in reality using something more like 20% of a core. All of a sudden, someone lights up a dozen performance-cpu-4x Fly Machines. Now the scheduler has decisions to make, and those 1/16th allocations have teeth. The worst part is that performance becomes unpredictable; it depends on who’s running what.

The Linux CFS scheduler gives us knobs to make CPU scheduling predictable. We’re rolling out scheduling code based on cpu.cfs_quota_us, which gives us control over CPU utilization at microsecond granularity. shared and performance Machines should now see consistent performance regardless of whether someone has randomly added or removed large loads from the server their Machine is running on.
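
Concretely, CFS bandwidth control pairs cpu.cfs_quota_us with cpu.cfs_period_us: the quota is how many microseconds of CPU time a group may use in each period. Here’s a sketch of the mechanism (cgroup v1, default 100ms period; the cgroup names are illustrative, not our actual layout):

```python
# Illustrative only: hard-cap a cgroup with CFS bandwidth control (cgroup v1).
# quota_us / period_us is the fraction of a core the group may use; -1 disables the cap.
import os

PERIOD_US = 100_000  # the kernel's default CFS period: 100ms

def set_cpu_quota(cgroup: str, cores: float) -> None:
    """Cap the cgroup at `cores` worth of one physical core."""
    path = os.path.join("/sys/fs/cgroup/cpu", cgroup)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpu.cfs_period_us"), "w") as f:
        f.write(str(PERIOD_US))
    with open(os.path.join(path, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(int(cores * PERIOD_US)))

set_cpu_quota("machine-shared-1x", 1 / 16)         # 6,250µs of CPU per 100ms period
set_cpu_quota("machine-performance-1x", 10 / 16)   # 62,500µs per 100ms period
```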

Because most workloads are bursty, strict CPU quotas don’t fit Fly Machines that well. To keep bursty workloads chugging along without you having to reconfigure anything, we implemented a userland burst quota system: a ~12 Hz process that dynamically adjusts quotas. In a typical bursty web app load, Fly Machines spend most of their time well below their CPU allocation; our dynamic quota adjuster banks credit for that time, allowing Machines to burst far past their allocation when requests come in (and also giving us better tooling for managing contention).
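
Here’s a minimal sketch of that idea, based only on the description above; the credit cap, burst limit, and helpers are made up, and the real adjuster does more than this:

```python
# A minimal sketch of a userland burst-quota loop (illustrative, not our actual
# adjuster). It runs at ~12Hz, banks credit while a Machine sits below its
# allocation, and spends it by raising the Machine's CFS quota to allow a burst.
import time

HZ = 12
TICK = 1.0 / HZ
PERIOD_US = 100_000

def read_cpu_ns(cgroup: str) -> int:
    """Total CPU time (ns) the cgroup has consumed, from the v1 cpuacct controller."""
    with open(f"/sys/fs/cgroup/cpuacct/{cgroup}/cpuacct.usage") as f:
        return int(f.read())

def write_quota(cgroup: str, cores: float) -> None:
    """Cap the cgroup at `cores` worth of CPU via cpu.cfs_quota_us."""
    with open(f"/sys/fs/cgroup/cpu/{cgroup}/cpu.cfs_quota_us", "w") as f:
        f.write(str(int(cores * PERIOD_US)))

def run(cgroup: str, allocation: float, max_credit_s: float, burst_cores: float) -> None:
    credit = 0.0                  # banked core-seconds of unused allocation
    last = read_cpu_ns(cgroup)
    while True:
        time.sleep(TICK)
        now = read_cpu_ns(cgroup)
        used = (now - last) / 1e9             # core-seconds spent this tick
        last = now
        credit += allocation * TICK - used    # earn while idle, spend while bursting
        credit = max(0.0, min(credit, max_credit_s))
        # With credit in the bank, allow bursting past the allocation;
        # otherwise hold the Machine to its baseline.
        write_quota(cgroup, burst_cores if credit > 0 else allocation)

# e.g. run("machine-shared-1x", allocation=1/16, max_credit_s=300.0, burst_cores=1.0)
```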

When we say this change won’t be noticeable to the overwhelming majority of our users, we mean it: the guts of this change have been rolled out for weeks, and we’ve been tracking quota usage. A tiny fraction of organizations on Fly.io, which have been benefiting from our lax scheduling, will lose some performance from this change. We’re reaching out to them directly.

We’re rolling out “enforcement” of quotas slowly, and we’ll keep communicating as we do so. The problem we’re trying to solve here isn’t economic; it’s a form of “noisy neighbors” that has been irritating customers for months. All that is to say, we’ve got a lot of flexibility in how this gets deployed, and we’ll use it, being watchful about impacting anyone’s performance expectations.

This is part of a package of things we’re doing to smooth out capacity and keep performance predictable (something our customers tell us over and over again is a priority). We’re also improving our orchestration to help make sure groups of related Machines don’t get stuck on the same physical hardware contending against each other, and improving our metrics dashboards to more reliably communicate utilization given how our scheduling actually works.

Questions welcome! This is going to be an ongoing project, and we’re happy to keep you in the loop while it happens.

Rollout Schedule (updated October 16)

All changes will be made around noon EDT. These changes require a human to click buttons, though, so the timing won’t be precise. We’ll post updates to this thread when changes take effect. This schedule is also likely to change; we’ll post updates here as it does.

  • ✅ October 15 - We’ll email organization admins whose apps would have been throttled for >50% of the prior 24h
  • ✅ October 24 - Quotas will be enabled at 25% for one hour
  • ✅ October 29 - Quotas will be enabled at 25%
  • ✅ October 31 - Quotas will be enabled at 50% for one hour
  • November 5 - Quotas will be enabled at 50%
  • November 7 - Quotas will be enabled at 75% for one hour
  • November 12 - Quotas will be enabled at 75%
  • November 14 - Quotas will be enabled at 100% for one hour
  • November 19 - Quotas will be enabled at 100%

Updates - October 15

To the folks that received the email: I’m realizing that there are some cases where the data might be misleading. In order to estimate how much time your machines would have been throttled if quotas were enabled, we allowed balances to go negative. This assumes that the machine would have eventually used the same amount of CPU time, even if it had been throttled.

In this example, the machine backed off for a long period of time. If the quotas had been enforced and the machine backed off like this, it would have accumulated a quota balance again. Because quotas were not enforced though, the balance had gone negative enough that the balance never re-accumulated.

If you have a machine that shows a low/zero balance despite not using significant CPU recently, you might not need to worry about scaling the machine. When we start rolling out CPU quota enforcement, balances will be reset and will no longer be able to go negative.
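
Here’s a simplified illustration of the difference: in the estimate, the balance can go negative, so a long back-off only starts rebuilding credit once the deficit is paid back; under enforcement, the balance is clamped at zero and rebuilds right away.

```python
# Toy accounting, just to illustrate the note above: the same usage trace under
# the "estimate" rule (balance may go negative) and the "enforcement" rule
# (balance clamped at zero). The numbers are made up.
def balance_trace(usage, allocation, clamp_at_zero):
    balance, out = 0.0, []
    for used in usage:
        balance += allocation - used
        if clamp_at_zero:
            balance = max(balance, 0.0)
        out.append(round(balance, 2))
    return out

trace = [1.0] * 5 + [0.0] * 5  # five ticks of heavy use, then five ticks idle
print(balance_trace(trace, 1 / 16, clamp_at_zero=False))  # stays deeply negative
print(balance_trace(trace, 1 / 16, clamp_at_zero=True))   # rebuilds credit while idle
```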

Updates - October 16

The rollout schedule above has been updated. We found a bug in our CPU accounting that is causing incorrect quota balances for a small number of machines. We’ll provide further updates once a fix has been deployed.

Updates - October 22

We’ve decided to give performance vCPUs a 100% CPU quota. This is more consistent with what folks expect when they hear “performance”, and it better reflects the pricing difference between shared and performance vCPUs.

Updates - October 24

Quotas were enabled at 25% from approximately 12:00 to 12:37 EDT. A spike in system load on several servers required us to end the one-hour test period early. We’ve identified and resolved the source of the unexpected load and will continue the rollout schedule as planned.

Updates - October 29 & 31

The rollout has been continuing on schedule.

Will there be a metric made available so we can see how many credits are available and how many are being used?

Will there be a way to turn off bursting? I think some users might prefer not to have bursting, and it would also be useful for measuring baseline performance.

Why was 10/16 chosen for performance cores?

Yep. There are some added stats documented here. I just fixed a broken image on that page, so it might not look right for a minute.

Not currently. You should be able to emulate that with cgroups inside the machine, though. I can test that out later to make sure it works.
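
If you want to experiment in the meantime, here’s an untested sketch (assuming a cgroup v2 guest; the cgroup name and PID are placeholders):

```python
# Untested sketch: inside the Machine, cap your own workload at the 1/16th-core
# baseline so it never relies on burst credit. Assumes a cgroup v2 guest; the
# cgroup name and PID below are placeholders.
import os

# Enable the cpu controller for child cgroups (it may already be enabled).
with open("/sys/fs/cgroup/cgroup.subtree_control", "w") as f:
    f.write("+cpu")

cg = "/sys/fs/cgroup/no-burst"
os.makedirs(cg, exist_ok=True)

# "6250 100000" = 6,250µs of CPU per 100,000µs period, i.e. 1/16th of a core.
with open(os.path.join(cg, "cpu.max"), "w") as f:
    f.write("6250 100000")

# Move your workload (PID 1234 as a stand-in) into the capped cgroup.
with open(os.path.join(cg, "cgroup.procs"), "w") as f:
    f.write("1234")
```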

To be 10x the performance of shared machines.

Update: Emails have gone out to organizations with machines that would have been throttled for >50% of the last 24 hours.

I got one of those and I don’t think I understand:

e82d92dc696758 - shared-1x - throttled for 24.0h

Meanwhile I got a balance of 8.33 min over the whole time span according to the Grafana chart. Can you provide me with enlightenment?

Are you sure you’re looking at the correct machine? On our internal dashboard, I see that machine’s quota balance as empty for the entirety of the past 24h.

Damn it, you’re right! I looked at the machine which starts with e28, not e82…

To the folks that received the email: I’m realizing that there are some cases where the data might be misleading. In order to estimate how much time your machines would have been throttled if quotas were enabled, we allowed balances to go negative. This assumes that the machine would have eventually used the same amount of CPU time, even if it had been throttled.

In this example, the machine backed off for a long period of time. If the quotas had been enforced and the machine backed off like this, it would have accumulated a quota balance again. Because quotas were not enforced though, the balance had gone negative enough that the balance never re-accumulated.

Let me know if that’s confusing and I can try to explain it further.

Hi, I got one of those emails for 4d89442b154587, and was a little surprised because it’s a little app. There is a spike in the last two days, but it looks like the user portion didn’t really change. Any idea what’s going on here? Thanks!

Hi, I got one of those emails as well. And I don’t understand how that’s possible.

Why is my app being throttled while the load average is way below 100%?

This is probably one of the cases I describe above in Predictable Processor Performance - #8 by btoews. This machine was probably working hard earlier and ended up with a negative balance. The balance got back to zero and then went positive around 10/11. I’m wondering if you’re looking at the wrong machine though. The machine in your screenshot wouldn’t have been throttled at all over the past 24h, so you shouldn’t have gotten an email about it.

I couldn’t really say why your machine was using increased CPU without knowing more about it. There are lots of things that the machine could be doing that would fall under system usage.

The machine is correct. It has only one instance. The graph above is from the last 2 days.
My concern is, when Fly.io sells a 2x shared CPU, I expect to see the graphs close to 100%, so I know the app is using too much memory/CPU and I need to allocate more.

But if you check the last 7 days (image below), that’s not the case, not even close. All graphs show usage below 30%. So why did I get this email? Are the graphs wrong?

The graphs show how much of a physical CPU core is being used. Each shared vCPU is allocated 1/16th of a physical core. This is indicated by the “baseline” of 6.25%.
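
If it helps, a rough way to read those graphs against your allocation (just the arithmetic, not an official dashboard feature): divide the per-vCPU usage shown by the 6.25% baseline.

```python
# Rough rule of thumb: convert the dashboard's "fraction of a physical core"
# reading into "multiple of a shared vCPU's allocation". Illustrative only.
BASELINE = 1 / 16  # each shared vCPU's allocation: 6.25% of a core

def vs_allocation(core_fraction: float) -> float:
    return core_fraction / BASELINE

print(vs_allocation(0.0625))  # 1.0  -> exactly at the allocation
print(vs_allocation(0.30))    # 4.8  -> bursting ~4.8x past the allocation
print(vs_allocation(0.02))    # 0.32 -> well under the allocation
```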

OK, the only thing I can think of is that about 12 hours ago I ran a couple more backups than normal, but I run one every six hours and the resulting dump is 12 MB. Could whatever your scanning was doing be the cause?

Do you have a recommended way of looking at the system part of utilization? It really is a base PG image following the instructions from about nine months ago. Besides the backups, this is the only thing using that database; the spikes are me publishing an event and Oban sending emails. I don’t mind paying for more resources if I use them, I just don’t understand that graph given my usage pattern.

@btoews I understand now. But that’s a very bad experience. How can we trust these graphs?

Why are the metrics not related to the amount of CPU we got (1/16th)? Looking at them, I would assume that the resources I have are fine and that, if there’s any issue, I should look at my application. Which is not the case here. I had no idea my app had been throttled.

Now, the email said that I should increase the CPU for this application, but I have no idea how much my app needs when I look at those graphs.

I’m not super knowledgeable about Postgres, but I’ll try to rope in someone who is. It doesn’t surprise me that a database would show a lot of system CPU usage.

It seems like you’re getting 20% less performance for half the price with shared 8x vs performance 1x, with everything else equal. Is this accurate?

I agree that the charts can be confusing. That’s why we’re communicating these changes ahead of time, contacting users, and rolling the enforcement of CPU quotas out slowly. Your app has not been throttled yet. We show the amount of actual CPU usage instead of scaling it to the machine’s CPU allocation because we allow bursting.

It looks like you were contacted about a shared-1x machine that is consistently using 100% CPU. If you want to run one vCPU at 100%, you could scale it to a performance-2x, which has a total of 20/16 CPU quota.

That sounds right.
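
For anyone following along, here’s the arithmetic behind that comparison, using the per-vCPU allocations described in the original post (1/16th of a core per shared vCPU, 10/16ths per performance vCPU):

```python
# Sanity-check of the comparison above, using the allocations from the post.
shared_8x = 8 * (1 / 16)   # 0.5 of a physical core
performance_1x = 10 / 16   # 0.625 of a physical core
print(round(1 - shared_8x / performance_1x, 3))  # 0.2 -> ~20% less steady-state CPU
```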