We’re rolling out changes that will make CPU performance more predictable on Fly Machines. For almost all our customers, these changes strictly improve the experience of running apps here.
Back in 2020, we introduced the notion of `shared` and `performance` CPU classes. Web applications have very bursty CPU utilization, and Linux is pretty good about scheduling; `shared` CPUs take advantage of that to oversubscribe CPUs, which drastically reduces the cost of Fly Machines that use them.
Under the hood, we’ve implemented this by allocating 1/16th of a core to `shared` CPUs, and 10/16ths to `performance` CPUs. We did this using cgroups and `cpu.shares`.
`cpu.shares` is a relatively high-level control; it just advises the scheduler what the priority of different processes should be under contention. But when there isn’t contention (a not-uncommon situation), `cpu.shares` doesn’t have a lot to say about how much CPU your Fly Machine is consuming. On a lightly loaded physical host, a `shared` Fly Machine can eat up a whole core.
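For the curious, here’s roughly what that arrangement looks like at the cgroup level. This is a simplified sketch assuming cgroup v1 and the conventional 1024-shares-per-core weighting; the paths and group names are placeholders, not what actually runs on our hosts:

```go
// cpushares.go: a minimal sketch of weighting a Machine's cgroup with
// cpu.shares (cgroup v1). Paths, group names, and the 1024-shares-per-core
// convention are illustrative assumptions, not Fly.io's actual values.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const sharesPerCore = 1024 // cgroup v1 default weight for one CPU's worth

// setShares gives a cgroup a relative weight of `sixteenths` 1/16th-core slices.
// Under contention the scheduler divides CPU time in proportion to these
// weights; with no contention the group can still use a whole core.
func setShares(cgroupPath string, sixteenths int) error {
	shares := sharesPerCore * sixteenths / 16
	f := filepath.Join(cgroupPath, "cpu.shares")
	return os.WriteFile(f, []byte(fmt.Sprintf("%d", shares)), 0644)
}

func main() {
	// Hypothetical cgroups for a shared (1/16 core) and a performance
	// (10/16 core) vCPU.
	if err := setShares("/sys/fs/cgroup/cpu/machine-shared", 1); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	if err := setShares("/sys/fs/cgroup/cpu/machine-performance", 10); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```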
This doesn’t bother us in principle! But, over the last several months, it has created a bad experience for users. That’s because lax scheduling works fine until actual contention emerges.
Here’s what happens: your `shared-cpu-2x` Fly Machine is chugging along, allocated two 1/16th slices of physical cores but in reality using something more like 20%. All of a sudden, someone lights up a dozen `performance-cpu-4x` Fly Machines. Now the scheduler has decisions to make, and those 1/16th allocations have teeth. The worst part is that performance becomes unpredictable; it depends on who’s running what.
The Linux CFS scheduler gives us knobs to make CPU scheduling predictable. We’re rolling out scheduling code based on `cpu.cfs_quota_us`, which gives us control over CPU utilization at microsecond granularity. `shared` and `performance` Machines should now see consistent performance regardless of whether someone has randomly added or removed large loads from the server their Machine is running on.
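Here’s the same kind of sketch for the CFS bandwidth controller. Again, the 100ms period and the paths are illustrative assumptions, not our production configuration; the point is that the cap holds whether or not the host is busy:

```go
// cfsquota.go: a minimal sketch of a hard cap with the CFS bandwidth
// controller (cgroup v1). The paths and the 100ms period are assumptions;
// Fly.io's actual setup may differ.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// capCgroup limits a cgroup to `cores` worth of CPU per period. Unlike
// cpu.shares, this is enforced even on an idle host: once the group burns
// its quota within a period, it's throttled until the next period starts.
func capCgroup(cgroupPath string, cores float64) error {
	const periodUS = 100000 // 100ms, the kernel default
	quotaUS := int(cores * periodUS)

	if err := os.WriteFile(filepath.Join(cgroupPath, "cpu.cfs_period_us"),
		[]byte(fmt.Sprintf("%d", periodUS)), 0644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cgroupPath, "cpu.cfs_quota_us"),
		[]byte(fmt.Sprintf("%d", quotaUS)), 0644)
}

func main() {
	// e.g. hold a shared-cpu-1x Machine to 1/16th of a core:
	// quota = 100000µs * 1/16 = 6250µs of CPU time per 100ms period.
	if err := capCgroup("/sys/fs/cgroup/cpu/machine-shared", 1.0/16); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```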
Because most workloads are bursty, strict CPU quotas don’t fit Fly Machines that well. To keep bursty workloads chugging along without making you reconfigure anything, we implemented a userland burst quota system: a ~12Hz process dynamically adjusts quotas. In a typical bursty web app load, Fly Machines spend most of their time well below their CPU allocation; our dynamic quota adjuster banks credit for that time, allowing Machines to burst far past their allocation when requests come in (and also giving us better tooling for managing contention).
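To make that concrete, here’s a toy version of such a credit loop. The tick interval, credit cap, and burst ceiling are placeholder numbers, and the real adjuster is more involved, but the shape is the same: run under your allocation and you bank credit; spend the credit and you drop back to your baseline.

```go
// burst.go: a toy sketch of a userland burst-credit loop. The names, the
// ~12Hz tick, the credit cap, and the "burst to a full core" ceiling are
// illustrative assumptions, not Fly.io's actual implementation.
package main

import (
	"fmt"
	"time"
)

type machine struct {
	baselineCores float64       // e.g. 1.0/16 for a shared-cpu-1x vCPU
	credit        time.Duration // CPU time banked by running under baseline
	creditCap     time.Duration // how much burst a Machine is allowed to bank
}

// tick runs ~12 times per second. used is the CPU time the Machine consumed
// since the last tick (e.g. read from cpuacct.usage). It returns the quota
// to apply for the next interval, as a fraction of one core.
func (m *machine) tick(interval, used time.Duration) float64 {
	baseline := time.Duration(float64(interval) * m.baselineCores)
	m.credit += baseline - used // under baseline banks credit; over spends it
	if m.credit > m.creditCap {
		m.credit = m.creditCap
	}
	if m.credit < 0 {
		m.credit = 0
	}
	if m.credit > 0 {
		return 1.0 // credit in the bank: let the Machine burst to a full core
	}
	return m.baselineCores // credit exhausted: hold it to its allocation
}

func main() {
	m := &machine{baselineCores: 1.0 / 16, creditCap: 5 * time.Second}
	interval := 83 * time.Millisecond // ~12Hz

	// Idle for about five seconds, then a burst of requests pins the CPU.
	for i := 0; i < 60; i++ {
		m.tick(interval, 0)
	}
	fmt.Printf("quota during burst: %.2f cores, credit left: %s\n",
		m.tick(interval, interval), m.credit)
}
```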
When we say this change won’t be noticeable to the overwhelming majority of our users, we mean it: the guts of this change have been rolled out for weeks, and we’ve been tracking quota usage. A tiny fraction of organizations on Fly.io, which have been benefiting from our lax scheduling, will lose some performance from this change. We’re reaching out to them directly.
We’re rolling out “enforcement” of quotas slowly, and we’ll keep communicating as we do so. The problem we’re trying to solve here isn’t economic; it’s a form of “noisy neighbors” that has been irritating customers for months. All that is to say, we’ve got a lot of flexibility in how this gets deployed, and we’ll use it, being watchful about impacting anyone’s performance expectations.
This is part of a package of things we’re doing to smooth out capacity and keep performance predictable (something our customers tell us over and over again is a priority). We’re also improving our orchestration to help make sure groups of related Machines don’t get stuck on the same physical hardware contending against each other, and improving our metrics dashboards to more reliably communicate utilization given how our scheduling actually works.
Questions welcome! This is going to be an ongoing project, and we’re happy to keep you in the loop while it happens.
Rollout Schedule (updated October 16)
All changes will be made around noon EDT. These changes require a human to click buttons, though, so the timing won’t be precise. We’ll post updates to this thread when changes take effect. This schedule is also likely to change, and we’ll post updates here.
- October 15 - We’ll email organization admins whose apps would have been throttled for >50% of the prior 24h
- October 24 - Quotas will be enabled at 25% for one hour
- October 29 - Quotas will be enabled at 25%
- October 31 - Quotas will be enabled at 50% for one hour
- November 5 - Quotas will be enabled at 50%
- November 7 - Quotas will be enabled at 75% for one hour
- November 12 - Quotas will be enabled at 75%
- November 14 - Quotas will be enabled at 100% for one hour
- November 19 - Quotas will be enabled at 100%
Updates - October 15
To the folks who received the email: I’m realizing that there are some cases where the data might be misleading. In order to estimate how much time your machines would have been throttled if quotas were enabled, we allowed balances to go negative. This assumes the machine would have eventually used the same amount of CPU time, even if it had been throttled.
In one example we looked at, the machine backed off for a long period of time. If quotas had been enforced and the machine had backed off like this, it would have accumulated a quota balance again. Because quotas were not enforced, though, the balance had gone negative enough that it never re-accumulated.
If you have a machine that shows a low/zero balance despite not using significant CPU recently, you might not need to worry about scaling the machine. When we start rolling out CPU quota enforcement, balances will be reset and will no longer be able to go negative.
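If it helps, here’s a toy illustration of why the estimate can overstate things. The numbers are made up; the point is just that a balance allowed to go negative can stay pinned low long after a Machine quiets down, while an enforced balance (floored at zero) would have recovered during the back-off:

```go
// balance_estimate.go: a toy comparison of the estimation-mode balance we
// used for the email (allowed to go negative) versus what an enforced
// balance would have done (floored at zero). The numbers are made up.
package main

import "fmt"

func main() {
	// Net change in banked CPU time per interval: negative means the Machine
	// ran over its allocation, positive means it ran under. A short heavy
	// spike, followed by a long quiet back-off.
	deltas := []int{-20, -20, -20, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}

	estimated, enforced := 0, 0
	for _, d := range deltas {
		estimated += d // estimation: the balance may go, and stay, negative
		enforced += d
		if enforced < 0 {
			enforced = 0 // enforcement: the Machine is throttled instead
		}
	}
	fmt.Println("estimated balance:", estimated) // -50: looks permanently depleted
	fmt.Println("enforced balance: ", enforced)  // 10: recovered during the back-off
}
```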
Updates - October 16
The rollout schedule above has been updated. We found a bug in our CPU accounting that is causing incorrect quota balances for a small number of machines. We’ll provide further updates once a fix has been deployed.
Updates - October 22
We’ve decided to give `performance` vCPUs a 100% CPU quota. This is more consistent with what folks expect when they hear “performance”, and it’s more consistent with the pricing difference between `shared` and `performance` vCPUs.
Updates - October 24
Quotas were enabled at 25% from approximately 12:00 to 12:37 EST. A spike in system load on several servers required us to end the one-hour test period early. We’ve identified and resolved the source of the unexpected load and will be continuing the rollout schedule as planned.
Updates - October 29 & 31
The rollout has been continuing on schedule.