Predictable Processor Performance

Running a performance-2x machine would let you run a single thread at 100%. You shouldn’t have to worry about pinning it to a core.

We’ve definitely talked about that and will think about making it an option.

1 Like

A number of people have reached out because their machines are using very little CPU, but the charts show that they would be throttled. The cause of this bug is unexpected hypervisor overhead getting included in the stats we use for CPU accounting on a small number of machines.

We should have this fixed soon, but sticking to the current rollout schedule would give folks very little time to see how quotas will impact their machines once the fix is deployed. I’ve tentatively pushed the rollout schedule back by one week. I’ll post an update here once a fix is deployed.

6 Likes

Is it possible to load balance the instances based on CPU usage? I’ve an app that runs in two regions, but only one region is eating up all the quota while the other is way below the baseline.

You can’t use the CPU balance directly for load balancing, but you could set a lower soft_limit so that traffic is sent to the lower-load instance once an instance crosses that threshold.
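
For reference, here’s a minimal fly.toml sketch of where that knob lives. The values are placeholders for illustration, not recommendations; tune them against your app’s real capacity:

```toml
# fly.toml (excerpt) -- illustrative values only
[http_service]
  internal_port = 8080

  [http_service.concurrency]
    type = "requests"   # count in-flight requests per instance
    soft_limit = 20     # above this, the proxy prefers instances with more headroom
    hard_limit = 50     # above this, the instance receives no new traffic
```

With a lower soft_limit, once the busy instance crosses it, the proxy starts spilling traffic to the instance that still has headroom.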

Hey @r38y, I took a look at your Postgres machine and it looks like the increases in system steal are caused by increases in disk I/O operations. On Postgres apps, CPU usage due to disk I/O gets reported under system metrics rather than user metrics, since that work happens below the app level. If you check the instance’s disk metrics tab, you should see the number of read operations increasing in lockstep with the system CPU increase.
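
If you want to sanity-check that split from inside the machine itself (rather than from the dashboard), here’s a rough sketch that prints the kernel’s CPU-time breakdown from /proc/stat. This is the generic Linux view, not Fly’s exact metrics pipeline, so treat the mapping onto the dashboard’s user/system series as approximate:

```python
# Rough sketch: print the aggregate CPU-time split from /proc/stat (Linux only).
# Values are cumulative since boot; sample twice and diff to get a rate.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def cpu_split():
    with open("/proc/stat") as f:
        parts = f.readline().split()              # first line is the aggregate "cpu" row
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    total = sum(values)
    return {name: value / total for name, value in zip(FIELDS, values)}

if __name__ == "__main__":
    for name, fraction in cpu_split().items():
        print(f"{name:>8}: {fraction:7.2%}")
```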

Over the period on 10/14, the machine went from doing ~100 I/O ops/s up to a peak of ~6,000 ops/s before dropping off again. It looks like it would have first started bursting and eating into its quota at around ~2,500 ops/s.

If you expect the machine to operate at that load consistently, then increasing the size to avoid throttling might make sense. But if you expect load to be lower with periodic spikes, then staying at the small size with bursting behaviour might work for you, even if it gets throttled at times. FWIW, it looks like a shared-2x machine wouldn’t have needed to burst at all, even at the peak.

I have to agree with others that giving two days’ notice on this change is quite disruptive and disrespectful of our time.

While I understand that this is about making performance consistent and in line with what is committed under contract, the reality is that we’ve been scaling on this platform for over three years and, up until now, things have been operating fine. So why would we spend our time analysing machine performance to check whether you are over-provisioning CPU time to us?

Now we’ve received an email telling us that in two days’ time a critical piece of our infrastructure is changing its performance profile, and my engineers have to drop what they were doing to understand the impact of this and determine the appropriate provisioning to avoid a critical outage for our customers.

Not even Google would give us so little notice of such a change.

3 Likes

I’m trying to build a cost-effective worker/job system, similar to the Rails example shown here: Rails Background Jobs with Fly Machines · The Ruby Dispatch

My use case involves short-lived machines that perform a single CPU-intensive task and then shut down. To maximize my budget, I’m using shared-cpu-8x instances (I need as many cores/threads as I can get, but can’t afford performance-8x). I understand these instances are burstable, and I’m okay with some performance variability.

However, it seems like the new CPU credit system will always throttle my machines because they won’t have time to build up a positive credit balance. This defeats the purpose of using on-demand machines for short tasks, as I’ll effectively be paying for CPU power I can’t fully utilize.

Fly.io has always emphasized the ability to run machines only when needed. This change seems to contradict that by requiring a positive credit balance, essentially forcing me to keep machines running longer than necessary.

Could someone clarify how this new system impacts short-lived, CPU-intensive workloads? Are there any recommendations for optimizing cost-efficiency in this scenario?

1 Like

I don’t think this is the case. I took a quick look at our GPU machines, which start up on demand only when we need some AI inference done. I saw that the balance seemed to start at maximum.

You should double check as well and see if your machines have a full balance after starting.

If machines start with a full balance, then nothing should really change for you unless your short-lived tasks aren’t so short (over 5 minutes) and manage to use the entire balance.
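
To make that concrete, here’s a toy back-of-the-envelope model of how a burst-credit balance like this plays out for a short-lived task. The numbers are illustrative assumptions (a 1/16-of-a-core baseline per shared vCPU and a starting balance worth roughly 5 minutes of full-core burst), not Fly’s exact parameters:

```python
# Toy model of one shared vCPU's burst-credit balance.
# Both constants below are illustrative assumptions, not Fly's exact values.
BASELINE = 1 / 16           # fraction of a core accrued per second of wall time
START_BALANCE = 5 * 60.0    # a "full" starting balance, in core-seconds (~5 min of full-core burst)

def seconds_until_throttle(cpu_demand: float, balance: float = START_BALANCE) -> float:
    """How long a task demanding `cpu_demand` of a core (0.0-1.0) runs at full
    speed before the balance hits zero; inf if demand stays at or below baseline."""
    drain_rate = cpu_demand - BASELINE   # net core-seconds spent per wall-clock second
    return float("inf") if drain_rate <= 0 else balance / drain_rate

print(f"pegged vCPU: {seconds_until_throttle(1.0):.0f}s before throttling")   # ~320s
print(f"light task:  {seconds_until_throttle(0.05)}")                         # inf, never throttled
```

Under those assumptions, a machine that starts with a full balance and finishes its CPU-heavy work within roughly five minutes per vCPU never hits the throttle, which matches the short-lived-task case described above.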

Now that both the CPU accounting and the billing system are sophisticated, has Fly considered letting customers choose an option to bill for CPU time instead? This is how Cloudflare does it for Workers, and it seems more “Serverless” to me. In fact, it may end up saving money for some, but more importantly for most, it avoids what will eventually become a monthly/quarterly ritual of right-sizing machines to match RPS/QPS.

2 Likes

I’m looking forward to seeing the fix for that. I’ve got a machine with almost non-existent usage (NestJS Kafka consumer) that’s supposed to be throttled for 24h/d :sweat_smile:

We’ve talked about it and like the idea, but there are no concrete plans.

2 Likes

We’re considering lots of things about billing, but I want to make sure it’s clear that this is not a billing thing; it’s a user experience thing. The current affordance on the platform of unlimited burst outside of contention is bad. It creates an illusion of compute performance, and then pulls the rug out from under you the moment competing workloads are scheduled on the same hardware.

The change is being rolled out slowly over the course of a month and we’ve pushed the start of that process out by a week. We have flexibility in how we roll this out and are all ears about changing things up, but it’s important for us to communicate that the status quo is not good for users. We’ll give extra time, but during that time, most of our users are receiving a worse experience.

If you have an app that needs lots of extra time, and you’re seriously concerned that you’ll have performance issues in prod, let me know. There’s probably additional stuff we can do on a case-by-case basis.

You’re doing a worker/job system; can I ask what performance envelope your workers, doing async background jobs, actually demand from shared-cpu instances? You might just not have to care about this at all. I’m being repetitive about this point, but I think it bears repeating: we’re not looking for apps using too many resources and then penalizing them. You can redline your shared-1x indefinitely and get the full allocated performance for that CPU class.

2 Likes

Just in case I haven’t made this tediously clear enough: we won’t be penalizing apps for redlining their CPUs. I think “throttling” is just the wrong term here. You won’t ever get less than the CPU allocation you bought, and your instances will be able to burst to more than that allocation (which is not a capability we’ve previously built into our CPU classes). I’d worry about someone reading this comment and thinking there’s a possibility that some shared-1x Fly Machine could get 0/16ths of a CPU; nope!

3 Likes

Jumping in here as a happy Fly customer to say thanks to the team - we appreciated the detailed communication, the email listing apps with machines that might be throttled under the new rules (or new enforcement of existing rules), and the tools/guidance/time needed to adjust our usage. The process was easy and quick. We also very much appreciate the end goal: we have had at least two brown-outs related to massive unexpected spikes in CPU steal and are excited that you all are working to solve this.

I will say that for one database in particular we are now paying 6x more than we were previously. In the grand scheme of things this wasn’t a big hit and I do get why this is not technically a billing change, but it did catch us off guard. Previously when we saw sustained 50-60% CPU utilization we assumed that was of our shared slice, not the CPU as a whole. Turns out we were getting the resources of a performance-2x machine at the price of a shared-2x machine. Not a bad deal!

For customers like ourselves having to right-size machines, a simple way to summarize the original communication might be: “We were regularly giving you way more CPU than you paid for with the tradeoff that occasionally you (or someone else) would get less than what you paid for. Now we’re going to stop giving you free lunch in order to guarantee that you (and everyone else) get what you pay for.” (correct me if I’m wrong with this read)
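
For anyone else doing the same right-sizing math, here’s a tiny worked check under the assumption of a 1/16-of-a-core baseline per shared vCPU (an illustrative figure; verify against your own plan’s numbers):

```python
# Quick check: does sustained whole-core utilization fit a shared machine's baseline?
BASELINE_PER_SHARED_VCPU = 1 / 16   # assumed baseline fraction of a core per shared vCPU

def fits_shared_baseline(cores_used: float, shared_vcpus: int) -> bool:
    """cores_used is sustained utilization measured against whole cores, e.g. 0.5-0.6 above."""
    return cores_used <= shared_vcpus * BASELINE_PER_SHARED_VCPU

# Sustained ~0.55 cores against a shared-2x baseline of 2/16 = 0.125 cores doesn't fit,
# which is why this workload effectively needed a performance-class machine.
print(fits_shared_baseline(0.55, 2))   # False
print(fits_shared_baseline(0.10, 2))   # True
```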

9 Likes

That’s the right read, but I’m wincing that you’re paying 6x more. If you’d like (or anybody else finding themselves in a similar situation), reach out, we’ll see what we can do. Can’t say enough about how this is not an economic project for us.

1 Like

I appreciate the offer, but the reality is that we really do need performance-2x machines where we had previously provisioned shared-2x machines. We played around with more shared cores, since this is pg. That worked with all our other databases, but this one must have some index or query problems. Nothing you can fix. It’s tough because you’ve been giving us a whopping 80% discount, not getting credit for it, and are now getting blamed for asking us to pay for what we use :upside_down_face:.

4 Likes

I appreciate you saying that! But we need people to be comfortable running apps here, so if you’re in production and worried we’re about to magnify your bill in a way that you can’t easily metabolize[*], please do reach out; there’s a bunch of knobs we can turn!

[*] again, this mail went out to like 1% of all our customers

4 Likes

I just shipped some improvements to the CPU usage accounting that should help the small number of folks who were seeing their balance drain for no apparent reason. The problem isn’t 100% solved, but it’s drastically better. We’re still working on getting it fixed the rest of the way.

To get things back to a good state, I’ve reset quota balances across all machines to 1000x the baseline quota. Over the next few days, folks should be able to get a better sense of how quotas will impact their apps.

6 Likes

The change is being rolled out slowly over the course of a month and we’ve pushed the start of that process out by a week. We have flexibility in how we roll this out and are all ears about changing things up, but it’s important for us to communicate that the status quo is not good for users. We’ll give extra time, but during that time, most of our users are receiving a worse experience.

I appreciate why the change is being made and the nature of the change, but the point I’m trying to make is that any change on short notice is disruptive.

We have performance-tuned our containers to optimise cost for a market that is tight on funds (and tighter this year).

Our servers host fundraising websites for hundreds of charities. If a large portion of their traffic gets 500 errors or timeouts because the containers’ CPU profile has changed, it doesn’t really matter whether it’s 5% or 50% of their donors who see this; it’s unacceptable to our customers because it has a direct revenue impact for them.

So we are left with no choice but to reallocate engineering resources to an unplanned interruption to understand the impact of this change and how we need to adjust our infrastructure to prevent downtime.

I appreciate that the timelines have been pushed back, but it still only gives us one week to react before the first throttling. Based on your figures in the original post, the available CPU time will be cut by approximately two-thirds. That it was ever thought acceptable to go even 25% of the way toward that change at such short notice really makes me question how well you appreciate the impact this could have on your customers.

2 Likes

If this is the case, reach out directly and we’ll work through options. We have knobs we can turn. We’ll get you through this. The fleetwide change is non-optional; as you saw with the preceding comment, it’s causing problems for customers. But if we’re putting you in a bind, we’ll figure a way through it with you.

1 Like