Predictable Processor Performance

We received the email for several of our worker processes (e.g. 9080e333b12058) and resized them. All of their utilization is consistently below the baseline, but our balance is still decreasing. As best I can tell, steal time is counting against our balance, but the description makes it sound like it’s time our process is blocked waiting for a core.

Is that expected?

Thanks!

That is not expected! CPU steal shouldn’t be counting against you. I’ll dig into what’s going on here.

I’ve looked at my app’s usage and I think there is a major flaw with the methodology you are switching to.

The issue is that if I’m paying for utilization on one machine but not utilizing it, the unused capacity is not credited toward the machines I am over-utilizing. This wouldn’t be an issue if I had built out the app from the start to meet the new resource limits, but having the limits changed without grandfathering means the amortization of resource costs changes completely, and not in my favor. Don’t get me wrong, I think it’s reasonable to enforce stricter resource limits, but I am not satisfied with this rollout strategy.

For example, I have one process in particular that utilizes a lot of resources, but several that utilize practically nothing other than memory. Basically, I am drastically overpaying for the resources I am not using on most machines, but drastically underpaying for the resources I use on that one process. If I had had resource limits from the start on the over-utilizing machine, my architecture would be different to account for them, and things would probably cost about the same and usage would be about the same (perhaps a bit worse, but not by a huge amount). However, I didn’t have those limits when I built out the system, so it isn’t possible to easily rearrange things to account for them now, despite the fact that overall I am not abusing resources dramatically. As it is, my machine costs will just about double, even though I will have all those under-utilized machines sitting around doing nothing.

A complication in my case is that this project is a side project and I don’t have much time to dedicate to it. I think Fly in general caters to people like me, so asking me to completely change my architecture in a few weeks or pay double is a real slap in the face I don’t appreciate. As you mention, most customers won’t be affected, but it isn’t my fault your resources weren’t set up in a way that prohibited over-utilization when I signed up. I had no idea I was over-utilizing as much as I was to begin with. Cases like mine, where I wasn’t actively going out of my way to over-utilize all my machines and my over-utilization on one process is just a consequence of my project’s business model, should have a better option for proceeding than what you are providing here.

2 Likes

Also I think this is another bug if I understand the graph correctly:

If “baseline” is my allotted resources, and “balance” is the resource bank, then I think the balance should be 0, not a constant 5 ms.

Thanks for the feedback. I’ll pass that along.

As for the balance metric, 5ms is the baseline quota per 80ms period and is the minimum value for that metric.
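To make the accounting concrete, here’s a minimal sketch of how a 5 ms per 80 ms baseline with a banked balance could behave. The accrual rule, the cap, and all names here are my assumptions for illustration, not Fly’s actual implementation:

```python
# Hypothetical baseline/balance accounting: each 80 ms period a machine
# earns 5 ms of CPU time (1/16 of a core). Unused time banks into the
# balance; usage above the baseline draws the balance down. The cap and
# the 5 ms floor are assumptions matching the description above.

PERIOD_MS = 80
BASELINE_MS = 5           # 1/16 of a core per 80 ms period
BALANCE_CAP_MS = 50_000   # assumed cap on banked credit

def step(balance_ms: float, used_ms: float) -> float:
    """Advance one 80 ms period: earn the baseline, spend actual usage."""
    balance_ms += BASELINE_MS - used_ms
    # Clamp to [baseline, cap]; the floor is why the metric never
    # reads below a constant 5 ms.
    return max(BASELINE_MS, min(balance_ms, BALANCE_CAP_MS))

balance = BASELINE_MS
for used in [2, 2, 2, 20, 20, 2]:  # ms of CPU consumed in each period
    balance = step(balance, used)
    print(f"used={used:>2} ms -> balance={balance} ms")
```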

1 Like

Would it be possible to document this chart so it is easier to understand?

You can find some documentation here.

2 Likes

I read all the threads but am still confused about my instance graph.

It looks like the CPU utilization is always under the baseline, but its balance doesn’t increase at all. And I received an email that this machine is throttled for 24h.

I don’t want to step on Ben’s toes here but I do want to make sure we’re clear: this isn’t an economic issue. The status quo, where we essentially weren’t enforcing quotas and some users found utility in exploiting that gap, is not sustainable; we can’t grandfather it. That’s because lax scheduling is actively causing problems for other customers. More customers are harmed by the current scheduling behavior than are impacted in any way by the proposed change.

We don’t want anybody to have to pay more, but if you’re committed to consistently getting 50% of a core’s utilization out of a shared-1x, that’s a use case we probably can’t maintain that much longer.

Let me know if I’ve misconstrued anything you’ve said! Mostly, just writing here to keep clarity about our intentions.

We do have flexibility in how we roll this out, so if there are things we can do there, we’re all ears.

I don’t see this fancy graph on my Grafana Dashboard. How can I access/configure it?

You’re right to be confused. It seems like your machine should have an increasing balance. We’re trying to track down a bug with the CPU accounting that seems to be affecting a handful of apps/machines.

1 Like

It’s only on the per-instance dashboard.

1 Like

I’m not sure if I’m reading you right, but it seems like you’re implying that I and other people who are over-utilizing did it on purpose to “exploit” the system.

Firstly, that couldn’t be further from the truth.

There was no clear indication I was over-utilizing at all, which is why I was. I assumed the amount of processing I was allocated was the amount I was receiving, since that is how most VPSes work. That graph has no documentation describing what its parts mean, and there were no warnings indicating I was over-utilizing.

I bet the vast majority of users in my situation are overusing resources because you guys didn’t communicate that to us at all.

Secondly, the market for hosting depends on overselling resources. It is simply how the business works: most users do not use most of their resources, and a few users use a lot of them. If you credited us for the resources we don’t use on any given machine, which is what would be “fair,” you’d make a tiny fraction of the money you do. So if I understand what you are implying correctly, why do you think it’s appropriate to keep only under-utilizing users, so that you get only the profitable side of the usual hosting arrangement? Holding up some arbitrary standard that says the people who actually use their resources are “exploiting” anything is a really selfish way of looking at things.

You can spin it as “it’s good for the customers,” but really you could always deploy more metal to fix this if it were economically possible. It is not economically possible, which is reasonable, but you should not blame this on me or any of your customers. It was your (Fly’s) fault for getting into this situation. Acting vindictively toward customers is not a good stance.

Thirdly, I did not ask for grandfathering; I asked for another, better solution for customers like me. I hope you have nothing to do with that solution, because you seem like you do not have a good attitude and are assuming too many things about people.

4 Likes

Can you explain the sentence

Quotas will be enabled at 25%

I can’t quite grok what the effect will be when this happens.

Sorry, I’m in engineer-speak mode, and mean “exploit” only in a hyperliteral sense. Also, I’m a vulnerability researcher by background, so “exploit” is a compliment. But no, I do not think you were doing anything untoward with your application! We sympathize with the position you’re in, which is why it took us months to work out the comms plan here, and why we ran quotas in advisory-only monitoring mode for weeks before presenting this.

This really is more of an engineering problem than an economic one. Yes, we can add more metal. But the lag time on getting new metal provisioned runs into plural days.[*]

Recall that the problem we’re addressing is (mostly) people with shared- Machines that have been taking outsized resources under lax quota enforcement suddenly finding their apps adversarially throttled by higher-priority performance- instances (or, for that matter, by large numbers of new shared- instances) on the same physical server. That’s a bad experience, and it can come up quickly.

[*] this being the engineering problem we have to solve before we even talk about the relationship between our margins and our pricing.

1 Like

Let’s use a shared-1x machine as an example. It’s currently permitted to run 100% of the time. With quotas fully enabled, it will be able to run 6.25% of the time. When I say 25% enabled, I mean closing that gap by 25%. So it will be limited to running ~77% of the time.
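For concreteness, here’s that interpolation worked out (a small sketch of the arithmetic as described above, not the actual enforcement code):

```python
# "Quotas enabled at N%" = close the gap between the current effective
# limit (100%) and the fully enforced baseline (6.25%) by N%.
current, target = 100.0, 6.25
for enabled in (0.25, 0.50, 1.00):
    limit = current - enabled * (current - target)
    print(f"{enabled:.0%} enabled -> can run {limit:.2f}% of the time")
# 25% enabled -> 76.56% (the ~77% above); 100% enabled -> 6.25%
```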

1 Like

Kinda, keeping in mind that the shared-8x is 8 cores and performance-1x is 1 core.

Multi-core applications will suffer with only 1 core on performance-1x, but on the other hand, if those multi-core applications sustain more than 6.25% CPU utilization per core, they’ll quickly be throttled on shared-8x.

The difference really is how long you can maintain a sustained load. Shared is limited to 6.25% while performance is limited to 62.5%.

The credit system is similar to what’s used in other clouds, like GCP shared-core instances or AWS T3 instances. It allows exceeding the load limits in time-limited bursts, paid for with credit accrued while your load is under the limit.
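To illustrate the burst-credit idea, here’s a minimal token-bucket sketch in the spirit of T3/shared-core bursting. The cap, time step, and names are illustrative assumptions, not Fly’s actual parameters:

```python
# Token-bucket burst credits: accrue at the baseline rate, spend on demand,
# and fall back to the baseline once the bank is empty. Parameters are
# assumptions for illustration only.

BASELINE = 0.0625    # core-fraction earned per second (1/16 of a core)
CREDIT_CAP = 2.0     # assumed cap on banked credit, in core-seconds

def simulate(loads, dt=1.0):
    """loads: requested core-fraction per dt-second slice; returns granted."""
    credit = CREDIT_CAP
    granted = []
    for load in loads:
        credit = min(credit + BASELINE * dt, CREDIT_CAP)  # accrue
        got = min(load * dt, credit)   # burst while the bank lasts
        credit -= got
        granted.append(got / dt)
    return granted

# A sustained 50%-of-a-core demand bursts until the bank drains, then
# settles at the 6.25% baseline:
print(simulate([0.5] * 8))
# [0.5, 0.5, 0.5, 0.5, 0.25, 0.0625, 0.0625, 0.0625]
```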

5 Likes

You seriously need to work on your communication. First of all, announcing changes two days in advance? You realize many of us are using Fly for a hobby site, meaning we have other things to think about most of the time. Other companies make this kind of announcement weeks or months in advance. Two days is unreal.

When I read the email, I had no idea what this was about. “Your machines are underprovisioned and will be throttled.” Really, where do I see that? No mention of any tool that would show it. What does throttled even mean? The only piece of valuable info is a link to this thread, and wow, it’s like if I’m not an engineer at Fly, I have no chance of understanding the intro.

Here is something that would have made more sense: “Hi, please look at this graph for your machines; it shows that you’re using more than the % of CPU you’re supposed to use on the free plan. If you plan to keep it, please know that <explain in plain English what will happen: site down, slow site, I don’t know>. If you want your site to be up at all times, please consider upgrading to the <plan name here> plan. This will ensure that your site will <do what? work better, with no throttling?>.”

Thank you for your understanding.

5 Likes

Hey! I take your point about the communications. You’re seeing us work this out in real time as we go, and we will certainly improve the emails we send out. Thanks for calling it out; it’ll help us do better.

To be clear(er than maybe the email was): we’re not asking you to do anything different. You’re not doing anything wrong.

What’s happening here is: you’re using a machine class whose documented performance envelope is 1/16th of a CPU core, and, because we have not actively enforced that allocation, your app has (potentially) been claiming significantly more CPU time than that. You don’t have to do anything about this.

However: our resource scheduling is, over the next quarter, going to get more predictable and closer to our documentation. You received that mail because you’re part of a small fraction of our customers who might see different performance as a result of the change. Most applications here aren’t going to be sensitive to whether shared-1x has unlimited burst to a full core! Yours might be, and we think you deserve a heads up about that.

If you want to tell us more about your application, we can talk more about whether the change matters, and whether you might want to change the allocation you’re using.

Regarding this, will the utilization limits be tracked on a per-core or per-instance basis?

If it is per instance, it stands to reason that, since burst utilization will be allowed, a shared-8x instance could sustain a single-core workload at 50% of a core if neighbors allow. In effect, it would have 20% less sustained single-core performance than a performance-1x, even though you have access to 8 virtual cores, with the main limitation being low priority compared to neighbors.
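Assuming the per-instance reading (which hasn’t been confirmed here), the arithmetic checks out against the limits quoted earlier in the thread:

```python
# Checking the numbers above, assuming the quota pools across all of an
# instance's vCPUs (an assumption, not confirmed in this thread).
shared_baseline = 0.0625                    # 1/16 of a core per vCPU
shared_8x_pool = 8 * shared_baseline        # 0.5 -> 50% of one core
performance_1x = 0.625                      # per the reply above

print(1 - shared_8x_pool / performance_1x)  # 0.2 -> "20% less"
```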

1 Like