Autoscaling on CPU utilization?

I’m looking at using Fly, but my workload’s scaling criteria need to be based on utilization, not requests. Is the only way to do this to set up a service that manually calls flyctl? Seems really weird.


What’s your use case?

I have an audio encoding service running on Fly, which is a CPU-heavy task. It scales down to zero, and it autoscales to encode multiple audio files concurrently (up to a point).

I don’t really scale based on CPU utilization, but rather on the number of concurrent encodings.

Would that work for you? I can give you an overview of the setup.

@pier Not OP, but I’m also thinking through autoscaling. How do you track the number of concurrent encodings across your cluster? (I’m assuming Redis or Postgres).

Do you just have your process elect to exit when idle + cluster-wide encodings is below a certain threshold? How do you avoid all of your machines shutting down?

In my case, I need to ensure I always have one machine up in my cluster, because part (but not all) of the workload comes in through email. I also have some parts of the application that are fired off via webhooks (specifically, from Square, when an order is placed).

I could have the proxy manage starting machines based on this webhook traffic; however, the tricky part is that this kicks off a minutes-long job, which I run asynchronously, so from the Fly proxy’s perspective that machine is no longer serving a request. It’d be nice to be able to signal to the proxy that a machine is at capacity, and I haven’t figured out how to do that yet.

So, how my setup works: I have an app with multiple machines, created beforehand, that can handle peak demand. Asleep machines are very cheap (I can’t find the pricing now), so it’s OK to just manually create as many as you think you’ll need.

In the future I will create machines on demand but I haven’t had the time to play with the machines API yet. And this setup is working great so far.

I explained a scale to zero approach in this post:

I also have another app, let’s call it the job worker, which is permanently awake and orchestrates the whole thing. It responds to a trigger on the Jobs table in Postgres via listen/notify. It could use some pub/sub service too.

So the worker sends the HTTP requests to the encoding app to trigger the audio encoding. Basically, Fly’s routing layer wakes the machines as needed when new requests are made to the encoding app. With the autoscaling settings (see the post), Fly will try to use a single request per machine as long as there are available machines to wake up.
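A minimal sketch of that kind of job worker in Node, assuming the `pg` client library, a notify channel, and an encoder URL — all of those names are illustrative, not from the actual setup described above:

```javascript
// Sketch of a permanently-awake job worker that reacts to Postgres
// LISTEN/NOTIFY and forwards work to the encoding app over HTTP.
// The channel name, payload shape, and encoder URL are assumptions.
const ENCODER_URL = 'http://encoder.internal:8080/encode';

// Build the HTTP request the worker would send for one notification payload.
function buildEncodeRequest(payload) {
  const job = JSON.parse(payload); // e.g. '{"id": 42, "file": "a.mp3"}'
  return {
    url: `${ENCODER_URL}/${job.id}`,
    options: { method: 'POST', body: JSON.stringify(job) },
  };
}

async function main() {
  // `pg` is required lazily so the helper above works without it installed.
  const { Client } = require('pg');
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await client.query('LISTEN jobs_channel');
  client.on('notification', async (msg) => {
    const { url, options } = buildEncodeRequest(msg.payload);
    await fetch(url, options); // Fly's routing layer wakes an encoder machine
  });
}

// Only connect when a database is actually configured.
if (process.env.DATABASE_URL) main();
```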

One thing to take into consideration is that requests will time out after 60 seconds if no bytes go through Fly’s routing layer. So either complete the job before 60s, stream some bytes so that Fly doesn’t time out, or just respond to the request ASAP and then do the CPU job.

I don’t really track it, but if I needed to do it I could just do something like this:

select * from "AudioEncoding" where "state" = 'ENCODING'

Like I said, all the encoding machines shut down eventually. The job worker that orchestrates all this is always awake.

Like I said, to solve this you could stream some bytes before ending the response. Streaming is a bit of a headache though because you can’t change the HTTP status and there are other considerations.

If you don’t have control over the app making the request, you could ask them to give you a webhook URL you could call when the long running task is complete.


Just a standard client-facing server that has different workloads depending on the day. When users wake up they make very expensive requests whose results they’ll use throughout the day. During the rest of the day they are just reading the result of the computation over and over, but some people are still rerunning the expensive computation. This is easy on AWS with CPU-based autoscaling.

I could have web and worker processes; web would receive webhooks and also check email for incoming work, and dispatch jobs to workers via an HTTPS URL that blocks until the job is done. That’d make Fly autoscale the worker clusters, so it’d work reasonably well.

However, it means that I’d need to manually scale the web cluster if it ever got too busy - which is ok, because it wouldn’t be doing anything too expensive, except holding (potentially) lots of open connections to workers. I don’t know how to do zero downtime deploys with this setup either, but there’s probably a way.

I’d really prefer a single role of self-configuring server, though. It makes the architecture cleaner, and also makes it easier to work in the development environment. For that to work I’d need two things:

a) a way to set minimum size for a cluster (this seems like an easy thing that could be added in the processes block); I cannot scale below 1, because as mentioned upthread I need to check email regularly.

b) a way to trigger scaling that’s not related to networking. I looked at the docs for the webhook host; they will close the connection after 10 seconds and then start incrementally backing off and retrying, so I can’t keep the connection open. They also don’t accept streaming bytes. I think the ability to set autoscaling via cpu utilization is a missing feature.

b) a way to trigger scaling that’s not related to networking.

Yeah, this is what it all comes down to. One network request could require 1s of processing and another 20ms. Expecting users to profile their code to find some average requests-per-second figure for scaling doesn’t make sense. Also, if you end up on a crowded server, even if Fly is automatically trying to keep machines balanced, it will vary. All the big cloud providers (AWS, Azure) use CPU utilization to autoscale.

Would love to hear a representative chime in. This is the biggest blocker for adoption for me.

Your app will receive a signal to shut down, and you can do stuff before shutting down, like finishing a job and not accepting any new jobs on that particular VM.

In Node this would be something like:

async function closeGracefully () {
	// do stuff here to prepare for shutdown
	await killDbConnection();
	await app.close();
}

process.once('SIGTERM', closeGracefully);

By default, Fly will wait 5 seconds to kill a VM after sending the SIGTERM signal but you can extend that with the kill_timeout setting. Docs.

You could have an app or machine always up listening to changes on a db or a pubsub service and orchestrating via the machines API. This way you would have total control of the scaling and don’t need HTTP.
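As a rough sketch of that orchestrator idea, such an always-up app could start and stop pre-created machines through the Machines API. This assumes the public Machines API host and token auth as documented by Fly; the app name, machine ID, and token here are placeholders:

```javascript
// Start or stop a pre-created machine through Fly's Machines API.
// The endpoint shape follows Fly's Machines API docs; verify before use.
const FLY_API = 'https://api.machines.dev/v1';

function machineActionUrl(app, machineId, action) {
  return `${FLY_API}/apps/${app}/machines/${machineId}/${action}`;
}

async function setMachineState(app, machineId, action, token) {
  // action is 'start' or 'stop'
  const res = await fetch(machineActionUrl(app, machineId, action), {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`machine ${action} failed: ${res.status}`);
}
```

The orchestrator would call `setMachineState(...)` from its db/pubsub listener whenever its own scaling metric (CPU, queue depth, whatever you choose) crosses a threshold.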

You could just move the CPU heavy process to different machines outside your main app like I did and trigger those with HTTP or the machines API. See my long comment above.

via the machines API.

Yep, was looking at that, though I haven’t played with it yet (Working with the Machines API · Fly Docs). If I had an inventory of machines that I’d created ahead of time, and just start/stop them on demand, that seems pretty straightforward. I’m not sure what would be needed to create machines on the fly - it appears that they need a public docker image(?) rather than being able to use an app that’s already set up.

If you run fly m list on an existing app, you can use that image to create new machines. As an example:
fly m list -a silly-test-app

4d896d6a206428	rough-pond-3961 	started	ord   	billy-rust:deployment-01H03C3SMPR09MFBQ6S0DTC3WZ	fdaa:1:3ad4:a7b:8c31:c327:d442:2	      	2023-05-10T17:44:33Z	2023-05-10T17:44:40Z	v2          	app          	shared-cpu-1x:256MB	

You can use the image shown in that listing as the image for new machines.

Do note though that the machine you’re getting the image from has to be in the same org, so you sadly can’t use billy-rust itself.


Respectfully, this doesn’t work for me, man. I shouldn’t have to break my app up into different services because Fly lacks the most common-sense, industry-standard autoscaling criterion there is. Introducing a service boundary just to get around that makes zero sense.


There are pros and cons to both approaches.

But anyway, you can implement CPU autoscaling yourself with the machines API if you wish to do so.


Do you work for Fly or something? Everyone having to implement it themselves is the point of this complaint.

If we don’t offer the feature set you’d like to see, it’s fair to observe that. But Pier’s just an engaged community member sharing solutions, so it’s definitely not his fault! :smile:


Me? Work for Fly? :joy:

I wish.

Yeah it would be great if Fly solved all of our problems. Two years ago I was also asking about CPU autoscaling but I think my current solution is even better as I have more control and more flexibility.

I’ve tried everything under the sun (Google Cloud Run etc) and Fly machines have been really the best solution I’ve found for my use case.


Can I just get an objective analysis from someone at Fly about the actual issue? I’d like to know the official stance on this, roadmap-wise, before I commit a ton of resources to this platform. I’ve tried off-the-norm platforms with hefty bills, only to go crawling back to AWS.

If I was a turd, my bad.

@akutruff I’m not from Fly, but if your service is HTTP, then one option is having overloaded / maxed-out servers reject requests and include the Fly-Reject header in the response.

This will cause the fly proxy to try another server instead of returning the rejected response to the client.

It would require your app to be aware of its cpu utilization but depending on your use case it might be a suitable solution.

Appreciate the suggestion, but even then that requires a rejection, and as you said, my application is now coupled to Fly.

The only clean and simple path that handles the majority of scenarios is a simple scaling criterion, just like AWS: hit sustained 80% (or whatever) utilization for a certain time period and spin another machine up. Fly is doing the “right” thing by making a simple scale-up/scale-down metric with bounds, but they are using requests per second, which is indirectly trying to measure resource utilization.

Machines are resource boundaries. When a resource is maxed, we need another machine - whether that’s CPU, I/O, GPU, etc. Any other way of defining a scaling trigger will always require someone to run a benchmark to estimate whatever “requests per second” means in their use case. It will always be an attempt to identify a single resource bottleneck. This would have to be done every single build to maintain the same optimality you get with AWS-style scaling criteria. For monolithic apps with heterogeneous time-dependent workloads, that ain’t possible.


We actually shipped a preview of metrics based autoscaling to a handful of users. It wasn’t right for them. CPU autoscaling just isn’t all that important to us right now. That might change in the future.

And, this is nuanced, but we’re not using requests per second to scale. We’re using concurrent requests. And it’s (effectively) immediate. Our previous autoscaling was metrics based, and it didn’t react as quickly as people wanted.

You’ll actually get much quicker scaling behavior with Fly-Reject and an app that’s CPU aware. Metrics based autoscaling is inherently laggy. Yes there’s code involved, but I don’t think it’s much more complicated than the code to configure AWS autoscaling with Terraform.

It does sound like you might be better off with AWS though. They already do what you’re looking for!