I’m looking at using Fly, but my workload’s scaling criteria need to be based on utilization rather than requests. Is the only way to do this by setting up a service that manually calls flyctl? That seems really weird.
What’s your use case?
I have an audio encoding service running on Fly, which is a CPU-heavy task. It scales down to zero and autoscales to encode multiple audio files concurrently (up to a point).
I don’t really scale based on CPU utilization, but rather on the number of concurrent encodings.
Would that work for you? I can give you an overview of the setup.
@pier Not OP, but I’m also thinking through autoscaling. How do you track the number of concurrent encodings across your cluster? (I’m assuming Redis or Postgres).
Do you just have your process elect to exit when it’s idle and cluster-wide encodings are below a certain threshold? How do you avoid all of your machines shutting down?
In my case, I need to ensure I always have one machine up in my cluster, because part (but not all) of the workload comes through email. I also have some parts of the application which are fired off via webhooks (specifically, from Square, when an order is placed).
I could have the proxy manage starting machines based on this webhook traffic. The tricky part, however, is that the webhook kicks off a minutes-long job, which I run asynchronously, so from the Fly proxy’s perspective that machine is no longer serving a request. It’d be nice to be able to signal to the proxy that a machine is at capacity, and I haven’t found a way to do that yet.
So here’s how my setup works: I have an app with multiple machines created beforehand that can handle peak demand. Asleep machines are very cheap (can’t find the pricing now), so it’s OK to just manually create as many as you think you’ll need.
In the future I will create machines on demand but I haven’t had the time to play with the machines API yet. And this setup is working great so far.
I explained a scale-to-zero approach in this post:
I also have another app, let’s call it the job worker, which is permanently awake and orchestrates the whole thing. It responds to a trigger on the Jobs table in Postgres via listen/notify. It could be some pub/sub service too.
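If it helps, here’s a minimal sketch of that listener in Node, assuming the pg client, a hypothetical new_job NOTIFY channel fired by a trigger on the Jobs table, and Node 18+ for the global fetch:

// The always-awake worker reacts to new rows in the Jobs table.
// Assumes a Postgres trigger runs: NOTIFY new_job, '<job id>';
const { Client } = require('pg');

const db = new Client({ connectionString: process.env.DATABASE_URL });

async function main () {
  await db.connect();
  await db.query('LISTEN new_job');
  db.on('notification', async (msg) => {
    const jobId = msg.payload; // job id sent by the trigger
    // Wake an encoding machine by requesting the encoding app
    // (hypothetical internal URL).
    await fetch(`http://encoder.internal:8080/encode/${jobId}`, { method: 'POST' });
  });
}

main();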
So the worker sends the HTTP requests to the encoding app to trigger the audio encoding. Basically, Fly’s routing layer wakes the machines as needed when new requests are made to the encoding app. With the auto scaling settings (see the post) Fly will try to use a single request per machine as long as there are available machines to wake up.
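For reference, the relevant fly.toml settings look roughly like this (the values are illustrative; the post linked above has the details). A hard limit of one concurrent request per machine is what makes Fly spread requests across machines:

[http_service]
  internal_port = 8080
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

  [http_service.concurrency]
    type = "requests"
    hard_limit = 1
    soft_limit = 1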
One thing to take into consideration is that requests will time out after 60 seconds if no bytes go through Fly’s routing layer. So either complete the job before 60s, or stream some bytes so that Fly doesn’t time out, or just respond to the request ASAP and then do the CPU work.
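A minimal sketch of the stream-some-bytes approach, where encode() is a hypothetical stand-in for the long-running job:

// Write a byte every 10s so the Fly proxy never sees 60s of silence.
const http = require('http');

http.createServer(async (req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  const keepAlive = setInterval(() => res.write(' '), 10000);
  try {
    await encode(req); // hypothetical CPU-heavy job
    res.end('done');
  } finally {
    clearInterval(keepAlive);
  }
}).listen(8080);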
I don’t really track it, but if I needed to do it I could just do something like this:
select * from "AudioEncoding" where "state" = 'ENCODING'
Like I said, all the encoding machines shut down eventually. The job worker who orchestrates all this is always awake.
Like I said, to solve this you could stream some bytes before ending the response. Streaming is a bit of a headache though, because you can’t change the HTTP status once you’ve started, and there are other considerations.
If you don’t have control over the app making the request, you could ask them to give you a webhook URL you could call when the long running task is complete.
I could have web and worker processes; web would receive webhooks and also check email for incoming work, and dispatch jobs to workers via an HTTPS URL that blocks until the job is done. That would make Fly autoscale the worker cluster, so it’d work reasonably well.
However, it means that I’d need to manually scale the web cluster if it ever got too busy. That’s OK, because it wouldn’t be doing anything too expensive, except holding (potentially) lots of open connections to workers. I don’t know how to do zero-downtime deploys with this setup either, but there’s probably a way.
I’d really prefer a single role of self-configuring server, though. It makes the architecture cleaner, and also makes it easier to work in the development environment. For that to work I’d need two things:
a) a way to set a minimum size for a cluster (this seems like an easy thing that could be added in the processes block); I cannot scale below 1, because as mentioned upthread I need to check email regularly.
b) a way to trigger scaling that’s not related to networking. I looked at the docs for the webhook host; they will close the connection after 10 seconds and then start incrementally backing off and retrying, so I can’t keep the connection open. They also don’t accept streamed bytes. I think the ability to set autoscaling via CPU utilization is a missing feature.
Your app will receive a signal to shut down, and you can do stuff before shutting down, like finishing a job and not accepting any new jobs on that particular VM.
In Node this would be something like:
async function closeGracefully () {
  // Do stuff here to prepare for shutdown, e.g. finish the current job
  // and stop accepting new ones.
  await killDbConnection(); // app-specific cleanup helpers
  await app.close();
  process.exit();
}

process.once('SIGTERM', closeGracefully);
By default, Fly will wait 5 seconds to kill a VM after sending the SIGTERM signal, but you can extend that with the kill_timeout setting. Docs.
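For example, in fly.toml (120 seconds here is just an illustrative value):

# Give the app up to two minutes of cleanup time after SIGTERM.
kill_timeout = 120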
You could have an app or machine that’s always up, listening to changes on a DB or a pub/sub service and orchestrating via the machines API. This way you’d have total control over the scaling and wouldn’t need HTTP.
You could just move the CPU heavy process to different machines outside your main app like I did and trigger those with HTTP or the machines API. See my long comment above.
via the machines API.
Yep, was looking at that, though I haven’t played with it yet (Working with the Machines API · Fly Docs). If I had an inventory of machines created ahead of time and just started/stopped them on demand, that seems pretty straightforward. I’m not sure what would be needed to create machines on the fly - it appears they need a public Docker image(?) rather than being able to use an app that’s already set up.
If you run fly m list on an existing app, you can use that image to create new machines. As an example:
fly m list -a silly-test-app
4d896d6a206428 rough-pond-3961 started ord billy-rust:deployment-01H03C3SMPR09MFBQ6S0DTC3WZ fdaa:1:3ad4:a7b:8c31:c327:d442:2 2023-05-10T17:44:33Z 2023-05-10T17:44:40Z v2 app shared-cpu-1x:256MB
You can specify registry.fly.io/billy-rust:deployment-01H03C3SMPR09MFBQ6S0DTC3WZ as the image.
Do note though that the machine you’re getting the image from has to be in the same org, so you sadly can’t use billy-rust itself.
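As a rough sketch, creating a machine from that image via the Machines API could look like this in Node (the machine name and guest sizing are made up; check the Machines API docs for the full config schema):

// Create a new machine from an existing deployment image.
// Assumes FLY_API_TOKEN is set and Node 18+ for the global fetch.
async function createMachine () {
  const res = await fetch('https://api.machines.dev/v1/apps/silly-test-app/machines', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.FLY_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name: 'encoder-1', // hypothetical machine name
      config: {
        image: 'registry.fly.io/billy-rust:deployment-01H03C3SMPR09MFBQ6S0DTC3WZ',
        guest: { cpu_kind: 'shared', cpus: 1, memory_mb: 256 },
      },
    }),
  });
  console.log(await res.json());
}

createMachine();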
There are pros and cons to both approaches.
But anyway, you can implement CPU autoscaling yourself with the machines API if you wish to do so.
If we don’t offer the feature set you’d like to see, it’s fair to observe that. But Pier’s just an engaged community member sharing solutions, so it’s definitely not his fault!
Me? Work for Fly?
I wish.
Yeah it would be great if Fly solved all of our problems. Two years ago I was also asking about CPU autoscaling but I think my current solution is even better as I have more control and more flexibility.
I’ve tried everything under the sun (Google Cloud Run etc) and Fly machines have been really the best solution I’ve found for my use case.
@akutruff I’m not fly.io, but if your service is HTTP then one option is having overloaded / maxed-out servers reject requests and include the Fly-Reject header in the response.
This will cause the Fly proxy to try another server instead of returning the rejected response to the client.
It would require your app to be aware of its CPU utilization, but depending on your use case it might be a suitable solution.
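A rough sketch of what that could look like in Node (the load-average heuristic and the 0.8 threshold are assumptions, and the header semantics are as described above):

// CPU-aware server that sheds load so the Fly proxy retries elsewhere.
const http = require('http');
const os = require('os');

http.createServer((req, res) => {
  // Rough utilization: 1-minute load average normalized by core count.
  const load = os.loadavg()[0] / os.cpus().length;
  if (load > 0.8) {
    res.writeHead(503, { 'Fly-Reject': 'true' }); // ask the proxy to try another machine
    return res.end();
  }
  // ...handle the request normally...
  res.end('ok');
}).listen(8080);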
We actually shipped a preview of metrics based autoscaling to a handful of users. It wasn’t right for them. CPU autoscaling just isn’t all that important to us right now. That might change in the future.
And, this is nuanced, but we’re not using requests per second to scale. We’re using concurrent requests. And it’s (effectively) immediate. Our previous autoscaling was metrics based, and it didn’t react as quickly as people wanted.
You’ll actually get much quicker scaling behavior with Fly-Reject and an app that’s CPU aware. Metrics-based autoscaling is inherently laggy. Yes, there’s code involved, but I don’t think it’s much more complicated than the code to configure AWS autoscaling with Terraform.
It does sound like you might be better off with AWS though. They already do what you’re looking for!
Hi @akutruff, it’s probably not what you’d like or ideally want, but here’s a basic example of what you could do to scale based on CPU:
#!/bin/bash
if [ -z "$FLY_API_TOKEN" ]; then
echo "missing required FLY_API_TOKEN environment variable"
exit 1
fi
api_base=https://api.machines.dev
while getopts a:c: flag
do
case "${flag}" in
a) app_name=${OPTARG};;
c) cpu_threshold=${OPTARG};;
\?) # Invalid option
echo "Error: Invalid option"
echo "usage: ${0} -a app_name"
exit;;
esac
done
app_name="${APP_NAME:-$app_name}"
if [ -z "$app_name" ]; then
echo "missing required -a flag or APP_NAME environment variable"
exit 1
fi
cpu_threshold="${CPU_THRESHOLD:-$cpu_threshold}"
if [ -z "$cpu_threshold" ]; then
echo "missing required -c flag or CPU_THRESHOLD environment variable"
exit 1
fi
echo "autoscaling ${app_name} based on CPU threshold: ${cpu_threshold}"
while :
do
sleep_seconds=60
echo "getting CPU utilization"
cpu_util=$(curl -sS -H "Authorization: Bearer ${FLY_API_TOKEN}" \
https://api.fly.io/prometheus/fly/api/v1/query \
--data-urlencode "query=sum(increase(fly_instance_cpu{app=\"${app_name}\", mode!=\"idle\"}[60s]))/60 / sum(count(fly_instance_cpu{app=\"${app_name}\", mode=\"idle\"})without(cpu))" \
| jq -r '(.data.result[0].value[1] | tonumber)*100 | floor')
echo "current CPU utilization: ${cpu_util}"
if [[ "$cpu_util" -gt "$cpu_threshold" ]]; then
echo "need moar instances!"
stopped_machines=()
for machine in $(curl -sS -XGET -H "Authorization: Bearer ${FLY_API_TOKEN}" -H "Content-Type: application/json" \
"${api_base}/v1/apps/${app_name}/machines" \
| jq -jr '.[] | select( .state == "stopped" ) | .id, " "'); do
stopped_machines+=("$machine")
done
size=${#stopped_machines[@]}
if [[ "$size" -eq 0 ]]; then
echo "no more machines to start! considering adding more"
else
echo "number of machines eligible to start: ${size}"
index=$(($RANDOM % $size))
random_machine=${stopped_machines[$index]}
echo "starting ${random_machine}"
curl -sS -XPOST -H "Authorization: Bearer ${FLY_API_TOKEN}" -H "Content-Type: application/json" \
"${api_base}/v1/apps/${app_name}/machines/${random_machine}/start"
echo "machine successfully started, waiting 5 minutes before checking CPU utilization again"
sleep_seconds=300
fi
fi
sleep $sleep_seconds
done
With the above, it does a Prometheus query for the app’s CPU utilization and will start any stopped machines until there are none left to start. You could use this in conjunction with our auto_stop proxy feature so your app doesn’t have to worry about when to scale down to zero.
You can deploy the above as a separate app that’s always running, and because we give customers direct access to their metrics, it can be tailored to suit your needs. If your app is able to quantify its work enough to expose it as custom metrics, our system can scrape those and they could then be used to make your autoscaling decisions.
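As a hedged sketch of that custom-metrics idea, using the prom-client package and a made-up gauge name (Fly scrapes whatever endpoint you configure in fly.toml’s [metrics] section):

// Expose a concurrency gauge for the metrics scraper to pick up.
const http = require('http');
const client = require('prom-client');

const activeEncodings = new client.Gauge({
  name: 'app_active_encodings', // hypothetical metric name
  help: 'Number of encoding jobs currently running',
});
// In your job code: activeEncodings.inc() on start, activeEncodings.dec() on finish.

http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.writeHead(200, { 'Content-Type': client.register.contentType });
    return res.end(await client.register.metrics());
  }
  res.writeHead(404);
  res.end();
}).listen(9091);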
Hope this helps!