Autoscaling on CPU utilization?

JP_Phillips · May 11, 2023, 7:34pm

Hi @akutruff, this probably isn’t what you’d ideally want, but here’s a basic example of how you could scale based on CPU:

#!/bin/bash

if [ -z "$FLY_API_TOKEN" ]; then
    echo "missing required FLY_API_TOKEN environment variable"
    exit 1
fi

api_base=https://api.machines.dev

while getopts a:c: flag
do
    case "${flag}" in
        a) app_name=${OPTARG};;
        c) cpu_threshold=${OPTARG};;
        \?) # Invalid option
         echo "Error: Invalid option"
         echo "usage: ${0} -a app_name -c cpu_threshold"
         exit 1;;
    esac
done

app_name="${APP_NAME:-$app_name}"
if [ -z "$app_name" ]; then
    echo "missing required -a flag or APP_NAME environment variable"
    exit 1
fi

cpu_threshold="${CPU_THRESHOLD:-$cpu_threshold}"
if [ -z "$cpu_threshold" ]; then
    echo "missing required -c flag or CPU_THRESHOLD environment variable"
    exit 1
fi

echo "autoscaling ${app_name} based on CPU threshold: ${cpu_threshold}"

while :
do
  sleep_seconds=60

  echo "getting CPU utilization"
  cpu_util=$(curl -sS -H "Authorization: Bearer ${FLY_API_TOKEN}" \
    https://api.fly.io/prometheus/fly/api/v1/query \
    --data-urlencode "query=sum(increase(fly_instance_cpu{app=\"${app_name}\", mode!=\"idle\"}[60s]))/60 / sum(count(fly_instance_cpu{app=\"${app_name}\", mode=\"idle\"})without(cpu))" \
    | jq -r '(.data.result[0].value[1] | tonumber)*100 | floor')

  echo "current CPU utilization: ${cpu_util}"

  if [[ "$cpu_util" -gt "$cpu_threshold" ]]; then
      echo "need moar instances!"
      stopped_machines=()
      for machine in $(curl -sS -XGET -H "Authorization: Bearer ${FLY_API_TOKEN}" -H "Content-Type: application/json" \
        "${api_base}/v1/apps/${app_name}/machines" \
        | jq -jr '.[] | select( .state == "stopped" ) | .id, " "'); do

        stopped_machines+=($machine)
      done
      size=${#stopped_machines[@]}

      if [[ "$size" -eq 0 ]]; then
        echo "no more machines to start! consider adding more"
      else
        echo "number of machines eligible to start: ${size}"
        index=$(($RANDOM % $size))
        random_machine=${stopped_machines[$index]}
        echo "starting ${random_machine}"
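        # --fail makes curl exit non-zero on an HTTP error, so success is only reported when the start call worked
        if curl -sS --fail -XPOST -H "Authorization: Bearer ${FLY_API_TOKEN}" -H "Content-Type: application/json" \
          "${api_base}/v1/apps/${app_name}/machines/${random_machine}/start"; then
          echo "machine successfully started, waiting 5 minutes before checking CPU utilization again"
          sleep_seconds=300
        else
          echo "failed to start ${random_machine}, will retry on the next check"
        fi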
      fi
  fi

  sleep $sleep_seconds
done

The script above runs a Prometheus query for the app’s CPU utilization once a minute and, whenever utilization exceeds the threshold, starts one stopped machine at random, until there are no stopped machines left to start. You could use this in conjunction with our auto_stop proxy feature so your app doesn’t have to worry about when to scale down to zero.
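One thing to watch for: if the Prometheus query returns no series (for example, right after a deploy), jq fails and cpu_util ends up empty. Here is a minimal sketch of a guard for that case. The helper name should_scale_up is hypothetical and not part of the script above; it only compares two integers and treats anything else as "no data".

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script).
# Exit status: 0 = scale up, 1 = below or at threshold, 2 = input unusable.
should_scale_up() {
  local cpu_util="$1" cpu_threshold="$2"
  # Reject empty or non-numeric values, e.g. when the query returned no data.
  if ! [[ "$cpu_util" =~ ^[0-9]+$ && "$cpu_threshold" =~ ^[0-9]+$ ]]; then
    return 2
  fi
  # Force base 10 so a value such as "08" is not parsed as octal.
  (( 10#$cpu_util > 10#$cpu_threshold ))
}
```

In the loop, this could stand in for the `[[ "$cpu_util" -gt "$cpu_threshold" ]]` check, e.g. `if should_scale_up "$cpu_util" "$cpu_threshold"; then ...`, with a status of 2 logged as a skipped iteration.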

You can deploy the above as a separate App that’s always running, and because we give customers direct access to their metrics, it can be tailored to suit your needs. If your app can quantify its request load well enough to expose it as custom metrics, our system can scrape those, and you could then use them to drive your autoscaling decisions.
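As a rough sketch of the custom-metrics idea, the PromQL string in the script could be built from a metric your app exports instead of fly_instance_cpu. The helper build_query and the metric name myapp_queue_depth are hypothetical, and this assumes the scraped series carry the same app label as the built-in metrics.

```shell
#!/bin/bash
# Hypothetical helper: build a PromQL query that averages a custom gauge
# across all instances of an app. The metric name is whatever your app exports.
build_query() {
  local metric="$1" app="$2"
  printf 'avg(%s{app="%s"})' "$metric" "$app"
}

# Intended use inside the loop (the metric name is an assumption):
#   --data-urlencode "query=$(build_query myapp_queue_depth "$app_name")"
```

The threshold would then be expressed in the metric's own units (queue depth, in-flight requests, and so on) rather than as a CPU percentage.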

Hope this helps!
