Automatically starting/stopping Apps v2 instances

We’ve introduced a new feature to automatically start/stop instances. When enabled, the proxy will scale instances of your app up/down as demand changes.

This feature can be enabled in the services section of your fly.toml:

[[services]]
  # automatically start machines
  auto_start_machines = true
  # automatically stop machines
  auto_stop_machines = true
  ...
  ...
  internal_port = 8080
  protocol = "tcp"

Similarly, if you're using http_service:

[http_service]
  # automatically start machines
  auto_start_machines = true
  # automatically stop machines
  auto_stop_machines = true
  ...
  ...
  internal_port = 8080
  protocol = "tcp"

Default settings

New apps have both automatic starting and automatic stopping enabled by default:

auto_start_machines = true
auto_stop_machines = true

Existing applications are automatically started but not automatically stopped:

auto_start_machines = true
auto_stop_machines = false

When should I use it?

This feature is useful if you have highly variable workloads: your instances can start and stop automatically as demand increases and decreases. The central benefit is cost reduction. Instead of running excess instances to handle peak load, only as many instances as necessary are running at any given time, saving you your hard-earned :money_with_wings:.

This feature is slightly different from typical autoscaling in that we don't create instances for you up to a specified maximum; the proxy only starts instances that already exist. If you want 10 instances available to service requests, you need to create those 10 instances of your app yourself (for example, by cloning an existing Machine with fly machine clone).

Recommended use

It is recommended to set both settings to the same value. If auto_start_machines is enabled but auto_stop_machines is disabled, the proxy will start your instances but they will never be stopped. This is fine if you want to stop instances manually; if not, your instances will be left running indefinitely (and cost you your hard-earned money!).

If auto_start_machines is disabled but auto_stop_machines is enabled, the proxy will scale your instances down but will not be able to start them again. If all of your instances are scaled down, requests will start failing.

When not to use it

At the moment, we don't support specifying a minimum number of running machines. Apps will scale down to zero if auto_stop_machines is enabled and there's no traffic. If you need your application to be "always on", disable auto_stop_machines.
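For example, keeping an app always on would look like this (a minimal sketch; the port value is just an example):

[http_service]
  internal_port = 8080
  # instances can still be started on demand
  auto_start_machines = true
  # never stop instances automatically, so the app stays always on
  auto_stop_machines = false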

How does it work?

The settings auto_start_machines and auto_stop_machines instruct our internal Fly Proxy to automatically start/stop instances of your app (which are Fly Machines).

Autostart

If auto_start_machines is enabled, the proxy will automatically start instances as follows:

  • A new request is made to your application
  • All the running instances are above their soft limit (see the concurrency example after this list)
  • If there are stopped instances, the proxy will pick one from the nearest region and start it
  • The request will then be sent to the started instance
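The soft limit referenced above comes from your service's concurrency settings in fly.toml. A minimal sketch of how the pieces fit together (the limit values and port are illustrative, not recommendations):

[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_start_machines = true
  auto_stop_machines = true

  # the proxy compares each instance's load against soft_limit
  # when deciding whether to start (or stop) Machines
  [services.concurrency]
    type = "connections"
    soft_limit = 20
    hard_limit = 25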

Autostop

If auto_stop_machines is enabled, the proxy will automatically stop instances as follows:

  • The proxy looks at all instances of your app in a given region, e.g. fra
  • It finds out how many of these instances are above and below their soft limit
    • If there is more than one instance in the region, it calculates whether there is excess capacity in the region using the formula: excess instances = num of instances - (num of instances over soft limit + 1). For example, if we had 9 instances and 4 of them were over their soft limit, the excess capacity is: excess = 9 - (4 + 1) = 4, meaning we have 4 more instances than we need to service the current traffic.
    • This algorithm is based on the assumption that you'd need one more instance than the number of instances over their soft limit, i.e. we need 5 instances running if 4 are over their soft limit. We'll be monitoring how this plays out in production settings and will adjust it if necessary.
  • If there is excess capacity, one instance is stopped.
  • If there is only one instance in the region, the proxy checks if it has any load. If its load is 0 (i.e. there is no traffic to your instance), then it is stopped.
  • This process runs every few minutes. If several instances can be stopped, stopping them happens over a period of time, as only one instance is stopped per iteration of this process.

Again, I’d like to point out that this downscaling process happens on a regional basis. If you have traffic in ams but not in fra, your fra instances will be stopped but your ams instances will remain running (subject to the formula).

Feedback

If you have feedback, comments, or questions, please share!

27 Likes

This seems like a nice feature, and although auto_stop_machines seems to work nicely, auto_start_machines doesn’t seem to work for me. I set it to false, so I can temporarily pause my machines by doing fly machine stop [machine id]. However, it seems the machine spins back up again as soon as a request comes in, even when auto_start_machines is set to false. I want the machine to remain suspended until I manually start it again. Am I doing something wrong?

1 Like

Can we set a minimum time for each machine to stay alive, and then auto-stop?

3 Likes

Thanks for this! We just identified the cause of this issue. I’ll let you know when we’ve deployed the fix.

2 Likes

At this point in time, unfortunately not. However, I'm interested to hear what use cases you would need this for.

One way to go about this is to handle stopping the machine yourself. If you need your machine to stay alive for a period of time each time it’s started, it could just exit after that duration.

2 Likes

That’s great! You guys are doing an amazing job.

4 Likes

What's the minimum supported duration? Kurt once mentioned that a duration below 120s is probably too low: Machine did not have a restart policy, defaulting to restart - #12 by kurt

Also, when exiting the machine, does the Fly Proxy send the kill_signal (as defined in fly.toml) and wait for kill_timeout secs?

1 Like

Yesterday it looked like it was 30 seconds. As you know, it is probably all subject to change. :slight_smile:

1 Like

Not yet. We’re currently working to expose this for Apps v2.

2 Likes

I already wrote a script doing that; it works great so far.
It was just to know if I could get rid of my script.

My users come in batches (students at a school), with a few minutes between each "batch", and I scale an app to zero when there are no users planned for a long period of time (at night in my region, or sometimes during the day).
I don't want to stop the machine between those batches during the peak of activity, because the machine would start and stop every 5 minutes and wipe all the local cache.

1 Like

Doesn’t seem to work for me.

My flyctl is up to date. I've run fly launch for a Django app, and

 [http_service]
     internal_port = 8000
     force_https = true
     auto_stop_machines = true
     auto_start_machines = true

has been automatically added to fly.toml.

With a single machine it was indeed working, stopping it when no requests were coming in, although I guess it was way too fast: a few seconds without a request and the machine was stopped.

But when cloning it (in the same region), it didn't scale down to 1 or to 0, and both machines were still in a started state after 10+ minutes although no requests were made.

Edit: I posted this message around 10AM; it's now 2AM. The 2 instances stayed "started" all day and it didn't scale down at all. I destroyed one machine and the one left stopped a little bit after, so it confirmed what I described earlier.

Edit 2: Just cloned 2 times, thus 3 machines, and two of them stopped, so either it's a bug when you have 2 machines or I don't get the logic.
For 3 instances with 0 over their soft limit: excess = 3 - (0 + 1) = 2, so you stop 2: 2 are stopped and 1 is running, as expected. But for 2, with the same logic, 1 should be running and 1 stopped; here they both stay in a started state.

On a side note, shouldn't scaling down to zero (and suffering from cold starts) be an opt-in option?

3 Likes

Hi, do service HTTP health checks wake up machines?

We shipped a fix for this last week. Let me know if it is working for you.

1 Like

Health checks do not wake up machines.

3 Likes

I just tried to reproduce this issue myself and didn’t come across it. It’s likely that the updated proxy wasn’t deployed in the region your application was running on at the time of the post. Is this still a problem for you?

It's a good point. We're still thinking internally about whether and how we would support it. That may look like another configuration option, or it could be part of a more advanced autoscaling implementation.

I mean, why would it have worked for 1 and 3 machines but not 2 if that was the case?

I tried to reproduce, and:

With 1 instance it gets stopped almost immediately. With 2, one stays started and the other one stops; sending a bunch of requests makes the second one start almost immediately too.

But with 3 instances, without sending any requests, 2 stay started and only one gets stopped. (Edit: I've sent a bunch of requests and now 3 out of 3 are started, but they're staying in a started state and it doesn't seem they'll stop.)

Do you mind sharing your application name? It'll be easier for me to debug if I can view logs and see if things are working on the hosts your application is on.

PM me if it's possible here; it's in cdg, btw. But yeah, things aren't consistent: the three instances are still in a "started" state without any requests sent to them.

1 Like

We just shipped a fix that should solve this. Let me know if it's working for you.

They're still all in a started state. Do I need to update to 0.0.541 and run fly launch again?

Edit: Updated flyctl and nuked everything. The results:

First I noticed that this time fly launch spawned 2 machines and not 1 (I don’t have different processes and am using overmind).

I then cloned to have 3 instances; the 3 instances then stopped.

I sent a bunch of requests and the 3 started.

Then the 3 went into a "stopped" state.

I then destroyed a machine and reproduced with 2: the 2 started when I sent a bunch of requests, and they stopped one by one after that.

So it seems to work now.

1 Like