Setting a minimum number of instances to keep running when using auto start/stop

We now support setting a minimum number of machines to keep running when using the automatic start/stop feature for Apps v2. This prevents the specified number of machines from being stopped. Update your flyctl to the latest version, then add the setting to your fly.toml:

[[services]]
  auto_start_machines = true
  auto_stop_machines = true
  min_machines_running = 1
  ...
  ...

Similarly, if you’re using http_service:

[http_service]
  auto_start_machines = true
  auto_stop_machines = true
  min_machines_running = 1
  ...
  ...

When should you use this?*

If instances of your application take a while to start and that is unacceptable for your use case, you will benefit from having at least 1 instance always running (min_machines_running = 1). When a new request comes in, the app can respond immediately instead of waiting for an instance to start after being scaled down completely (i.e. the cold start problem).

What you need to know

The most important thing to know is that we only keep instances running in the primary region of your app. All other regions will still get scaled down to 0. As an example, if min_machines_running = 3, you’ll need 3 or more instances in your primary region.
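If you’re not sure which regions your machines are in, you can check and, if needed, add one to the primary region roughly like this (the machine ID and region are placeholders, and exact flags may differ between flyctl versions):

  # List your app's machines and the region each one runs in
  fly machines list

  # If too few machines live in the primary region, clone an existing one into it
  fly machine clone <machine-id> --region <primary-region>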

Some other things to know:

  • The max number of machines we can scale up to is implicitly defined by the number of machines your app has. We will scale your app all the way up if demand requires it, and back down to the specified minimum.
  • The default minimum is 0.

* This does not solve the cold start problem entirely. When a request comes in and the proxy decides to start a new instance, that request waits for the new instance to start; we don’t serve it from an already running instance while the new one boots. So while you may not run into a cold start for your first instance, if we start a second one, that request will run into it. We’re giving some thought to how to solve this and, as always, will post here once we’ve got a solution for you :smile:

17 Likes

Awesome @senyo ! Thanks so much for tackling this feature so fast. :rocket:


I just tried it out, but I have the feeling that it doesn’t work correctly.

My fly.toml looks like this:

app = "peter-kuhmann-website"
primary_region = "ams"

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1

I have two machines (I cloned one of them just recently).

According to the monitoring and the logs, both machines got scaled down:

2023-05-12T19:05:56.552 proxy [6e82d956a79408] ams [info] Downscaling app peter-kuhmann-website in region ams. Automatically stopping machine 6e82d956a79408. 2 instances are running, 0 are at soft limit, we only need 1 running

2023-05-12T19:05:56.558 app[6e82d956a79408] ams [info] Sending signal SIGINT to main child process w/ PID 513

2023-05-12T19:05:56.746 app[6e82d956a79408] ams [info] Starting clean up.

2023-05-12T19:05:57.746 app[6e82d956a79408] ams [info] [ 405.553727] reboot: Restarting system

2023-05-12T19:07:18.119 proxy [5683d920b1618e] ams [info] Downscaling app peter-kuhmann-website in region ams. Automatically stopping machine 5683d920b1618e. 1 instance is running but has no load

2023-05-12T19:07:18.122 app[5683d920b1618e] ams [info] Sending signal SIGINT to main child process w/ PID 513

2023-05-12T19:07:18.628 app[5683d920b1618e] ams [info] Starting clean up.

2023-05-12T19:07:19.630 app[5683d920b1618e] ams [info] [ 503.747677] reboot: Restarting system

Interesting: on the first downscale it seems to “know” the min setting (2 instances are running, 0 are at soft limit, we only need 1 running).

But the second check doesn’t seem to take it into account: 1 instance is running but has no load.

Did I miss a specific configuration or precondition?

Best
Peter :slight_smile:

1 Like

Appreciate the kind words!

Ah, it’s possibly due to an outdated flyctl version. What version are you using? I forgot to mention upgrading it in the post; I’ve added that now.
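If it helps, here’s roughly how to check your version and pull the latest release (the install script is the same one that appears further down this thread):

  # Show the currently installed flyctl version
  fly version

  # Re-run the official install script to get the latest release
  curl -L https://fly.io/install.sh | sh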

1 Like

That was the solution. Now it works as expected. Brilliant!

Learned something as a Fly-newbie: New feature, check for flyctl updates! :smiling_face:

5 Likes

Awesome, glad it’s working!

Superb! A lot of great work by a really great team!

1 Like

This is awesome! I literally made a post about this pain point a few days ago and came to the forums for another reason, only to see that it’s implemented as a feature!

Thanks for all the hard work fly team :slight_smile:

3 Likes

Can you change min_machines_running using the CLI, so we can have scripts lower it at night?

We currently don’t support changing that setting via the CLI.
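One possible workaround (just a sketch, not an official feature) is to have a scheduled job rewrite fly.toml and redeploy:

  # Hypothetical nightly job: drop the minimum to 0, then redeploy
  # (GNU sed shown; adjust the in-place flag for macOS)
  sed -i 's/min_machines_running = 1/min_machines_running = 0/' fly.toml
  fly deploy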

Hi,

I’m trying to set min_machines_running = 1 to reduce cold starts, but it seems like my deployment still scales down to 0.

It should be using the latest version of the CLI, as it gets fetched by a CI job that installs it like this:

curl -L https://fly.io/install.sh | sh
export FLYCTL_INSTALL="/home/runner/.fly"
export PATH="$FLYCTL_INSTALL/bin:$PATH"

My fly.toml configuration looks like this:

app = "frictionless-chat-production"
primary_region = "ewr"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1

Is there something I’m missing to set the min instances?

Thanks,
Carter

I looked at your app and both of your app’s machines are running in the region iad. However, your primary_region is set to ewr. Autostop only keeps machines running in the primary region of your application. Did you do anything that caused your machines to deploy in iad? If not, then it’s an issue on our side.
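If it helps, here’s roughly how you can spot and remove a machine that ended up in the wrong region (the machine ID below is a placeholder):

  # Show each machine and the region it runs in
  fly machines list

  # Stop and destroy a machine in the wrong region
  fly machine stop <machine-id>
  fly machine destroy <machine-id>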

Hi Senyo, thanks for helping look into this. I destroyed the iad machines and deployed from my CI build automation, and scaling appears to be working great now. My CI deployment scripts are all configured to deploy to ewr.

What I think happened is that when I was setting things up originally, running commands manually on my desktop several weeks ago to debug various issues, I must have made a typo once and deployed a machine to iad by accident. After that the machine stuck around, and I didn’t notice the region was wrong since my deployments were being done with the CI automation, which didn’t blow away the extra iad machines.

So thank you for helping identify the issue and pointing it out.

To reduce this type of error in the future, I wonder if there is a way to have the toml file be more explicit about the final state of the deployment so that it could be a single source of truth?

I have two process groups: web and worker.

I have web in an [http_service] section and have set:

auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0

It works well.

However, I need worker to scale to 0 as well when not in use. So I created a [[services]] section and added:

processes = ["worker"]
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0

It doesn’t work. The worker process stays alive throughout.

How can I make the worker process scale to 0 as well?

Could you post your entire config? You can remove the app name if need be.

Hello, here it is:

app = "xxxxxxx"
primary_region = "xxx"

[processes]
web = "xxxxxx xxxxx.xxxx xx"
worker = "xxxxxxxxx xxxxxx xx"

[http_service]
processes = ["web"]
internal_port = 8008
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0


[[services]]
processes = ["worker"]
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0

Your worker service has no publicly exposed ports. The autostart/autostop function is driven by our internal proxy, which only knows about your application if you have a service with exposed ports.
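For illustration, a service the proxy can manage needs at least one exposed port, along the lines of the [[services.ports]] blocks used elsewhere in this thread (the port numbers here are placeholders):

[[services]]
  processes = ["worker"]
  internal_port = 8081
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

  [[services.ports]]
    handlers = ["http"]
    port = 8080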

1 Like

@senyo this has been working pretty well. But I noticed that it does not respect process groups.

app = "brick-drop-co"
primary_region = "ord"

[build]
  strategy = "canary"

[processes]
  web = "litefs mount -config /etc/litefs.web.yml"
  dir = "litefs mount -config /etc/litefs.directus.yml"

#[http_service]
#  internal_port = 8080
#  force_https = true
#  auto_stop_machines = true
#  auto_start_machines = true
#  min_machines_running = 1
#  processes = ["web"]
#
#  [http_service.concurrency]
#    type = "connections"
#    hard_limit = 50
#    soft_limit = 25
#
#  [http_service.http_options.response.headers]
#    X-Process-Group = "web"
#    X-Frame-Options = "SAMEORIGIN"
#    X-XSS-Protection = "1; mode=block"
#    X-Content-Type-Options = "nosniff"
#    Referrer-Policy = "strict-origin-when-cross-origin"
#    Content-Security-Policy = "default-src 'self' 'unsafe-inline' 'unsafe-eval' data:; img-src * data:; font-src * data:; style-src * 'unsafe-inline'; script-src * 'unsafe-inline' 'unsafe-eval'; connect-src *; frame-src *; object-src *; media-src *; child-src *; form-action *; frame-ancestors *; block-all-mixed-content; upgrade-insecure-requests; manifest-src *; worker-src *; prefetch-src *;"
#
#  [[http_service.checks]]
#    grace_period = "240s"
#    interval = "120s"
#    method = "GET"
#    timeout = "10s"
#    path = "/"

[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["web"]

  [[services.http_checks]]
    interval = "15s"
    grace_period = "5s"
    method = "get"
    path = "/"
    protocol = "http"
    timeout = "5s"
    tls_skip_verify = true

  [[services.ports]]
    handlers = ["http"]
    port = 80
    force_https = true
    [services.ports.http_options.response.headers]
      X-Process-Group = "web"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443
    [services.ports.http_options.response.headers]
      X-Process-Group = "web"

[[services]]
  internal_port = 8054
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["dir"]

  # TODO: remove this once we have a way to set up a TCP service
  [[services.ports]]
    handlers = ["http"]
    port = 3000
    force_https = false
    [services.ports.http_options.response.headers]
      X-Process-Group = "dir"

  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  [[services.http_checks]]
    interval = "15s"
    grace_period = "5s"
    method = "get"
    path = "/admin/login"
    protocol = "http"
    timeout = "5s"
    tls_skip_verify = true

[checks]
  [checks.dir]
    grace_period = "5s"
    interval = "15s"
    method = "get"
    path = "/admin/login"
    port = 8054
    timeout = "5s"
    type = "http"
    processes = ["dir"]

[mounts]
  source = "litefs"
  destination = "/var/lib/litefs"
  processes= ["web", "dir"]

[metrics]
  port = 9091       # default for most prometheus clients
  path = "/metrics"

I can’t use the http_service because of another issue (Hanging on ‘Configuring firecracker’ in ORD - #5 by Zane_Milakovic), so ignore that.

But as you can see each service has min_machines_running = 1.

When I have only 2 VMs in the app, the logs show:

ord [info] Downscaling app brick-drop-co in region ord from 2 machines to 1 machines. Automatically stopping machine 080e442c5405d8

As you can see, it’s my primary region, and it has ports. I know I can connect to it externally.

Luckily, the one it shuts down is the one that starts up fast. Not sure why it picks web as the process group to shut down. Maybe it’s deploy order?

But the minimum isn’t respected, and I’m thinking about setting it to 2.

I’m about to clone, set up HA, and add another region, so I haven’t tried this yet, but I wanted to report the bug for you all.

1 Like

Thank you for this; it revealed a bug on our side. We just shipped a fix for this that should be rolled out. Let me know if you’re still having issues.

1 Like

Of course. I ended up switching to two apps for various reasons. But I am happy to see this is fixed if I ever use process groups again.

1 Like

I have a different issue that I’m trying to fix.
I’ve googled, read the docs and the threads here, but without success.
I have a staging environment to test my application, where I want to stop all machines when there is no load (for cost reduction).

I have the .toml file below. It sets min_machines_running = 0 and destroys idle machines. However, since this staging environment is used by only 1 or 2 devs (usually not even simultaneously), machines seem to be destroyed even before I can send a second request to my http service. It seems the idle delay before shutting down the machines is too short. For apps in a production environment with genuinely concurrent users, that may work fine: when there are no requests, machines can be destroyed immediately. But for staging environments, there should be a minimum idle timeout before destroying machines.

Sometimes I log in and, when I try to click some link in the returned web page, I get disconnected afterwards. My app uses session authentication, so it seems the VMs are destroyed and the session is cleared.

I’ve tried changing type from “requests” to “connections”, but without success.

app = "competeaqui-staging"
primary_region = "gru"

[http_service]
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  [concurrency]
    type = "requests"
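Looking at configs earlier in this thread, the concurrency block is nested under the service section rather than at the top level, so I’m not sure my [concurrency] table is even being picked up. The nested form would look something like this (the limits are just placeholders):

[http_service]
  # ...existing http_service settings...

  [http_service.concurrency]
    type = "requests"
    soft_limit = 25
    hard_limit = 50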

The app works well when I set min_machines_running to 1.
The logs show no errors:

2023-12-11 08:17:38.800 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging Downscaling app competeaqui-staging from 2 machines to 1 machines, stopping machine 5683003a66e08e (region=gru, process group=app)
2023-12-11 08:17:38.803 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging  INFO Sending signal SIGINT to main child process w/ PID 314
2023-12-11 08:17:38.847 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging 2023-12-11T11:17:38.846Z  INFO 314 --- [ionShutdownHook] j.LocalContainerEntityManagerFactoryBean : Closing JPA EntityManagerFactory for persistence unit 'default'
2023-12-11 08:17:38.854 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging 2023-12-11T11:17:38.849Z  INFO 314 --- [ionShutdownHook] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown initiated...
2023-12-11 08:17:38.857 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging 2023-12-11T11:17:38.856Z  INFO 314 --- [ionShutdownHook] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown completed.
2023-12-11 08:17:39.839 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging  INFO Main child exited normally with code: 130
2023-12-11 08:17:39.839 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging  INFO Starting clean up.
2023-12-11 08:17:39.840 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging  WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-12-11 08:17:39.843 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging 2023/12/11 11:17:39 listening on [fdaa:2:29ff:a7b:1f61:f32c:974f:2]:22 (DNS: [fdaa::3]:53)
2023-12-11 08:17:40.840 [competeaqui_fly_io_logs] [INFO] gru eb5d 5683003a66e08e competeaqui-staging [  323.720887] reboot: Restarting system