min_machines_running = 1 not working correctly

I’ve set min_machines_running = 1 in my .toml file to ensure there’s always one instance of my app running, but it doesn’t seem to be working.

I’m trying to run a Directus instance, but the startup time is too slow for it to dynamically scale down to 0. It runs fine for about 6 minutes, then the proxy tries to downscale it, even though I’ve set min_machines_running to 1. I’ve taken inspiration from this example repo: https://github.com/freekrai/directus-fly/blob/main/fly.toml.

My .toml file looks like this:

# fly.toml app configuration file generated for app-name-goes-here on 2023-06-18T07:20:56+01:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = "app-name-goes-here"
kill_signal = "SIGINT"
kill_timeout = 15
primary_region = "lhr"

[env]
  DB_CLIENT="sqlite3"
  DB_FILENAME="/data/database/data.db"
  STORAGE_LOCATIONS="local"
  STORAGE_LOCAL_DRIVER="local"
  STORAGE_LOCAL_ROOT="/data/uploads"
  PUBLIC_URL="https://url-goes-here.fly.dev"
  PORT=8080

[experimental]
  allowed_public_ports = []
  auto_rollback = true
  cmd = "start.sh"
  entrypoint = "sh"

[build]
  dockerfile = ".\\Dockerfile"

[mounts]
  source="directus_data"
  destination="/data"

[[services]]
  internal_port = 8080
  processes = ["app"]
  auto_stop_machines = true
  auto_start_machines = true
  protocol = "tcp"
  min_machines_running = 1
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.http_checks]]
    grace_period = "30s"
    interval = "15s"
    method = "get"
    path = "/server/health"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false
    [services.http_checks.headers]

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

The Dockerfile and start.sh file are very minimal:

FROM directus/directus:10.3

USER node
WORKDIR /directus

COPY . .

CMD ["bash", "start.sh"]
#!/bin/sh
# This file is how Fly starts the server (configured in fly.toml). Before starting
# the server, though, we need to run any migrations that haven't yet been run,
# which is why this file exists in the first place.
# Learn more: https://community.fly.io/t/sqlite-not-getting-setup-properly/4386

set -ex
mkdir -p /data/database
mkdir -p /data/uploads
chmod -Rf 777 /data/database
chmod -Rf 777 /data/uploads


npx directus bootstrap
npx directus start

These are the logs when the machine starts to downscale:

2023-06-19T06:03:29.054 app[3d8d9349b77438] lhr [info] [06:03:29] GET /server/health 200 8ms
2023-06-19T06:03:44.120 app[3d8d9349b77438] lhr [info] [06:03:44] GET /server/health 200 8ms
2023-06-19T06:03:53.515 proxy [3d8d9349b77438] lhr [info] Downscaling app app-name-goes-here in region lhr. Automatically stopping machine 3d8d9349b77438. 2 instances are running, 0 are at soft limit, we only need 1 running
2023-06-19T06:03:53.521 app[3d8d9349b77438] lhr [info] INFO Sending signal SIGINT to main child process w/ PID 521
2023-06-19T06:03:58.677 app[3d8d9349b77438] lhr [info] INFO Sending signal SIGTERM to main child process w/ PID 521
2023-06-19T06:03:58.973 app[3d8d9349b77438] lhr [info] INFO Main child exited with signal (with signal 'SIGTERM', core dumped? false)
2023-06-19T06:03:58.974 app[3d8d9349b77438] lhr [info] INFO Starting clean up.
2023-06-19T06:03:58.974 app[3d8d9349b77438] lhr [info] INFO Umounting /dev/vdb from /data
2023-06-19T06:03:58.975 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:03:59.185 app[3d8d9349b77438] lhr [info] [06:03:59] GET /server/health 200 8ms
2023-06-19T06:03:59.728 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:04:00.480 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:04:01.232 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:04:01.987 app[3d8d9349b77438] lhr [info] WARN hallpass exited, pid: 522, status: signal: 15 (SIGTERM)
2023-06-19T06:04:02.000 app[3d8d9349b77438] lhr [info] 2023/06/19 06:04:01 listening on [fdaa:2:5f23:a7b:13e:8e80:bb03:2]:22 (DNS: [fdaa::3]:53)
2023-06-19T06:04:02.985 app[3d8d9349b77438] lhr [info] [ 487.722219] reboot: Restarting system
2023-06-19T06:04:15.341 health[3d8d9349b77438] lhr [error] Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-06-19T06:04:15.341 health[3d8d9349b77438] lhr [error] Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes. 

Have I missed something in the configuration to keep one instance always alive? I’m not sure why it says there are 2 instances running when I’ve set the scaling count to 1.

Hi @MattClegg

When you say it’s not working correctly, do you mean your app is scaling down to zero Machines (instances)?

If you run fly status, is there only one Machine?

I’m not sure about the unmounting error, but it looks like you have 2 Machines and the Fly Proxy is trying to downscale to 1 per the configuration (auto_stop_machines and min_machines_running).

fly launch creates 2 machines by default for redundancy. More info here:
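
In the meantime, a quick way to sanity-check how many Machines the platform actually knows about (both are standard flyctl commands):

fly status           # overall app status, including each Machine and its state
fly machines list    # one row per Machine, with its ID, region and state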

Hi @andie

Yeah it scales down to zero after several minutes.

When I run fly status I see one Machine. I was expecting that Machine to stay active since I set min_machines_running = 1, unless I’ve misunderstood how that works.

I ran fly scale count 1 after I created the app.

Does it always have the same log message?

2 instances are running, 0 are at soft limit, we only need 1 running

even though you only have 1 Machine? Or is there a different message?

And has your app restarted successfully, or are the failing health checks a separate issue as well?

Yeah, I’ve only ever seen “2 instances are running” logged.

It doesn’t look like it restarted successfully; the app has been in the Suspended state since the timestamps at the end of the log above.

I’ve just tried to access the app: the machine started successfully (after around 30s) and the health checks pass. But the same downscaling occurred again, and the proxy stopped the machine after a few minutes.

The startup logs look like this:

2023-06-20T12:46:10Z proxy[3d8d9349b77438] lhr [info]Starting machine
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Starting init (commit: 0b28cec)...
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Mounting /dev/vdb at /data w/ uid: 1000, gid: 1000 and chmod 0755
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Resized /data to 1069547520 bytes
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Preparing to run: `sh start.sh` as node
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO [fly api proxy] listening at /.fly/api
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]2023/06/20 12:46:10 listening on [fdaa:2:5f23:a7b:13e:8e80:bb03:2]:22 (DNS: [fdaa::3]:53)
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ mkdir -p /data/database
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ mkdir -p /data/uploads
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ chmod -Rf 777 /data/database
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ chmod -Rf 777 /data/uploads
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ npx directus bootstrap
2023-06-20T12:46:10Z proxy[3d8d9349b77438] lhr [info]machine started in 419.128491ms
2023-06-20T12:46:15Z proxy[3d8d9349b77438] lhr [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.122468751s so far)
2023-06-20T12:46:18Z proxy[3d8d9349b77438] lhr [error]failed to connect to machine: gave up after 15 attempts (in 8.13154524s)
2023-06-20T12:46:19Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   ╭───────────────────────────────────────────────────╮
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                 Update available!                 │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                  10.2.1 → 10.3.0                  │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                 1 version behind                  │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                 More information:                 │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │   https://github.com/directus/directus/releases   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   ╰───────────────────────────────────────────────────╯
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.746] INFO: Initializing bootstrap...
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.791] INFO: Database already initialized, skipping install
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.791] INFO: Running migrations...
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.798] INFO: Done
2023-06-20T12:46:26Z app[3d8d9349b77438] lhr [info]+ npx directus start
2023-06-20T12:46:27Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:28Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:35Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:36Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   ╭───────────────────────────────────────────────────╮
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                 Update available!                 │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                  10.2.1 → 10.3.0                  │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                 1 version behind                  │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                 More information:                 │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │   https://github.com/directus/directus/releases   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   ╰───────────────────────────────────────────────────╯
2023-06-20T12:46:41Z app[3d8d9349b77438] lhr [info][12:46:41.775] WARN: Spatialite isn't installed. Geometry type support will be limited.
2023-06-20T12:46:41Z app[3d8d9349b77438] lhr [info][12:46:41.885] INFO: Server started at http://0.0.0.0:8080
2023-06-20T12:46:43Z health[3d8d9349b77438] lhr [info]Health check on port 8080 is now passing.
2023-06-20T12:46:43Z app[3d8d9349b77438] lhr [info][12:46:43] GET /server/health 200 41ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:46] GET /admin 200 2ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /extensions/sources/index.js 200 7ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] POST /auth/refresh 200 26ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /server/info 304 12ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /auth 304 7ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /users/me?fields[]=email&fields[]=first_name&fields[]=last_name&fields[]=last_page 200 13ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /admin/assets/logo-light-7a327cdd.svg 200 12ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /admin/assets/Inter-Black-5ab3de07.woff2 200 3ms
2023-06-20T12:46:57Z health[3d8d9349b77438] lhr [info]Health check on port 8080 is now passing.
2023-06-20T12:46:58Z app[3d8d9349b77438] lhr [info][12:46:58] GET /server/health 200 12ms
2023-06-20T12:47:13Z app[3d8d9349b77438] lhr [info][12:47:13] GET /server/health 200 10ms
2023-06-20T12:47:28Z app[3d8d9349b77438] lhr [info][12:47:28] GET /server/health 200 8ms
2023-06-20T12:47:43Z app[3d8d9349b77438] lhr [info][12:47:43] GET /server/health 200 8ms
...
2023-06-20T12:52:13Z app[3d8d9349b77438] lhr [info][12:52:13] GET /server/health 200 7ms
2023-06-20T12:52:18Z proxy [3d8d9349b77438] lhr [info]Downscaling app app-name-here in region lhr. Automatically stopping machine 3d8d9349b77438. 2 instances are running, 0 are at soft limit, we only need 1 running
2023-06-20T12:52:18Z app[3d8d9349b77438] lhr [info] INFO Sending signal SIGINT to main child process w/ PID 521
2023-06-20T12:52:23Z app[3d8d9349b77438] lhr [info] INFO Sending signal SIGTERM to main child process w/ PID 521
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info] INFO Main child exited with signal (with signal 'SIGTERM', core dumped? false)
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info] INFO Starting clean up.
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info] INFO Umounting /dev/vdb from /data
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:25Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:26Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:27Z app[3d8d9349b77438] lhr [info] WARN hallpass exited, pid: 522, status: signal: 15 (SIGTERM)
2023-06-20T12:52:27Z app[3d8d9349b77438] lhr [info]2023/06/20 12:52:27 listening on [fdaa:2:5f23:a7b:13e:8e80:bb03:2]:22 (DNS: [fdaa::3]:53)
2023-06-20T12:52:28Z app[3d8d9349b77438] lhr [info][  377.594682] reboot: Restarting system
2023-06-20T12:52:30Z health[3d8d9349b77438] lhr [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-06-20T12:52:44Z health[3d8d9349b77438] lhr [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.

This could be an issue on our end. However, while we’re working on that, you could try scaling down to zero and redeploying to see if you can get rid of the issue where the proxy thinks you have 2 instances.

It’s a good idea to update flyctl as well if you haven’t done that already: fly version update

I tried setting min_machines_running = 0 and redeployed, but the logs still say “Downscaling app app-name-here in region lhr. Automatically stopping machine 3d8d9349b77438. 2 instances are running, 0 are at soft limit, we only need 1 running”. Is that what you meant by scaling down to zero?

I updated my fly version and I’m running v0.1.39.

Sorry I wasn’t totally clear, and I believe we’re working on a bug that might be causing this.

But it might be worth a try to get the app to reset so that Fly Proxy knows how many Machines your app has (right now it thinks there are 2 Machines, but there’s only 1, which is weird!).

You could destroy the existing Machine:

fly scale count app=0

This will scale the Machines in the default app process down to zero.

Then run fly deploy. This will probably create 2 machines for redundancy. If you really only want 1 Machine, then use fly deploy --ha=false.

I’m hoping, at the very least, that this will make the proxy know how many Machines there are. :slight_smile:
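
Roughly, the whole sequence would look like this (just a sketch; flyctl reads the app name from your fly.toml, and app is the default process group name):

fly scale count app=0    # scale the 'app' process group to zero, destroying the existing Machine
fly deploy --ha=false    # redeploy with a single Machine instead of the default two
fly status               # confirm the platform now sees exactly one Machine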

Something was definitely broken with my setup! Running the scale command returned this:

App 'app-name-here' is going to be scaled according to this plan:
  -1 machines for group 'app' on region 'lhr' with size 'shared-cpu-1x'

I redeployed with the --ha=false flag and it seems to be up and running now with no scaling down! Thanks for your help, and I hope you manage to fix the bug soon!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.