min_machines_running = 1 not working correctly

I’ve set min_machines_running = 1 in my .toml file to ensure there’s always one instance of my app running, but it doesn’t seem to be working.

I’m trying to run a Directus instance, but the startup time is too slow for it to dynamically scale down to 0. It runs fine for about 6 minutes, then the proxy tries to downscale it, even though I’ve set min_machines_running to 1. I’ve taken inspiration from this example repo: https://github.com/freekrai/directus-fly/blob/main/fly.toml.

My .toml file looks like this:

# fly.toml app configuration file generated for app-name-goes-here on 2023-06-18T07:20:56+01:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = "app-name-goes-here"
kill_signal = "SIGINT"
kill_timeout = 15
primary_region = "lhr"

[env]
  DB_CLIENT="sqlite3"
  DB_FILENAME="/data/database/data.db"
  STORAGE_LOCATIONS="local"
  STORAGE_LOCAL_DRIVER="local"
  STORAGE_LOCAL_ROOT="/data/uploads"
  PUBLIC_URL="https://url-goes-here.fly.dev"
  PORT=8080

[experimental]
  allowed_public_ports = []
  auto_rollback = true
  cmd = "start.sh"
  entrypoint = "sh"

[build]
  dockerfile = ".\\Dockerfile"

[mounts]
  source="directus_data"
  destination="/data"

[[services]]
  internal_port = 8080
  processes = ["app"]
  auto_stop_machines = true
  auto_start_machines = true
  protocol = "tcp"
  min_machines_running = 1
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.http_checks]]
    grace_period = "30s"
    interval = "15s"
    method = "get"
    path = "/server/health"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false
    [services.http_checks.headers]

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

The Dockerfile and start.sh file are very minimal:

FROM directus/directus:10.3

USER node
WORKDIR /directus

COPY . .

CMD ["bash", "start.sh"]
#!/bin/sh
# This file is how Fly starts the server (configured in fly.toml). Before starting
# the server, though, we need to run any migrations that haven't yet been run,
# which is why this file exists in the first place.
# Learn more: https://community.fly.io/t/sqlite-not-getting-setup-properly/4386

set -ex
mkdir -p /data/database
mkdir -p /data/uploads
chmod -Rf 777 /data/database
chmod -Rf 777 /data/uploads


npx directus bootstrap
npx directus start

These are the logs when the machine starts to downscale:

2023-06-19T06:03:29.054 app[3d8d9349b77438] lhr [info] [06:03:29] GET /server/health 200 8ms
2023-06-19T06:03:44.120 app[3d8d9349b77438] lhr [info] [06:03:44] GET /server/health 200 8ms
2023-06-19T06:03:53.515 proxy [3d8d9349b77438] lhr [info] Downscaling app app-name-goes-here in region lhr. Automatically stopping machine 3d8d9349b77438. 2 instances are running, 0 are at soft limit, we only need 1 running
2023-06-19T06:03:53.521 app[3d8d9349b77438] lhr [info] INFO Sending signal SIGINT to main child process w/ PID 521
2023-06-19T06:03:58.677 app[3d8d9349b77438] lhr [info] INFO Sending signal SIGTERM to main child process w/ PID 521
2023-06-19T06:03:58.973 app[3d8d9349b77438] lhr [info] INFO Main child exited with signal (with signal 'SIGTERM', core dumped? false)
2023-06-19T06:03:58.974 app[3d8d9349b77438] lhr [info] INFO Starting clean up.
2023-06-19T06:03:58.974 app[3d8d9349b77438] lhr [info] INFO Umounting /dev/vdb from /data
2023-06-19T06:03:58.975 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:03:59.185 app[3d8d9349b77438] lhr [info] [06:03:59] GET /server/health 200 8ms
2023-06-19T06:03:59.728 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:04:00.480 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:04:01.232 app[3d8d9349b77438] lhr [info] ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-19T06:04:01.987 app[3d8d9349b77438] lhr [info] WARN hallpass exited, pid: 522, status: signal: 15 (SIGTERM)
2023-06-19T06:04:02.000 app[3d8d9349b77438] lhr [info] 2023/06/19 06:04:01 listening on [fdaa:2:5f23:a7b:13e:8e80:bb03:2]:22 (DNS: [fdaa::3]:53)
2023-06-19T06:04:02.985 app[3d8d9349b77438] lhr [info] [ 487.722219] reboot: Restarting system
2023-06-19T06:04:15.341 health[3d8d9349b77438] lhr [error] Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-06-19T06:04:15.341 health[3d8d9349b77438] lhr [error] Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes. 

Have I missed something in the configuration to keep one instance always alive? I’m not sure why it says there are 2 instances running when I’ve set the scaling count to 1.

Hi @MattClegg

When you say it’s not working correctly, do you mean your app is scaling down to zero Machines (instances)?

If you run fly status, is there only one Machine?

I’m not sure about the unmounting error, but it looks like you have 2 Machines and the Fly Proxy is trying to downscale to 1 per the configuration (auto_stop_machines and min_machines_running).

fly launch creates 2 machines by default for redundancy. More info here:
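
In the meantime, a quick way to sanity-check how many Machines the platform actually knows about (both are standard flyctl commands):

fly status           # overall app status, including each Machine and its state
fly machines list    # one row per Machine, with its ID, region and state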

Hi @andie

Yeah it scales down to zero after several minutes.

When I run fly status I see one Machine. I was expecting that Machine to stay active since I set min_machines_running = 1, unless I’ve misunderstood how that works.

I ran fly scale count 1 after I created the app.

Does it always have the same log message?

2 instances are running, 0 are at soft limit, we only need 1 running

even though you only have 1 Machine? Or is there a different message?

And has your app restarted successfully, or are the failing health checks a separate issue as well?

Yeah, I’ve only ever seen “2 instances are running” logged.

It doesn’t look like it restarted successfully; the app has been in the Suspended state since the timestamps at the end of the log above.

I’ve just tried to access the app: the machine started successfully (after around 30s) and the health checks pass. But the same downscaling occurred again, and the proxy stopped the machine after a few minutes.

The startup logs look like this:

2023-06-20T12:46:10Z proxy[3d8d9349b77438] lhr [info]Starting machine
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Starting init (commit: 0b28cec)...
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Mounting /dev/vdb at /data w/ uid: 1000, gid: 1000 and chmod 0755
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Resized /data to 1069547520 bytes
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO Preparing to run: `sh start.sh` as node
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info] INFO [fly api proxy] listening at /.fly/api
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]2023/06/20 12:46:10 listening on [fdaa:2:5f23:a7b:13e:8e80:bb03:2]:22 (DNS: [fdaa::3]:53)
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ mkdir -p /data/database
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ mkdir -p /data/uploads
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ chmod -Rf 777 /data/database
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ chmod -Rf 777 /data/uploads
2023-06-20T12:46:10Z app[3d8d9349b77438] lhr [info]+ npx directus bootstrap
2023-06-20T12:46:10Z proxy[3d8d9349b77438] lhr [info]machine started in 419.128491ms
2023-06-20T12:46:15Z proxy[3d8d9349b77438] lhr [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.122468751s so far)
2023-06-20T12:46:18Z proxy[3d8d9349b77438] lhr [error]failed to connect to machine: gave up after 15 attempts (in 8.13154524s)
2023-06-20T12:46:19Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   ╭───────────────────────────────────────────────────╮
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                 Update available!                 │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                  10.2.1 → 10.3.0                  │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                 1 version behind                  │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                 More information:                 │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │   https://github.com/directus/directus/releases   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:19Z app[3d8d9349b77438] lhr [info]   ╰───────────────────────────────────────────────────╯
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.746] INFO: Initializing bootstrap...
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.791] INFO: Database already initialized, skipping install
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.791] INFO: Running migrations...
2023-06-20T12:46:25Z app[3d8d9349b77438] lhr [info][12:46:24.798] INFO: Done
2023-06-20T12:46:26Z app[3d8d9349b77438] lhr [info]+ npx directus start
2023-06-20T12:46:27Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:28Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:35Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:36Z proxy[3d8d9349b77438] lhr [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   ╭───────────────────────────────────────────────────╮
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                 Update available!                 │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                  10.2.1 → 10.3.0                  │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                 1 version behind                  │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                 More information:                 │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │   https://github.com/directus/directus/releases   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   │                                                   │
2023-06-20T12:46:37Z app[3d8d9349b77438] lhr [info]   ╰───────────────────────────────────────────────────╯
2023-06-20T12:46:41Z app[3d8d9349b77438] lhr [info][12:46:41.775] WARN: Spatialite isn't installed. Geometry type support will be limited.
2023-06-20T12:46:41Z app[3d8d9349b77438] lhr [info][12:46:41.885] INFO: Server started at http://0.0.0.0:8080
2023-06-20T12:46:43Z health[3d8d9349b77438] lhr [info]Health check on port 8080 is now passing.
2023-06-20T12:46:43Z app[3d8d9349b77438] lhr [info][12:46:43] GET /server/health 200 41ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:46] GET /admin 200 2ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /extensions/sources/index.js 200 7ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] POST /auth/refresh 200 26ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /server/info 304 12ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /auth 304 7ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /users/me?fields[]=email&fields[]=first_name&fields[]=last_name&fields[]=last_page 200 13ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /admin/assets/logo-light-7a327cdd.svg 200 12ms
2023-06-20T12:46:47Z app[3d8d9349b77438] lhr [info][12:46:47] GET /admin/assets/Inter-Black-5ab3de07.woff2 200 3ms
2023-06-20T12:46:57Z health[3d8d9349b77438] lhr [info]Health check on port 8080 is now passing.
2023-06-20T12:46:58Z app[3d8d9349b77438] lhr [info][12:46:58] GET /server/health 200 12ms
2023-06-20T12:47:13Z app[3d8d9349b77438] lhr [info][12:47:13] GET /server/health 200 10ms
2023-06-20T12:47:28Z app[3d8d9349b77438] lhr [info][12:47:28] GET /server/health 200 8ms
2023-06-20T12:47:43Z app[3d8d9349b77438] lhr [info][12:47:43] GET /server/health 200 8ms
...
2023-06-20T12:52:13Z app[3d8d9349b77438] lhr [info][12:52:13] GET /server/health 200 7ms
2023-06-20T12:52:18Z proxy [3d8d9349b77438] lhr [info]Downscaling app app-name-here in region lhr. Automatically stopping machine 3d8d9349b77438. 2 instances are running, 0 are at soft limit, we only need 1 running
2023-06-20T12:52:18Z app[3d8d9349b77438] lhr [info] INFO Sending signal SIGINT to main child process w/ PID 521
2023-06-20T12:52:23Z app[3d8d9349b77438] lhr [info] INFO Sending signal SIGTERM to main child process w/ PID 521
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info] INFO Main child exited with signal (with signal 'SIGTERM', core dumped? false)
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info] INFO Starting clean up.
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info] INFO Umounting /dev/vdb from /data
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:24Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:25Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:26Z app[3d8d9349b77438] lhr [info]ERROR error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-06-20T12:52:27Z app[3d8d9349b77438] lhr [info] WARN hallpass exited, pid: 522, status: signal: 15 (SIGTERM)
2023-06-20T12:52:27Z app[3d8d9349b77438] lhr [info]2023/06/20 12:52:27 listening on [fdaa:2:5f23:a7b:13e:8e80:bb03:2]:22 (DNS: [fdaa::3]:53)
2023-06-20T12:52:28Z app[3d8d9349b77438] lhr [info][  377.594682] reboot: Restarting system
2023-06-20T12:52:30Z health[3d8d9349b77438] lhr [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-06-20T12:52:44Z health[3d8d9349b77438] lhr [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.

This could be an issue on our end. However, while we’re working on that, you could try scaling down to zero and redeploying to see if you can get rid of the issue where the proxy thinks you have 2 instances.

It’s a good idea to update flyctl as well if you haven’t done that already: fly version update

I tried setting min_machines_running = 0 and redeployed, but the logs still say “Downscaling app app-name-here in region lhr. Automatically stopping machine 3d8d9349b77438. 2 instances are running, 0 are at soft limit, we only need 1 running”. Is that what you meant by scaling down to zero?

I updated my fly version and I’m running v0.1.39.

Sorry I wasn’t totally clear, and I believe we’re working on a bug that might be causing this.

But it might be worth a try to get the app to reset so that Fly Proxy knows how many Machines your app has (right now it thinks there are 2 Machines, but there’s only 1, which is weird!).

You could destroy the existing Machine:

fly scale count app=0

This will scale the Machines in the default app process down to zero.

Then run fly deploy. This will probably create 2 machines for redundancy. If you really only want 1 Machine, then use fly deploy --ha=false.

I’m hoping, at the very least, that this will make the proxy know how many Machines there are. :slight_smile:
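
Roughly, the whole sequence would look like this (just a sketch; flyctl reads the app name from your fly.toml, and app is the default process group name):

fly scale count app=0    # scale the 'app' process group to zero, destroying the existing Machine
fly deploy --ha=false    # redeploy with a single Machine instead of the default two
fly status               # confirm the platform now sees exactly one Machine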

Something was definitely broken with my setup! Running the scale command returned this:

App 'app-name-here' is going to be scaled according to this plan:
  -1 machines for group 'app' on region 'lhr' with size 'shared-cpu-1x'

I redeployed with the --ha=false flag and it seems to be up and running now with no scaling down! Thanks for your help, and I hope you manage to fix the bug soon!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.