scale count 15 but eventually no instances running (503 error)

Hi, it seems that an implicit restart policy is triggering restarts of my app’s instances very frequently. This is my current scale status:

VM Resources for sync-server
        VM Size: dedicated-cpu-1x
      VM Memory: 2 GB
          Count: 15
 Max Per Region: Not set

This is the services section of my fly-production.toml:

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 200
    soft_limit = 170
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "5s"
    interval = "30s"
    restart_limit = 0
    timeout = "10s"

VM memory is not an issue (<10% at all times). The app is currently very slow (an update is in the works), so I suspect some HTTP/TCP check is triggering the restarts.

What I see happening is that Fly restarts the instance (not intentionally on my part), and when a certain number of restarts is reached (some stopped at 4, others at 9), it kills the instance and provisions a fresh one. This happens to every instance, all the time, and eventually Fly stops providing fresh instances, at which point I see the 503 error: “no instances to route to”.

If you’re talking about the restarts that show in fly status, those are not always triggered by us. That counter means the app process exited and we started it back up.

The only time we do trigger restarts is if health checks fail repeatedly. You can disable that by adding restart_limit = 0 to the health check in services.
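For reference, a minimal sketch of where that setting goes in fly.toml (shown here on a TCP check; the same field applies to other check types):

  [[services.tcp_checks]]
    interval = "30s"
    timeout = "10s"
    # 0 disables health-check-triggered restarts
    restart_limit = 0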

When there are multiple restarts in a specific interval, we replace the whole VM.

Is it possible the 10s timeout on the tcp check is too low?
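If the app is slow to accept connections, a sketch of a more forgiving check might look like this (the values below are illustrative, not recommendations):

  [[services.tcp_checks]]
    grace_period = "30s"   # give the app longer to boot before checks count
    interval = "30s"
    timeout = "20s"        # allow slower responses before the check fails
    restart_limit = 0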

If you run fly vm status <id> on one of those instances, you should be able to tell if the restart was because the process exited, or because the VM wasn’t healthy.

Hi Kurt, thanks for your answer. Yes, my app’s instances were restarting due to an unhandled exception caused by a specific timeout we set on our database, which the app connects to. We’ll fix that.
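As a general pattern, the fix is to catch the timeout at the request boundary so one slow query returns an error response instead of crashing the whole process (which Fly counts as a restart). A minimal Python sketch, assuming a hypothetical `query_db` call and `DBTimeoutError` exception (the original app’s stack isn’t specified):

```python
class DBTimeoutError(Exception):
    """Raised when a database query exceeds its configured timeout."""


def query_db(simulate_timeout=False):
    # Stand-in for a real database call that can hit a server-side
    # statement timeout; a real driver raises its own exception type.
    if simulate_timeout:
        raise DBTimeoutError("query exceeded statement timeout")
    return {"rows": []}


def handle_request(simulate_timeout=False):
    # Catch the timeout here so a slow query produces a 503 response
    # for that one request, rather than an unhandled exception that
    # exits the process.
    try:
        return {"status": 200, "body": query_db(simulate_timeout)}
    except DBTimeoutError as exc:
        return {"status": 503, "body": str(exc)}
```

With this in place, a DB timeout degrades a single request instead of taking the instance down and incrementing the restart counter.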