Explain `restart_limit` in docs

Can you please explain restart_limit (from services.tcp_checks) in the docs? The fly.toml page does not mention it.

It’s the number of times we’ll try to restart your app after a crash before giving up. If it’s not set, Fly will try to restart the app indefinitely.

The docs now say: “The number of consecutive TCP check failures to allow before attempting to restart the VM. The default is 0, which disables restarts based on failed TCP health checks.”

Does this mean that checks do nothing by default? Is this a good default?

I wouldn’t say the checks do nothing. If you want a check to do nothing, you would not add one to your fly.toml at all. That would be the do-nothing option: Fly would not know to run any health check, so it won’t. :slight_smile:

Next, if you add a health check but don’t specify restart_limit, the app won’t deploy if that health check fails, because the check runs as part of the deploy. So that’s where it applies and does something.

And then if you specify a restart_limit value, you can choose how many failures to allow. Some apps may be expected to fail once or twice without needing a VM restart to resolve it, for example if they make external calls, which restarting the VM would not fix. But if it’s, say, a Node.js crash, you would want the VM restarted after just one failure.
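
To make that concrete, here is a rough sketch of a TCP check that tolerates a few failures before a restart. The value 3 is an arbitrary number picked for illustration, not a recommendation, and the other values are simply the ones from the config posted further down in this thread. Going by the docs wording quoted above, the limit counts consecutive TCP check failures.

```toml
# Sketch only: restart_limit = 3 is an assumed, illustrative value.
[[services.tcp_checks]]
  grace_period = "1s"
  interval = "15s"
  restart_limit = 3   # allow up to 3 consecutive failed TCP checks
  timeout = "2s"
```

For an app where a single failure should already trigger a restart, the same block with restart_limit = 1 would be the variant to try.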

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app", "consumer"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80
    force_https = true

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

My restart_limit is 0; however, in the logs I see the following:

2023-08-31T16:12:27.247 runner[…] fra [info] machine has reached its max restart count (10)

Why would it restart 10 times when 0 was specified?

This is specifically from an error being thrown in a Node process.

Hi @moishinetzer

The restart_limit setting only applies to V1 (Nomad) apps. The log refers to a “machine”, so your app is V2.
What you’re seeing is the result of the default Machine restart policy. The default is to keep attempting a restart up to 10 times after a failure. Here’s some info about Machine restart policies: Issues with machines restart policy - #25 by catflydotio

Thank you for providing that info!

Is it planned to support setting the new V2 app equivalent of that setting, the machine restart.policy, via the fly.toml config file? I think this would be really handy.