Explain `restart_limit` in docs

ChristianSiegert · August 5, 2021, 3:29pm

Can you please explain restart_limit (from services.tcp_checks) in the docs? The fly.toml page does not mention it.

rugwiro · August 6, 2021, 8:08am

It’s the number of times we’ll try to restart your app after a crash before giving up. If it’s not set, Fly will try to restart the app indefinitely.

morse · June 13, 2022, 2:15pm

The docs now say “The number of consecutive TCP check failures to allow before attempting to restart the VM. The default is 0, which disables restarts based on failed TCP health checks.”

Does this mean that checks do nothing by default? Is this a good default?

greg · June 13, 2022, 2:31pm

I wouldn’t say the checks do nothing. If you want a check to do nothing, you would not add one at all in your fly.toml. That would be the do-nothing option: Fly would not know about any healthcheck, so it won’t run one.

Next, if you add a healthcheck but don’t specify restart_limit, the app won’t deploy if that healthcheck fails, because the check runs as part of the deploy. So that’s where it applies and does something.

And then if you specify a restart_limit value, you can choose how many failures to allow. Some apps may be expected to fail a check once or twice without needing a VM restart to resolve it, for example if they make external calls, which restarting the VM would not fix. But if it’s, e.g., a Node.js crash, you would need the VM restarted after just one failure.
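
As a rough sketch of that last case, assuming a V1 app and the usual services.tcp_checks keys, a check that tolerates a couple of failed probes before a restart might look something like this. The values (3, "15s", and so on) are only illustrative, not recommendations:

```toml
[[services.tcp_checks]]
  grace_period = "1s"   # wait this long after boot before the first check
  interval = "15s"      # run the check every 15 seconds
  timeout = "2s"        # a check that takes longer than this counts as failed
  restart_limit = 3     # illustrative: consecutive failures to allow before a VM restart is attempted
```

Leaving restart_limit out (or at 0) would keep the check but, per the docs quoted above, disable restarts based on it.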

moishinetzer · August 31, 2023, 4:16pm

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app", "consumer"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80
    force_https = true

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

My restart_limit is 0; however, in the logs I see the following:

2023-08-31T16:12:27.247 runner[…] fra [info] machine has reached its max restart count (10)

Why would it restart 10 times when 0 was specified?

This is specifically from an error being thrown in a Node process.

andie · August 31, 2023, 5:11pm

Hi @moishinetzer

The restart_limit setting only applies to V1 (Nomad) apps. The log refers to a “machine”, so your app is V2.
What you’re seeing is the result of the default Machine restart policy. The default is to keep attempting a restart up to 10 times after a failure. Here’s some info about Machine restart policies: Issues with machines restart policy - #25 by catflydotio

salimb · November 25, 2023, 7:56pm

Thank you for providing that info!

Is it planned to support setting the new V2 app equivalent of that setting, the machine restart.policy, via the fly.toml config file? I think this would be really handy.
