kill_signal set in fly.toml is not applied

Hi! I’ve set

kill_signal = "SIGTERM"
kill_timeout = 30

in fly.toml, but in my logs I’m seeing:

[info] INFO Sending signal SIGINT to main child process w/ PID 323
…30 seconds
[info] INFO Sending signal SIGTERM to main child process w/ PID 323
[warn] Virtual machine exited abruptly

(I am trapping SIGTERM in my execution script and I guess that the actions there have not had enough time to land)

The behaviour I would have expected is

no SIGINT
SIGTERM
30 seconds
SIGKILL

Am I missing something?

Documentation says:

kill_signal option

When shutting down a Fly Machine, by default, Fly.io sends a SIGINT signal to the running
process. Typically this triggers a hard shutdown option on most applications. The kill_signal
option allows you to change what signal is sent so that you can trigger a softer, less disruptive
shutdown. Options are SIGINT (default), SIGTERM, SIGQUIT, SIGUSR1, SIGUSR2,
SIGKILL, or SIGSTOP. For example, to set the kill signal to SIGTERM, you would add:

kill_signal = "SIGTERM"

We are using fly machines restart to restart the machine.

Hm… I can reproduce this only with fly m restart—not with fly m stop.
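
For reference, these are the two commands being compared (the machine ID is a placeholder):

fly m restart <machine-id>   # observed: SIGINT first, then SIGTERM after the timeout
fly m stop <machine-id>      # observed: goes straight to the configured SIGTERM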

(Maybe it’s this only-on-restarts aspect that is the new piece of information, relative to your post in July? I think it’s best if these are structured as a continuous flow of conversation, with everyone’s contributions magnifying the others’, rather than arriving scattershot.)

Minimal repro. fly.toml:

app = "thirty"
primary_region = "ewr"
kill_signal = "SIGTERM"

[[restart]]
  policy = "no"

Dockerfile:

FROM debian:bookworm-slim

COPY --chmod=755 thirty /usr/local/bin/

CMD ["thirty"]

thirty (the container entrypoint script):

#!/bin/bash -eup

echo thirty

# log which signal arrived to stderr (the short sleep works around the vsock "Broken pipe" noise mentioned in the aside below)
function l() { echo 30: "$1" 1>&2;  sleep 0.1; }

# only the SIGTERM handler exits; SIGINT is logged and otherwise ignored
trap 'l sigint'           SIGINT
trap 'l sigterm;  exit 0' SIGTERM

# print the installed traps at startup
trap -p

while true; do sleep 0.1; done

And then, with fly m restart, the logs read…

22:59:59Z app[28*] ewr [info] INFO Sending signal SIGINT to main child process w/ PID 321
22:59:59Z app[28*] ewr [info]30: sigint
23:00:04Z app[28*] ewr [info] INFO Sending signal SIGTERM to main child process w/ PID 321
23:00:04Z app[28*] ewr [info]30: sigterm
23:00:05Z app[28*] ewr [info] INFO Main child exited normally with code: 0

Whereas fly m stop goes straight to SIGTERM

23:01:36Z app[28*] ewr [info] INFO Sending signal SIGTERM to main child process w/ PID 323
23:01:36Z app[28*] ewr [info]30: sigterm
23:01:37Z app[28*] ewr [info] INFO Main child exited normally with code: 0

But these cases should really be the same. It’s hard to think of a reason why stop and restart would have different shutdown mechanisms…


Aside: The odd-looking sleep 0.1 in the l function avoids a distracting "stderr to vsock zero copy err: Broken pipe" message, which I don’t think is related. (Others have reported it as well.)

Added tags: duplicated, machines

Yeah, sorry, I couldn’t figure out how to resurrect the earlier topic since it died after 7 days.

For our use case we don’t particularly need fly m stop; we need fly m restart to use SIGTERM, and I’d expect the two to behave the same. In our case, using SIGINT precludes an important cleanup step, and our operational deploys get messed up when we reconfigure our Fly nodes.

Can you provide an app name or machine ID? I checked the logic involved with fly m restart, and it does the following to determine which signal to send to the machine:

  • defaults to SIGINT (which you are seeing)
  • checks whether the machine config has a stop signal configured (which you are not seeing, but should; one way to inspect this yourself is sketched below)
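
(A rough sketch of that self-check via the Machines API; FLY_API_TOKEN, APP, and MACHINE_ID are placeholders, and the grep just surfaces whichever stop-related key the response uses:)

# FLY_API_TOKEN can come from `fly auth token`; APP and MACHINE_ID are placeholders
curl -s -H "Authorization: Bearer ${FLY_API_TOKEN}" \
  "https://api.machines.dev/v1/apps/${APP}/machines/${MACHINE_ID}" \
  | jq '.config' | grep -i -A2 '"stop'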

olivaw-sandworm-2 / 908017eeb1e478

Thanks!

The machine has the following stop config:

      "stop": {
        "timeout": "30s",
        "signal": "SIGNAL_SIGINT"
      }

which explains the behavior you are seeing.

Given you referenced a fly.toml, I assume these machines were created via fly deploy?

Yes. This is the head of our fly.toml:

app = 'olivaw-sandworm-2'
primary_region = 'sjc'
kill_timeout = '30s'
kill_signal = "SIGTERM"

[build]
  ...

Oh, sorry, we redeployed with the old configuration. I will fly machines restart with the new configuration.

fly machine restart doesn’t modify the configuration of existing machines. If you had previously deployed via fly deploy, then changed the fly.toml to have a specific kill_signal, you would need to fly deploy again in order to replace the existing machines with a new version containing the updated stop signal.
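
Put as a sequence, that would look roughly like this (using the app and machine from this thread; the --display-config flag used for verification is an assumption about current flyctl):

# 1. set kill_signal / kill_timeout in fly.toml, then redeploy so new machines pick it up
fly deploy -a olivaw-sandworm-2
# 2. check that the machine's stop config was updated (flag name assumed)
fly machine status 908017eeb1e478 -a olivaw-sandworm-2 --display-config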

Yep, sorry, I meant fly deploy. I’ve triggered a redeploy. This is one of our UAT environment boxes, and we don’t have any activity scheduled for today, so if you’d like to proactively test restarting the instance, please feel free to.

I checked the machine and it now has the correct configuration for stop:

      "stop": {
        "timeout": "30s",
        "signal": "SIGNAL_SIGTERM"
      }
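
(As a final check, one way to confirm end-to-end that the restart path now sends SIGTERM, reusing the commands and expected log line from earlier in the thread:)

fly logs -a olivaw-sandworm-2 &          # stream logs in the background
fly m restart 908017eeb1e478 -a olivaw-sandworm-2
# expect: "Sending signal SIGTERM to main child process ..." with no preceding SIGINT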