I have a machine for a cron process that needs to stay online all the time:
[processes]
cron = "supercronic /app/crontab"
However, when I deploy my app (fly deploy), I can see the following in my logs:
reboot: Restarting system
Yet the machine never restarts; I have to start it manually with fly machines start --select.
Is there a way to make sure that the machine restarts as expected? I was searching for how to set the so-called “restart policy” in my fly.toml, but that seems to be a config for the app, rather than for a specific machine.
I tried the above approach with no luck; it is already set to “always”.
$ fly m update 1781155ad41189 --restart always
Error: no config changes found
And to confirm, the current state shows “paused” with the following logs:
2023-07-10T19:41:48.400 runner[1781155ad41189] iad [info] Configuring firecracker
2023-07-10T19:41:48.405 app[1781155ad41189] iad [info] INFO Sending signal SIGINT to main child process w/ PID 233
2023-07-10T19:41:48.834 app[1781155ad41189] iad [info] INFO Main child exited with signal (with signal 'SIGINT', core dumped? false)
2023-07-10T19:41:48.835 app[1781155ad41189] iad [info] INFO Starting clean up.
2023-07-10T19:41:48.841 app[1781155ad41189] iad [info] WARN hallpass exited, pid: 234, status: signal: 15 (SIGTERM)
2023-07-10T19:41:48.854 app[1781155ad41189] iad [info] 2023/07/10 19:41:48 listening on [fdaa:1:b784:a7b:ab8:6033:4b56:2]:22 (DNS: [fdaa::3]:53)
2023-07-10T19:41:49.832 app[1781155ad41189] iad [info] [ 1786.244503] reboot: Restarting system
I’ve been having the same problem. Here is a detailed explanation that never got any traction.
Thanks for this extra info. Maybe the always restart policy is actually the problem here. If you run fly machine status <machine ID> you should be able to see what the exit code is when the Machine stops, which might help with figuring out the problem.
Does the worker Machine that actually works also have a restart policy of always?
Thanks for the reply @andie, here is the result of running the above command while the machine is in the limbo state:
Machine ID: 1781155ad41189
Instance ID: 01H50N5A42HMMM8ZJ23BFAH65K
State: stopped
VM
ID = 1781155ad41189
Instance ID = 01H50N5A42HMMM8ZJ23BFAH65K
State = stopped
Image = client-portal-development:deployment-01H50MX50ASGGV1BEW1WJT3S74
Name = crimson-wind-9803
Private IP = fdaa:1:b784:a7b:ab8:6033:4b56:2
Region = iad
Process Group = worker-pdf
CPU Kind = shared
vCPUs = 1
Memory = 2048
Created = 2023-06-29T20:29:19Z
Updated = 2023-07-10T19:41:50Z
Entrypoint =
Command = ["yarn","worker:pdf"]
Event Logs
STATE    EVENT   SOURCE  TIMESTAMP                      INFO
stopped  update  flyd    2023-07-10T14:41:50.31-05:00
created  launch  user    2023-07-10T14:41:23.469-05:00
Would the status show under the INFO column of Event Logs?
The other machine does not appear to have a restart policy.
I do believe the currently failing machine also didn’t come with a restart policy; I just added it in an attempt to fix this issue.
UPDATE: not to repeat myself too much, but I just created a new environment and it is happening to both of the workers I have been discussing here. Both have no restart policy and are stuck in this limbo state.
Thanks for providing the output. When you run fly machine status <machine ID> do the event logs ever show starting or started events at all for any Machine?
Event Logs
STATE     EVENT   SOURCE  TIMESTAMP                      INFO
started   start   flyd    2023-07-11T17:35:17.37-04:00
starting  start   user    2023-07-11T17:35:17.052-04:00
stopped   update  flyd    2023-07-11T17:24:16.18-04:00
created   launch  user    2023-07-11T17:23:59.745-04:00
Status after deploy:
Event Logs
STATE    EVENT   SOURCE  TIMESTAMP                      INFO
stopped  update  flyd    2023-07-12T12:01:45.771-04:00
created  launch  user    2023-07-12T12:01:42.518-04:00
Status after fly m start --select:
Event Logs
STATE     EVENT   SOURCE  TIMESTAMP                      INFO
started   start   flyd    2023-07-12T12:03:29.264-04:00
starting  start   user    2023-07-12T12:03:28.918-04:00
stopped   update  flyd    2023-07-12T12:01:45.771-04:00
created   launch  user    2023-07-12T12:01:42.518-04:00
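For anyone who wants to script this check: the event logs above list the newest event first, so the first row under the header shows whether the Machine came back after a deploy. Here is a minimal shell sketch, assuming the output of fly machine status <machine ID> keeps the layout pasted in this thread. latest_event_state is a made-up helper name, not part of flyctl.

```shell
# latest_event_state: read `fly machine status` output on stdin and print the
# STATE of the newest event, i.e. the first row after the Event Logs header.
# Assumption: the header row starts with "STATE EVENT", as in the pasted output.
latest_event_state() {
    awk '
        seen_header { print $1; exit }                      # first row after the header
        $1 == "STATE" && $2 == "EVENT" { seen_header = 1 }  # header row found
    '
}

# Usage (needs flyctl and a real Machine ID):
#   fly machine status <machine ID> | latest_event_state
```

If this prints stopped right after a deploy, the Machine is in the limbo state described above.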
@bryantbrock It looks as though the Machine that’s not starting began life as a standby machine. Standby Machines don’t start automatically unless the Machine they’re a standby for goes down.
@maxime1 Might this be the case with your Machine that’s not starting as well?
You can check this with fly status and see if there’s a line like:
app†  9e784e26c01183  7  yyz  stopped  2023-05-21T15:42:00Z
† Standby machine (it will take over only in case of host hardware failure)
You can remove the standby status on a Machine with fly m update <vm-id> --standby-for="". Alternatively, if you want the peace of mind of having the standby, scale that process group to 0 and run fly deploy again to get one active Machine and a standby for it.
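If you have several apps to check, the fly status check described above can be scripted. This is a rough sketch that pulls out the IDs of rows carrying the † marker. It assumes the marker sits directly after the process group name and that the legend line contains the words "Standby machine", as in the sample. standby_ids is a hypothetical helper name.

```shell
# standby_ids: read `fly status` output on stdin and print the ID of every
# Machine whose row carries the standby dagger.
standby_ids() {
    # 1. drop the legend line, which also contains the dagger
    # 2. keep only rows with the dagger
    # 3. strip everything up to and including the dagger
    # 4. print the first remaining field, which is the Machine ID
    grep -v 'Standby machine' | grep '†' | sed 's/^.*†[[:space:]]*//' | awk '{ print $1 }'
}

# Usage (needs flyctl, run from the app directory):
#   fly status | standby_ids
```

No output means no row was marked as a standby.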
OMG THANK YOU! We’ve been facing this same issue for a few of our apps that are background workers, where there are no health checks. They weren’t coming back up after deploys (after moving to v2).
Explains the headaches, thanks @catflydotio! I ended up going with the scale to 0 option (fly scale count 0 --process-group <group>) and then another deploy, keeping the standby machines around.
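To sum up the two fixes from this thread as something you can paste into a shell, here is a small sketch. It only wraps the commands quoted above. The function names and the FLY variable are made up for illustration: FLY lets you swap the binary, for example FLY=echo for a dry run that prints the arguments instead of running them.

```shell
# clear_standby MACHINE_ID
# Option 1: remove the standby status so the Machine starts on deploy.
clear_standby() {
    [ -n "${1:-}" ] || { echo "usage: clear_standby <machine-id>" >&2; return 2; }
    ${FLY:-fly} m update "$1" --standby-for=""
}

# recreate_group PROCESS_GROUP
# Option 2: scale the group to 0, then deploy again so the group is recreated
# with one active Machine and a standby for it.
recreate_group() {
    [ -n "${1:-}" ] || { echo "usage: recreate_group <process-group>" >&2; return 2; }
    ${FLY:-fly} scale count 0 --process-group "$1" && ${FLY:-fly} deploy
}

# Dry run, prints the commands without touching the app:
#   FLY=echo clear_standby <vm-id>
#   FLY=echo recreate_group <group>
```

Note that option 2 removes the existing Machines in that process group before the deploy, so expect the worker to be offline in between.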