Machine does not reboot after `fly deploy`

I have a machine for a cron process that needs to stay online all the time:

[processes]
  cron = "supercronic /app/crontab"

However, when I deploy my app (fly deploy), I can see the following in my logs:

reboot: Restarting system

Yet the machine never restarts. I need to manually start it with fly machines start --select.

Is there a way to make sure the machine restarts as expected? I was searching for how to set the so-called “restart policy” in my fly.toml, but that seems to be a config option for the app as a whole, rather than for a specific machine.

Any guidance would be appreciated.

Thank you

Hi @maxime1

You can set the restart policy to always for a Machine:
fly m update <machine ID> --restart always

Here’s a summary of the Machine restart policies so that you can verify what the right setting is for you: Issues with machines restart policy - #25 by catflydotio
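As a side note on the fly.toml question from the original post: newer flyctl versions also let you set a restart policy per process group directly in fly.toml. This is a hedged sketch based on the Fly.io app configuration reference, not something verified in this thread; check the current docs for the exact syntax:

```toml
# Hedged sketch: per-process restart policy in fly.toml.
# Supported in newer flyctl versions; verify against the current
# Fly.io app configuration reference before relying on it.
[[restart]]
  policy = "always"       # "no", "always", or "on-failure"
  processes = ["cron"]    # apply only to this process group
```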

I tried the above approach with no luck; it is already set to “always”:

$ fly m update 1781155ad41189 --restart always
Error: no config changes found

And to confirm, the current state shows “paused” with the following logs:

2023-07-10T19:41:48.400 runner[1781155ad41189] iad [info] Configuring firecracker
2023-07-10T19:41:48.405 app[1781155ad41189] iad [info] INFO Sending signal SIGINT to main child process w/ PID 233
2023-07-10T19:41:48.834 app[1781155ad41189] iad [info] INFO Main child exited with signal (with signal 'SIGINT', core dumped? false)
2023-07-10T19:41:48.835 app[1781155ad41189] iad [info] INFO Starting clean up.
2023-07-10T19:41:48.841 app[1781155ad41189] iad [info] WARN hallpass exited, pid: 234, status: signal: 15 (SIGTERM)
2023-07-10T19:41:48.854 app[1781155ad41189] iad [info] 2023/07/10 19:41:48 listening on [fdaa:1:b784:a7b:ab8:6033:4b56:2]:22 (DNS: [fdaa::3]:53)
2023-07-10T19:41:49.832 app[1781155ad41189] iad [info] [ 1786.244503] reboot: Restarting system

I’ve been having the same problem, here is a detailed explanation that never got any traction.

Hey @bryantbrock

Thanks for this extra info. Maybe the always restart policy is actually the problem here. If you run fly machine status <machine ID> you should be able to see the exit code when the Machine stops, which might help with figuring out the problem.

Does the worker Machine that actually works also have a restart policy of always?

Thanks for the reply @andie , here is the result of running the above command while the machine is in the limbo state:

Machine ID: 1781155ad41189
Instance ID: 01H50N5A42HMMM8ZJ23BFAH65K
State: stopped

VM
  ID            = 1781155ad41189                                                   
  Instance ID   = 01H50N5A42HMMM8ZJ23BFAH65K                                       
  State         = stopped                                                          
  Image         = client-portal-development:deployment-01H50MX50ASGGV1BEW1WJT3S74  
  Name          = crimson-wind-9803                                                
  Private IP    = fdaa:1:b784:a7b:ab8:6033:4b56:2                                  
  Region        = iad                                                              
  Process Group = worker-pdf                                                       
  CPU Kind      = shared                                                           
  vCPUs         = 1                                                                
  Memory        = 2048                                                             
  Created       = 2023-06-29T20:29:19Z                                             
  Updated       = 2023-07-10T19:41:50Z                                             
  Entrypoint    =                                                                  
  Command       = ["yarn","worker:pdf"]                                            

Event Logs
STATE   EVENT   SOURCE  TIMESTAMP                       INFO 
stopped update  flyd    2023-07-10T14:41:50.31-05:00 
created launch  user    2023-07-10T14:41:23.469-05:00

Would the status show under the INFO column of Event Logs?

The other machine does not appear to have a restart policy.

Configuration changes to be applied to machine: 3d8d501fe51e98 (falling-dream-2883)

        ... // 18 identical lines
          },
          "image": "...",
-         "restart": {},
+         "restart": {
+           "policy": "always"
+         },
          "guest": {
            "cpu_kind": "shared",
        ... // 8 identical lines

I do believe the currently failing machine also didn’t come with a restart policy; I only added one in an attempt to fix this issue.

UPDATE: not to repeat myself too much, but I just created a new environment and the same thing is happening to both of the workers I’ve been discussing here. Both have no restart policy and are stuck in this limbo state.


Thanks for providing the output. When you run fly machine status <machine ID>, do the event logs ever show starting or started events for any Machine?
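For anyone reading along, this kind of question is easy to answer from saved output by scanning the event log for start events. A minimal sketch, not a flyctl feature; the sample text mirrors the “limbo” event log pasted earlier in the thread:

```shell
# Sketch: check saved `fly machine status` event-log output for start events.
# The sample mirrors the stuck Machine's log shown above: only stopped/created.
event_log='stopped update  flyd    2023-07-10T14:41:50.31-05:00
created launch  user    2023-07-10T14:41:23.469-05:00'

# If flyd ever tried to boot the Machine, a "starting" or "started" row appears.
if printf '%s\n' "$event_log" | grep -qE '^(started|starting)'; then
  echo "flyd attempted a start"
else
  echo "no start events: flyd never tried to boot this machine"
fi
```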

Hi @andie

Here’s some more info:

Status before deploy:

Event Logs
STATE           EVENT   SOURCE  TIMESTAMP                       INFO 
started         start   flyd    2023-07-11T17:35:17.37-04:00 
starting        start   user    2023-07-11T17:35:17.052-04:00
stopped         update  flyd    2023-07-11T17:24:16.18-04:00 
created         launch  user    2023-07-11T17:23:59.745-04:00

Status after deploy:

Event Logs
STATE   EVENT   SOURCE  TIMESTAMP                       INFO 
stopped update  flyd    2023-07-12T12:01:45.771-04:00
created launch  user    2023-07-12T12:01:42.518-04:00

Status after fly m start --select

Event Logs
STATE           EVENT   SOURCE  TIMESTAMP                       INFO 
started         start   flyd    2023-07-12T12:03:29.264-04:00
starting        start   user    2023-07-12T12:03:28.918-04:00
stopped         update  flyd    2023-07-12T12:01:45.771-04:00
created         launch  user    2023-07-12T12:01:42.518-04:00

It wouldn’t be a complete answer, but I’m still curious: does that process depend on another process being up in order to boot properly?

No, I have a single independent process:

app = "appName"
primary_region = "yul"

[mounts]
  source="mypersistetvolume"
  destination="/data"
  processes = ["cron"]

[processes]
  cron = "supercronic /app/crontab"

@bryantbrock It looks as though the Machine that’s not starting began life as a standby machine. Standby Machines don’t start automatically unless the Machine they’re a standby for goes down.

@maxime1 Might this be the case with your Machine that’s not starting as well?

(Hat tip to @leslie)

You can check this with fly status and see if there’s a line like:

app†    9e784e26c01183  7       yyz     stopped                 2023-05-21T15:42:00Z

  † Standby machine (it will take over only in case of host hardware failure)

You can remove the standby status on a Machine with fly m update <vm-id> --standby-for="" – or scale that process group to 0 and run fly deploy again to get one active Machine and a standby for it, if you want the peace of mind of having the standby.
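The dagger check above is easy to script against saved fly status output. A minimal sketch, assuming the output format shown in this thread (the sample line is the one quoted above); the update command is the one from this thread:

```shell
# Sketch: flag standby Machines in saved `fly status` output.
# A trailing dagger (†) after the app name marks a standby Machine.
status_line='app†    9e784e26c01183  7       yyz     stopped                 2023-05-21T15:42:00Z'

case "$status_line" in
  *†*) echo "standby machine: it only starts if its primary host fails" ;;
  *)   echo "active machine" ;;
esac

# To clear the standby status (command from this thread):
#   fly m update <vm-id> --standby-for=""
```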


OMG THANK YOU! We’ve been facing this same issue for a few of our apps that are background workers, where there are no health checks. They weren’t coming back up after deploys (after moving to v2).


Yes, that did it!

I had removed the other machine because I preferred having only one machine for my app.

Thanks!


That explains the headaches, thanks @catflydotio! I ended up going with the scale-to-0 option (fly scale count 0 --process-group <group>) followed by another deploy, keeping the standby machines around.

