Some Machines take 6 minutes to be destroyed after request to destroy VM

Some Machines take several minutes after I call destroy vm to receive the signal and actually terminate. Is this expected? Should I be calling the wait API to wait for the machine to actually be destroyed? This is causing issues during my deployments: different boxes end up running different versions of the code.

My app name: 01gvfwwmq3dfyzsfdyc7hy1ker
Machine ID: 148e42ef7d5189

Backend server logs

cakework-controlplane [GIN] 2023/03/14 - 12:25:23 | 200 |   492.13255ms |    3.230.163.83 | POST     "/v1/vm/148e42ef7d5189/stop

Fly machine logs

2023-03-14 03:31:00.092 [fly] info sjc 7f3c 148e42ef7d5189 01gvfwwmq3dfyzsfdyc7hy1ker [ 4946.714940] reboot: Restarting system
2023-03-14 03:31:00.092 [fly] info sjc 7f3c 148e42ef7d5189 01gvfwwmq3dfyzsfdyc7hy1ker Sending signal SIGKILL to main child process w/ PID 513
2023-03-14 03:31:00.092 [fly] info sjc 7f3c 148e42ef7d5189 01gvfwwmq3dfyzsfdyc7hy1ker Starting clean up.

The request to destroy the machine doesn’t look to be delayed. Below are the raw events from our system:

    {
      "id": "01GVG1MASGQHDWHZJQQN8XR4ZY",
      "type": "destroy",
      "status": "destroyed",
      "source": "flyd",
      "timestamp": "2023-03-14T12:30:56.816Z",
      "data": {}
    },
    {
      "id": "01GVG1M8N1FZMD8FV87767GZVX",
      "type": "destroy",
      "status": "destroying",
      "source": "user",
      "timestamp": "2023-03-14T12:30:54.625Z",
      "data": {}
    }

And the time between receiving the request to destroy the machine and it being set to destroyed is ~2 seconds. Are the backend server logs also from an App running on Fly.io?
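
For anyone who wants to check the ~2 second figure against the two event timestamps above, here is a minimal sketch of the arithmetic. The helper name `event_gap_seconds` is made up for illustration and is not part of any Fly.io tooling.

```python
from datetime import datetime

# Timestamp layout used by the "timestamp" fields in the events above.
EVENT_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def event_gap_seconds(start_ts, end_ts):
    """Return the number of seconds between two event timestamps."""
    start = datetime.strptime(start_ts, EVENT_TS_FORMAT)
    end = datetime.strptime(end_ts, EVENT_TS_FORMAT)
    return (end - start).total_seconds()


# "destroying" (source: user) -> "destroyed" (source: flyd)
gap = event_gap_seconds("2023-03-14T12:30:54.625Z", "2023-03-14T12:30:56.816Z")
print(f"{gap:.3f} s")  # prints "2.191 s"
```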

Yup, that’s correct. I’ve verified that it’s not just a server timestamp clock drift issue because after the stop command was issued, the machine continued processing requests for another few minutes.

My guess is that it’s because I delete the Fly app instead of calling destroy on the Machine, and that deleting the app doesn’t immediately result in the machines being destroyed until later.

Should I be calling destroy on all the machines before I delete the app?

Ah, ok, yes this explains it. Destroying the app goes through our central API, which uses async jobs to destroy the resources associated with the App. If you destroy the machines first, the request goes through much quicker.
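
A rough sketch of that ordering (destroy each Machine, then delete the app) might look like the following. It is only an illustration under assumptions: the base URL `https://api.machines.dev/v1`, the `/apps/{app_name}/machines` paths, and bearer-token auth are taken from my understanding of the Machines API and should be checked against the current docs. The names `http_request` and `teardown_app` are hypothetical.

```python
import json
import urllib.request

# Assumption: base URL of the Fly Machines REST API.
API_BASE = "https://api.machines.dev/v1"


def http_request(method, url, token):
    """Send one authenticated request; return the decoded JSON body, or None if empty."""
    req = urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def teardown_app(app_name, token, request=http_request):
    """Request destruction of every Machine in the app, then delete the app.

    `request` is injectable so the ordering can be exercised without a network.
    Returns the IDs of the Machines a destroy was requested for.
    """
    machines = request("GET", f"{API_BASE}/apps/{app_name}/machines", token) or []
    destroyed = []
    for machine in machines:
        # Destroy each Machine directly instead of relying on the app's async cleanup.
        request("DELETE", f"{API_BASE}/apps/{app_name}/machines/{machine['id']}", token)
        destroyed.append(machine["id"])
    # Delete the app only after a destroy has been requested for every Machine.
    request("DELETE", f"{API_BASE}/apps/{app_name}", token)
    return destroyed
```

Calling `teardown_app("my-app", token)` would then replace a bare app delete in the service logic.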

Thanks! I updated my service logic. When I call destroy on a machine and the request returns successfully, does that mean the resource has been destroyed successfully? Or do I need to call wait and block until the state actually changes, similar to what I need to do when I create a new machine?

Should deleting an app in flyctl destroy associated resources first? Or is the better solution to handle this lower in the stack?

You can use the wait endpoint to block on the machine being completely destroyed since the DELETE /v1/{app_name}/machines/{machine_id} endpoint returns once it starts the process.
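
As a hedged sketch of that pattern: issue the DELETE, then block on the wait endpoint until the Machine reports `destroyed`. The base URL, the `/apps/` path segment, and the `state` and `timeout` query parameters are assumptions based on my reading of the Machines API docs, so verify them before relying on this. `http_status` and `destroy_and_wait` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Assumption: base URL of the Fly Machines REST API.
API_BASE = "https://api.machines.dev/v1"


def http_status(method, url, token):
    """Send one authenticated request and return the HTTP status code."""
    req = urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def destroy_and_wait(app_name, machine_id, token, timeout_s=60, request=http_status):
    """Request a destroy, then block until the Machine is reported destroyed.

    Returns True only if both the DELETE and the wait call succeed.
    """
    machine_url = f"{API_BASE}/apps/{app_name}/machines/{machine_id}"
    # The DELETE returns once destruction has started, not once it has finished.
    if not 200 <= request("DELETE", machine_url, token) < 300:
        return False
    # Assumption: the wait endpoint accepts `state` and `timeout` (seconds).
    wait_url = f"{machine_url}/wait?state=destroyed&timeout={timeout_s}"
    return 200 <= request("GET", wait_url, token) < 300
```

A `False` result means either call failed or the wait timed out, so the caller should retry or alert rather than assume the Machine is gone.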
