Deployment started failing with "failed to release lease for machine"

On friday we started getting failing deployments in our Github PR workflow (without doing changes to configuration). I got a response from the fly staff that basicly said “try again”. And I did, this monday morning. And it worked! … For a few hours.

Now we are getting the same problem again, and this is serious since we cannot work effectivly in a critical phase.

Here is a snippet from the github workflow log

This deployment will:
 * create 1 "app" machine
> Launching new machine
No machines in group app, launching a new machine
> Machine 185e053c449208 [app] was created
WARN failed to release lease for machine 185e053c449208: lease not found
✖ Failed: timeout reached waiting for health checks to pass for machine 185e053c449208: failed to get VM 185e053c449208: Get "https://api.machines.dev/v1/apps/klimsek-editor-fixes-288/machines/185e053c449208": net/http: request canceled
Error: timeout reached waiting for health checks to pass for machine 185e053c449208: failed to get VM 185e053c449208: Get "https://api.machines.dev/v1/apps/klimsek-editor-fixes-288/machines/185e053c449208": net/http: request canceled
Error: Process completed with exit code 1.

Grateful for any insight / suggestions

Re-triggered and now it worked. I have no idea what is happening.

And now the problems are back again…

When I login on the dashboard I see that the app has now deployed successfully (without me doing anything).

From the fly deploy --help I find following

     --lease-timeout string             Time duration to lease individual machines while running deployment. All
                                         machines are leased at the beginning and released at the end.The lease
                                         is refreshed periodically for this same time, which is why it is
                                         short.flyctl releases leases in most cases. (default "13s")

      --wait-timeout string              Time duration to wait for individual machines to transition states and
                                         become healthy. (default "5m0s")


      --release-command-timeout string   Time duration to wait for a release command finish running, or 'none' to
                                         disable. (default "5m0s")

      --deploy-retries string            Number of times to retry a deployment if it fails (default "auto")

I have currently not configured any of these (i.e. default values).

Do you have any ideas on what I could / try? Since it feels that it gets “stuck” I am not so sure that just extending a timeout would help (and which one, even?). It often goes fairly quick, if everything is well. Perhaps I should decrease the timeout and add 3 retries? What do you think?

From App not working to Questions / Help

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.