On Friday we started getting failing deployments in our GitHub PR workflow (without any changes to the configuration). I got a response from Fly staff that basically said "try again". And I did, this Monday morning. And it worked! … for a few hours.
Now we are getting the same problem again, and this is serious since we cannot work effectively in a critical phase.
Here is a snippet from the GitHub workflow log:
```
This deployment will:
 * create 1 "app" machine
> Launching new machine
No machines in group app, launching a new machine
> Machine 185e053c449208 [app] was created
WARN failed to release lease for machine 185e053c449208: lease not found
✖ Failed: timeout reached waiting for health checks to pass for machine 185e053c449208: failed to get VM 185e053c449208: Get "https://api.machines.dev/v1/apps/klimsek-editor-fixes-288/machines/185e053c449208": net/http: request canceled
Error: timeout reached waiting for health checks to pass for machine 185e053c449208: failed to get VM 185e053c449208: Get "https://api.machines.dev/v1/apps/klimsek-editor-fixes-288/machines/185e053c449208": net/http: request canceled
Error: Process completed with exit code 1.
```
When I log in to the dashboard I see that the app has now deployed successfully (without me doing anything).
From `fly deploy --help` I find the following:
```
--lease-timeout string             Time duration to lease individual machines while running deployment. All
                                   machines are leased at the beginning and released at the end. The lease
                                   is refreshed periodically for this same time, which is why it is
                                   short. flyctl releases leases in most cases. (default "13s")
--wait-timeout string              Time duration to wait for individual machines to transition states and
                                   become healthy. (default "5m0s")
--release-command-timeout string   Time duration to wait for a release command to finish running, or 'none' to
                                   disable. (default "5m0s")
--deploy-retries string            Number of times to retry a deployment if it fails (default "auto")
```
I currently have none of these configured (i.e. I am on the default values).
Do you have any ideas on what I could try? Since it feels like the deployment gets "stuck", I am not sure that just extending a timeout would help (and which one, even?). When everything is well, it usually goes fairly quickly. Perhaps I should decrease the timeout and add 3 retries instead? What do you think?
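For concreteness, here is a sketch of what that would look like in the workflow's deploy step. The flag names come from the `fly deploy --help` output above, but the step name, token secret, and the specific timeout/retry values are placeholders, not a tested recommendation:

```yaml
# Hypothetical GitHub Actions deploy step (values are illustrative).
- name: Deploy to Fly.io
  run: >
    flyctl deploy
    --wait-timeout 2m0s
    --deploy-retries 3
  env:
    # Assumes a deploy token is stored as a repository secret.
    FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
```

The idea being: fail fast on a stuck machine (shorter `--wait-timeout`) and let the explicit retries recover, rather than waiting out the full 5-minute default each time.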