Machines entering uncoverable state during deployments

ctcb · December 4, 2024, 6:58pm

I want to preface by saying that we know there are issues in certain regions right now. However, we are in IAD.

Around 11AM EST our CI started deploying to our staging environment. It took over 45 minutes to deploy. During the deployment, machines get stuck in loops and timed out.

> [2/2] Updating machine config for [REDACTED]
> [2/2] Updating [REDACTED] [app]
> [2/2] Updated machine config for [REDACTED]
> [2/2] Waiting for machine [REDACTED] to reach a good state
> [2/2] Machine [REDACTED] reached stopped state
> [2/2] Machine [REDACTED] reached started state
> [2/2] Machine [REDACTED] reached suspended state
Failed to update machines: failed to update machine [REDACTED]: timeout reached waiting for machine's state to change Retrying...
✖ [2/2] timeout reached waiting for machine's state to change
> [1/2] Updating machine config for [REDACTED]
> [1/2] Updating [REDACTED] [app]
> [1/2] Updated machine config for [REDACTED]
> [1/2] Waiting for machine [REDACTED] to reach a good state
> [1/2] Machine [REDACTED] reached suspended state
> [1/2] Machine [REDACTED] reached started state
> [1/2] Machine [REDACTED] reached stopped state
✖ [1/2] timeout reached waiting for machine's state to change
> [2/2] Updating machine config for [REDACTED]
> [2/2] Updating [REDACTED] [app]
> [2/2] Updated machine config for [REDACTED]
> [2/2] Waiting for machine [REDACTED] to reach a good state
> [2/2] Machine [REDACTED] reached stopped state
> [2/2] Machine [REDACTED] reached started state
> [2/2] Machine [REDACTED] reached suspended state
> Failed to update machines: failed to update machine [REDACTED]: timeout reached waiting for machine's state to change Retrying...

We attempted another deployment about an hour after and that deployment never resolved after running for over an hour. When we looked at the machines in the UI, we’d see one machine in a bad state.

We stopped it, deleted it, and scaled back up. This didn’t help.

We’ve tried re-deploying but the same behaviors occur. Notably we didn’t get these issues in our demo and production environments after the 11AM deployment, but we also haven’t attempted to deploy those environments since (our CI deploys to staging, then demo, then prod).

This is an issue that we used to encounter weekly in this staging environment, and we we haven’t encountered this behavior for several months.

Any ideas?

ctcb · December 4, 2024, 9:09pm

I posted in another thread about this, but maybe relevant here as well. The Machines API looks like its having problems across most regions according to https://rtt.fly.dev . Most regions aren’t even displaying a response time for me.

Topic		Replies	Views
fly deploy machine timeout during release command elixir	3	895	August 11, 2023
Error: timed out waiting for machine to reach started state - Node.js app JavaScript postgres , nodejs	9	1108	November 5, 2023
deploy times out Build debugging rails , machines	4	49	May 4, 2025
Machine stuck in replacing state Build debugging	49	2063	February 21, 2025
Cannot deploy app since yesterday	15	961	August 16, 2023

Machines entering uncoverable state during deployments

Related topics