I want to preface this by saying that we know there are issues in certain regions right now. However, we are in IAD.
Around 11AM EST, our CI started deploying to our staging environment. The deployment took over 45 minutes. During the deployment, machines got stuck in loops and timed out:
> [2/2] Updating machine config for [REDACTED]
> [2/2] Updating [REDACTED] [app]
> [2/2] Updated machine config for [REDACTED]
> [2/2] Waiting for machine [REDACTED] to reach a good state
> [2/2] Machine [REDACTED] reached stopped state
> [2/2] Machine [REDACTED] reached started state
> [2/2] Machine [REDACTED] reached suspended state
Failed to update machines: failed to update machine [REDACTED]: timeout reached waiting for machine's state to change Retrying...
✖ [2/2] timeout reached waiting for machine's state to change
> [1/2] Updating machine config for [REDACTED]
> [1/2] Updating [REDACTED] [app]
> [1/2] Updated machine config for [REDACTED]
> [1/2] Waiting for machine [REDACTED] to reach a good state
> [1/2] Machine [REDACTED] reached suspended state
> [1/2] Machine [REDACTED] reached started state
> [1/2] Machine [REDACTED] reached stopped state
✖ [1/2] timeout reached waiting for machine's state to change
> [2/2] Updating machine config for [REDACTED]
> [2/2] Updating [REDACTED] [app]
> [2/2] Updated machine config for [REDACTED]
> [2/2] Waiting for machine [REDACTED] to reach a good state
> [2/2] Machine [REDACTED] reached stopped state
> [2/2] Machine [REDACTED] reached started state
> [2/2] Machine [REDACTED] reached suspended state
> Failed to update machines: failed to update machine [REDACTED]: timeout reached waiting for machine's state to change Retrying...
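In case it helps anyone compare against their own output, here is a minimal Python sketch that tallies which states each machine slot reached and how many timeouts it hit. It is not part of our CI. The function name `summarize` and both regular expressions are assumptions based only on the log lines shown above, so they may need adjusting for other output.

```python
import re
from collections import defaultdict

# Matches lines such as "> [2/2] Machine [REDACTED] reached stopped state".
# Pattern is an assumption derived from the log excerpt above.
STATE_RE = re.compile(r"\[(\d+)/(\d+)\] Machine .* reached (\w+) state")
# Matches failure lines such as "✖ [2/2] timeout reached waiting for ...".
TIMEOUT_RE = re.compile(r"\[(\d+)/(\d+)\] timeout reached")


def summarize(log_text):
    """Return {slot: {"states": [...], "timeouts": n}} for a deploy log."""
    summary = defaultdict(lambda: {"states": [], "timeouts": 0})
    for line in log_text.splitlines():
        match = STATE_RE.search(line)
        if match:
            slot = match.group(1) + "/" + match.group(2)
            summary[slot]["states"].append(match.group(3))
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            slot = match.group(1) + "/" + match.group(2)
            summary[slot]["timeouts"] += 1
    return dict(summary)


# One attempt copied from the log above.
sample = """\
> [2/2] Waiting for machine [REDACTED] to reach a good state
> [2/2] Machine [REDACTED] reached stopped state
> [2/2] Machine [REDACTED] reached started state
> [2/2] Machine [REDACTED] reached suspended state
\u2716 [2/2] timeout reached waiting for machine's state to change
"""

result = summarize(sample)
for slot, info in result.items():
    print(slot, info["states"], "timeouts:", info["timeouts"])
```

Run against the full log, this makes the pattern easier to see: each attempt passes through the stopped, started, and suspended states (in varying order) and then times out.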
We attempted another deployment about an hour later, and it still had not completed after running for over an hour. When we looked at the machines in the UI, we saw one machine in a bad state.
We stopped it, deleted it, and scaled back up. This didn’t help.
We’ve tried re-deploying, but the same behavior occurs. Notably, we didn’t see these issues in our demo and production environments after the 11AM deployment, but we also haven’t attempted to deploy to those environments since (our CI deploys to staging, then demo, then prod).
This is an issue that we used to encounter weekly in this staging environment, but we hadn’t encountered it for several months until now.
Any ideas?