Deployment hangs forever

Hi! One of my apps (https://reflame-resource.fly.dev/, and thankfully only that one) is getting completely stuck after deployment (at the point when monitoring starts, everything up to that point seems to finish normally) and never updates to the latest version.

Dashboard looks something like this:

I was able to recover the last time this happened a few days ago by scaling to 0 and then back to 3, but the issue came back, so I’m leaving it in this state for now to help with investigations.

Would be awesome if someone could help take a look at what’s going on. Thanks!

PS. It’d be nice if the Fly GitHub Action could have some default timeout for monitoring, and fail the job if the timeout passes. Currently the only thing notifying me of the issue is a subsequent deploy action canceling the previous one. In the meantime, the action could have been sitting there for potentially hours, eating up a ton of build minutes.

Hi @lewis

Regarding GitHub Actions, you can set timeout-minutes on a job to ensure it’s cancelled if it takes too long, so it doesn’t eat up hours of your build minutes.

Hey @lewis, can you run fly logs and paste the result?

@rahmatjunaid So I just opened fly logs and triggered another deploy, and no new logs showed up at all.

Yep, that’s what I ended up doing, but I mentioned this because it sounds like a good idea for fly to have some default timeout for monitoring deploys. If an app doesn’t get deployed in over 10 minutes (or possibly even 5?), something’s probably gone horribly wrong, like it did here.
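For anyone who wants a similar safety net outside of GitHub Actions, here is a minimal sketch of wrapping a long-running command in a time limit. It assumes GNU coreutils `timeout` is available; `run_with_timeout` is a hypothetical helper name, not something flyctl provides, and it only ends the local command rather than changing anything on Fly’s side.

```shell
# Sketch: fail a long-running command after a fixed time limit.
# Assumes GNU coreutils `timeout` is installed; `run_with_timeout` is a
# hypothetical helper name, not part of flyctl.
run_with_timeout() {
  limit="$1"
  shift
  status=0
  timeout "$limit" "$@" || status=$?
  # GNU timeout exits with 124 when the time limit was hit.
  if [ "$status" -eq 124 ]; then
    echo "timed out after $limit: $*" >&2
  fi
  return "$status"
}

# Example (not run here): give the deploy 10 minutes before failing the job.
# run_with_timeout 10m fly deploy --strategy rolling
```

A non-zero exit status from the wrapper should be enough to fail a CI step, which is the alerting behaviour being asked for here.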

You can add this flag after the deploy command:

--detach                Return immediately instead of monitoring deployment progress

This looks like it’s doing a canary deploy. But you have 3 regions set and a max per region of 1, which means it can’t boot the new instance (because all three regions are in use).

This is a horrible sharp edge in our system and we’re working hard to replace that plumbing. If it gets in this state again, you can run fly vm stop <id> on any existing VM and it’ll unstick the deploy.

Your GitHub action should probably include deploy --strategy rolling if it doesn’t already. We try to prevent canary deploys for apps with this config but not all instances get caught.

Ah interesting, I do remember setting it at some point, but for some reason my fly scale show looked like this so I didn’t think to dig further:

> fly scale show
VM Resources for reflame-resource
        VM Size: shared-cpu-1x
      VM Memory: 256 MB
          Count: 3
 Max Per Region: Not set

Could that be a bug in the CLI?

I just ran fly scale count 3 --max-per-region -1 and reran the build. That seemed to update the app to the latest version.

But when I ran a new deploy with fly deploy --strategy bluegreen, it started hanging again in the exact same way. Did the --max-per-region -1 not apply correctly?

Yes I’m aware of that flag, but I do still want to monitor to wait for successful deploys to complete and get alerted if there’s a timeout.

@kurt I still need some help with this, as my deploys are still hanging until I manually scale to 0 and back.

As per my previous reply, I’ve set --max-per-region -1, that means --strategy bluegreen should work, right?

Could it be not applying correctly?

Hey Lewis, the deployment is hanging because you’re using both bluegreen and --max-per-region at the same time. Bluegreen tries to launch a bunch of new VMs before tearing down the old ones, but max-per-region limits how many VMs can be in one region.

To use --max-per-region -1 you need to have --strategy rolling. Can you give that a go?
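To make the arithmetic behind these explanations concrete, here is a rough sketch. It is not flyctl’s actual scheduling logic: `can_place` is a hypothetical helper, and it assumes that a strategy needs some number of extra VMs alongside the existing ones, that capacity is regions times max per region, and that a negative max means no limit.

```shell
# Sketch of the placement arithmetic only, under the assumptions above.
# Arguments: current VM count, extra VMs the strategy needs at once,
# number of regions, max VMs per region (negative = assumed unlimited).
can_place() {
  count="$1"
  extra="$2"
  regions="$3"
  max_per_region="$4"
  if [ "$max_per_region" -lt 0 ]; then
    echo yes
    return 0
  fi
  capacity=$((regions * max_per_region))
  if [ $((count + extra)) -le "$capacity" ]; then
    echo yes
  else
    echo no
  fi
}

can_place 3 1 3 1    # canary: one extra VM                  -> no
can_place 3 3 3 1    # bluegreen: a second set of 3 VMs      -> no
can_place 3 0 3 1    # rolling: assumed to need no extra VMs -> yes
can_place 3 3 3 -1   # bluegreen, if -1 really lifts the cap -> yes
```

By this model a bluegreen deploy should fit once the per-region limit is actually gone, so a deploy that still hangs suggests either that the limit is not being cleared or that something else is going on.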

Hi @rahmatjunaid, I understand --max-per-region is not compatible with --strategy bluegreen. However, what I’m trying to do is turn off --max-per-region so I can use --strategy bluegreen again.

My understanding was that --max-per-region -1 should accomplish that, looking at the implementation where it defaults to -1.

If that’s not correct, what is the correct way to disable --max-per-region so I can start using --strategy bluegreen again?

FWIW, fly scale show shows Max Per Region as Not set:

> fly scale show
VM Resources for reflame-resource
        VM Size: shared-cpu-1x
      VM Memory: 256 MB
          Count: 3
 Max Per Region: Not set

But I think there’s a bug there, so I’m not sure I can trust that: MaxPerRegion is always "Not Set" (Issue #596, superfly/flyctl on GitHub)
