Application completely bricked, can't scale, clone, redeploy.

danwetherald · April 3, 2024, 2:41pm

App
  Name     = better-cart-monster-bash-checkout-redis-prod
  Owner    = better-cart
  Hostname = better-cart-monster-bash-checkout-redis-prod.fly.dev
  Image    = -

Found machines that aren't part of Fly Launch, run fly machines list to see them.

fly apps restart does nothing. The web GUI fails to load most resources.

fly scale count 2:

Error: failed to launch VM: unable to use requested volume, 'vol_2n0l9vly5md4635d' due to capacity constraints
failed to launch VM: server returned a non-200 status code: 500 (Request ID: 01HTJ4C1RR03KQVQET8KASRQ25-chi)

fly deploy / fly machines clone:

Error: error creating a new machine: failed to launch VM: unable to use requested volume, 'vol_2n0l9vly5md4635d' due to capacity constraints (Request ID: 01HTJ66VA708NJSX3PSBJ36VDB-chi)

We are completely stuck with no way to get this application back up and the outage has now been active for over 24 hours.

john-fly · April 3, 2024, 4:13pm

Hi Dan,

This must be frustrating. It looks like your situation is an interaction between several different components. Most fundamentally, I can see that the server which hosted your original Fly Machine and Fly Volumes is actually very toasty right now and will not be coming back into service soon, if ever.

Unfortunately, you have two Fly Volumes on this host but only one Fly Machine. That means when you try to create a Machine in ORD, our scheduler tries to place it on the host with the unattached Volume. But that host is dead, so the launch fails.

The short-term solution is to bring up Machines in a different region. I see that you now have a Machine and a Volume in IAD; hopefully that’s working for you.

The long-term solution is to force-destroy the Machines and Volumes in ORD, and then you can deploy to that region again. We actually might have a bug in our API at the moment which does not allow the forced deletion of resources on a totally broken host, but we will fix this problem.

If you have been billed for any of these inaccessible resources, please email billing@fly.io for a refund.

Hope this is helpful; I’ll be watching this thread to follow up with any other questions you might have.

danwetherald · April 3, 2024, 4:20pm

Hi @john-fly - thanks for your reply.

Yes, an hour or so ago we tried moving to IAD. This seems to have half worked, but the app still appears to be having issues within the Fly system. We are not able to delete the ORD machines; is this something you can do? There are still a lot of issues trying to load and restart the apps, etc.

We are not able to destroy, kill, stop, etc. this machine (148e272fe473d8), as every request just returns a 408. It seems the machine only half exists in Fly.

We are still not able to connect to this self-hosted Redis app from other apps.

Thanks,

john-fly · April 3, 2024, 4:25pm

As for deleting the ORD resources, that looks like a bug in our system. We will look into that.

For whatever problems you’re having with IAD, can you give more information about what you’re seeing and how you’re connecting your better-cart-monster-bash-checkout-redis-prod app to everything else? I don’t have enough to go on with what you’ve said so far.

danwetherald · April 3, 2024, 4:28pm

We are getting connection timeouts to:

redis://:REDIS_PASSWORD@better-cart-monster-bash-checkout-redis-prod.internal:6379

Is this still the proper way to use internal domain routing?

Also, it looks like your Docs site is down; we are getting nginx errors.

john-fly · April 3, 2024, 4:31pm

Can you flyctl ssh console into any of your apps and dig or otherwise query better-cart-monster-bash-checkout-redis-prod.internal?
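
If dig is not installed in the image, the same check can be sketched in Python with the standard library. This is only an illustration: the helper name resolve_ipv6 and the injectable resolver argument are assumptions made for this example, not part of flyctl or any Fly API. It asks for IPv6 addresses only, since the address shown in the ping output later in this thread is an fdaa: IPv6 address.

```python
import socket
from typing import Callable, List


def resolve_ipv6(hostname: str, port: int = 6379,
                 resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the distinct IPv6 addresses that `hostname` resolves to.

    `resolver` defaults to socket.getaddrinfo; it is injectable only so
    the function can be exercised without network access (hypothetical
    convenience for this sketch).
    """
    infos = resolver(hostname, port, socket.AF_INET6, socket.SOCK_STREAM)
    addresses: List[str] = []
    for info in infos:
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] holds the textual address.
        addr = info[4][0]
        if addr not in addresses:
            addresses.append(addr)
    return addresses
```

Run from inside an app's console, resolve_ipv6("better-cart-monster-bash-checkout-redis-prod.internal") would list every address the name currently returns, or raise socket.gaierror if the lookup fails. An address belonging to the dead ORD machine in that list would explain the timeouts.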

danwetherald · April 3, 2024, 4:34pm

As I expected, traffic is still being routed to the bad machine in ORD.

PING better-cart-monster-bash-checkout-redis-prod.internal(148e272fe473d8.vm.better-cart-monster-bash-checkout-redis-prod.internal (fdaa:0:190:a7b:20dc:1258:5e7c:2)) 56 data bytes

This is why I would like to delete this machine.

john-fly · April 3, 2024, 4:41pm

Okay, fixing this force-destroy bug just became higher priority, but we’re not going to have a fix ready in a few minutes.

For an immediate fix, try a hostname that only directs traffic to the known-good Machine in the new region: iad.better-cart-monster-bash-checkout-redis-prod.internal.
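
As a rough sketch of how that workaround could be applied to the connection string quoted earlier in the thread, the helper below rewrites a .internal URL so its hostname is prefixed with a region code. The function name pin_to_region is hypothetical; the only thing taken from this thread is the region.app.internal hostname form itself.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_to_region(url: str, region: str) -> str:
    """Prefix the .internal hostname in `url` with `region`.

    Credentials and port are preserved. A URL that is already pinned
    to the same region is returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None or not host.endswith(".internal"):
        raise ValueError(f"not a .internal URL: {url!r}")
    if host.startswith(region + "."):
        return url
    # netloc looks like "user:password@host:port"; split off the userinfo
    # so that only the host part receives the region prefix.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    pinned = f"{userinfo}{at}{region}.{hostport}"
    return urlunsplit(parts._replace(netloc=pinned))
```

For example, passing redis://:REDIS_PASSWORD@better-cart-monster-bash-checkout-redis-prod.internal:6379 with region "iad" yields redis://:REDIS_PASSWORD@iad.better-cart-monster-bash-checkout-redis-prod.internal:6379. Once the dead ORD machine is gone, the unpinned URL can be restored.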

danwetherald · April 3, 2024, 4:49pm

Thanks, yes, the ping works properly when we specify a region.

Once we are able to delete this bad ORD machine, what would be the best steps to get this app back in the ORD region? I am assuming a new deployment will not choose this old/bad host.

john-fly · April 3, 2024, 5:01pm

Correct, the Fly Platform is not trying to place other workloads on this broken server; the only reason it is trying to do so in your case is that it sees you have an unattached volume there (this logic should clearly get changed, but that’s what’s happening at the moment). As soon as that’s deleted, deploys will work normally in ORD.
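
The behaviour described above can be pictured with a toy model. This is not Fly's scheduler code; the types, the host names, and the choose_host function are all hypothetical and encode only the rule stated in this thread: an unattached Volume in the target region pins the new Machine to that Volume's host, even when the host is down.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    region: str
    healthy: bool


@dataclass
class Volume:
    id: str
    region: str
    host: str       # name of the host the volume lives on
    attached: bool  # True if a Machine is already using it


class PlacementError(Exception):
    pass


def choose_host(region: str, hosts: List[Host], volumes: List[Volume]) -> str:
    """Pick a host for a new Machine in `region` (toy model only)."""
    by_name = {h.name: h for h in hosts}
    for vol in volumes:
        if vol.region == region and not vol.attached:
            # An unattached volume wins: the Machine must land on its host.
            host = by_name.get(vol.host)
            if host is None or not host.healthy:
                raise PlacementError(
                    f"unable to use requested volume {vol.id!r}: "
                    f"host {vol.host} is unavailable")
            return host.name
    # No unattached volume in the region: any healthy host will do.
    for host in hosts:
        if host.region == region and host.healthy:
            return host.name
    raise PlacementError(f"no healthy host in {region}")
```

In this model, a region with one dead host holding an unattached volume and one healthy host always fails placement. Removing the volume from the list lets the same call return the healthy host, which matches the expectation that deploys work again once the stuck ORD resources are deleted.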

As an aside, I took a look at your organization’s spending. You’re on a free plan, but you’re spending enough to upgrade to the Launch plan without extra cost. None of our plans cost anything in themselves; they’re just various levels of minimum spend. If you upgraded your better-cart account to the Launch plan, you would have a custom support email you could have used to get a faster response to this problem than waiting for a backend engineer to wander through the forums. And it won’t cost you anything extra, because you’re already well above the minimum spend for the Launch plan.

danwetherald · April 3, 2024, 7:10pm

Sounds good, can you let us know when we can destroy the dead ORD machine?

Thanks!

john-fly · April 3, 2024, 7:12pm

Of course.

system · April 10, 2024, 7:13pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
