App crashed, won't restart or deploy (production app)

Hi, I have an app down which won’t restart using flyctl restart. I’m watching the logs as I issue the restart command, but the logs don’t change. I can see the app is now “pending” in the app dashboard, but nothing is progressing.

I’ve tried deploying also, but I get a weird message from flyctl:

Waiting for remote builder fly-builder-long-bird-1694... connecting ⣟
Error error connecting to docker: error establishing wireguard connection for [~redacted~] organization: error fetching dialer: establish failed: err err handling establish: no such organization

Weirdly, the organisation says it has a credit value of $-0.03. I’m not sure whether this could be blocking the deploy?

Anyway, this is a production app, so I need to get it back up ASAP. Can someone from @fly.io help?

Hey, sorry for the trouble. Which region is this running in?

LHR, but it seems to have come back up now. Not sure if that was someone nudging it, or whether a “blockage” resolved itself, but either way we’re back up! Yay! :grin:

LHR is having some capacity problems today due to a flurry of deployments there. You probably got lucky when another customer migrated away from LHR just now, for the same reason. If you can, it might be smart to migrate to FRA which is less loaded up.

Also, could you let me know if you have any volumes associated with your app? And if you have any backup regions set? This should not have happened if either of those were true.

The app that wouldn’t start doesn’t have any volumes or backup regions. I’ll add a backup region now.

I had the same. App in LHR. Was down at the same time a few hours ago.

I checked flyctl --app name status and the two instances are showing as created 3h44m ago. I haven’t touched the app today, so those must have been re-deployed by Fly, as the prior ones were much older.

The status page shows no incidents reported today but perhaps you are still investigating or don’t want to publicise it yet.

I have two instances in the one region, as per the advice in “Again issue Deploying to MAA Region” (#10 by kurt), though that hasn’t helped here :slight_smile:

I thought I had backup regions set. And that seems to still be the case:

$ flyctl --app name regions list

Region Pool:
lhr

Backup Region:
cdg

The app has no volumes.
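
For anyone who wants to check this from a script rather than by eye, here is a minimal sketch that parses the text printed by the regions list command above. It assumes the output looks exactly like what I pasted (a heading ending in a colon, then one region code per line); the format may differ between flyctl versions, and parse_regions and SAMPLE are names I made up for illustration.

```python
def parse_regions(output):
    """Group region codes under the heading line that precedes them.

    Assumes the layout pasted in this thread: a heading ending in ':'
    followed by one region code per line. This is not an official format.
    """
    sections = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip the blank lines between sections
        if line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Sample text copied from the output shown above.
SAMPLE = """Region Pool:
lhr

Backup Region:
cdg
"""

regions = parse_regions(SAMPLE)
has_backup = bool(regions.get("Backup Region"))
print(regions, "backup set:", has_backup)
```

This only tells you what is configured, not whether the backup region was actually used during the outage.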

I guess that means the backup region did not kick in? Is there any way of seeing in the logs if it did (I’m assuming it did not based on the app being down)?

I checked flyctl --app name releases and that only shows mine from a week ago. Likewise flyctl --app name history. Same. I don’t know whether those commands would include deploys/releases to a backup region, so the missing data could mean either that it did not happen or that it would not be listed even if it did. But it certainly looks like it did not, especially since the regions of my two instances show as lhr now, and I would assume they would be cdg. Which is disappointing.

I’d assumed that Fly would not let new apps deploy to a region if capacity was low. Or at least somewhere within a critical threshold. So that there was sufficient capacity for existing apps in the region to stay up.

Any way to avoid this on the user side going forward?

When we reach capacity in different regions, we evict shared-cpu-1x VMs to make room for larger ones. This makes sense when you have backup regions specified, but is bad when your app is single region. We’re adjusting that logic to keep shared-cpu-1x VMs that don’t have backup regions set around longer.

This is only the second time our system has ever had to evict VMs in a given region so hopefully you never experience it again. :slight_smile:

Thanks @kurt

I wasn’t aware of that condition of shared CPU VMs being booted. Is that documented? I don’t see that on the pricing page. I get that you are paying for a slice of a CPU, but figured that slice would be all the time. Not a slice of a CPU only if one is available.

Thinking about it, that would also impact auto-scaling since there would be no capacity to auto-scale into. So even if more instances were needed, they wouldn’t be created. And requests would fail. Hmm.

It’s good they are going to be kept longer, but it’s a concern if you have to hope another client gives in first and deletes or moves their app in order to free up capacity (as it sounds like happened here with a customer migrating away). Whether it will happen for a third time … I guess that is the question. It’s a nice problem for you to have. You must be growing faster than you can buy more servers :slight_smile:

Regarding backup regions … I was going to remove that cdg as per your recommendation in “App starting in backup region” (#2 by kurt), but forgot and left it. It appears not to have helped in this case, which seems the perfect case for a backup region. Would your advice now be to have backup regions … or not?

Thanks.

It’s more accurate to say they get rescheduled in regions where there is capacity. This doesn’t work for apps that are pinned to a single region. They don’t get turned off and left that way, though.

We just pushed a change that will prevent single region VMs from being evicted, since they can’t be rescheduled. I don’t think you need to worry about it now.
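
To make the behaviour described in this thread concrete, here is a toy model of it. This is only a sketch of what was said above (when a region is full, shared-cpu-1x VMs are moved to a region with capacity, and VMs pinned to a single region are now left alone); it is not Fly’s actual scheduler, and VM, pick_evictions, reschedule, and the free-slot counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    size: str
    region: str
    backup_regions: list = field(default_factory=list)


def pick_evictions(vms, region, slots_needed):
    """Choose VMs to move out of a full region.

    Only shared-cpu-1x VMs that have at least one backup region are
    candidates; single-region VMs are skipped because they could not
    be rescheduled anywhere else.
    """
    candidates = [
        vm for vm in vms
        if vm.region == region
        and vm.size == "shared-cpu-1x"
        and vm.backup_regions
    ]
    return candidates[:slots_needed]


def reschedule(vm, free_slots):
    """Move a VM to its first backup region with a free slot, if any."""
    for candidate in vm.backup_regions:
        if free_slots.get(candidate, 0) > 0:
            free_slots[candidate] -= 1
            vm.region = candidate
            return True
    return False  # nowhere to go, so the VM stays where it is


vms = [
    VM("single-region-app", "shared-cpu-1x", "lhr"),
    VM("app-with-backup", "shared-cpu-1x", "lhr", ["cdg"]),
    VM("bigger-app", "dedicated-cpu-1x", "lhr", ["fra"]),
]
evicted = pick_evictions(vms, "lhr", slots_needed=2)
for vm in evicted:
    reschedule(vm, {"cdg": 1})
```

In this model only app-with-backup is moved (to cdg); the single-region app and the larger VM stay in lhr.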

Ah, nice.

So is it safer to remove all backup regions now? To make apps single-region?

The issue of auto-scaling would still remain though. Even if the current instance(s) kept their existing capacity and were not booted, a full region would still not let an app scale. Is there any way to know capacity? I get you can’t give absolute instance/server numbers, but are any regions, er, bigger than others in terms of “likely have capacity here”? Or does it change so often you can’t really use that as a factor?

Will they eventually be rescheduled back to their primary region?

My app is targeted to LHR because it serves the UK only. It’s fine to be rescheduled nearby in Europe if there’s an issue, but I would prefer it to generally be hosted in its primary location, the UK.

Yes, they’ll migrate back over time!