no capacity available in sjc

I’ve been unable to deploy for the last hour or so. A normal deploy hangs indefinitely at “Waiting for depot builder...”. If I add --depot=false, I get:

✓ Configuration is valid
--> Verified app config
==> Building image
WARN Failed to start remote builder heartbeat: failed to create volume: no capacity available in sjc

==> Building image
WARN Failed to start remote builder heartbeat: failed to create volume: no capacity available in sjc

Error: failed to fetch an image or build from source: failed to create volume: no capacity available in sjc (Request ID: 01JYC9JYMASKS183ZGKK264DRB-sjc) (Trace ID: c256e997d7450d478c367bf930999fd1)

I’ve also tried sea, and that likewise hangs forever at “Waiting for depot builder...”.

Even after configuring it to deploy to sea, if I set --depot=false, I get the sjc capacity error: WARN Failed to start remote builder heartbeat: failed to create volume: no capacity available in sjc

Looks like maybe it’s recovered. A status page update would’ve been helpful.

sorry for the trouble, I just added more capacity in sjc.

Thanks! Although, I realized my test is actually still deploying to sea, not sjc. Seems like there were issues in both regions?

sjc also working

even if you choose a different region, the legacy builder will try to deploy to your closest region. iirc there is a way to choose a different builder region but I don’t recall what it is off the top of my head

@lillian This topic comes up every week or so. The advice I have given a few times is that a capacity failure should be handled by the user, by falling back to a series of acceptable alternatives. I infer that from this text in the API docs:

[Machine creation] can fail, and you’re responsible for handling that failure. If you ask for a large Machine, or a Machine in a region we happen to be at capacity for, you might need to retry the request, or to fall back to another region. If you’re working directly with the Machines API, you’re taking some responsibility for your own orchestration!

Is this still correct advice, in your view? (I appreciate commands like scale can do this on the user’s behalf, but to apply the same advice, I wonder if readers should advocate that all Fly users configure additional regions. Based on the OP’s error, I assume they are set up to only deploy in sjc, and thus need to expand that list in order to be resilient.)

that is correct, yes. we are generally not at capacity in a region for an extended period of time (we’d put up a status page if that were the case); but it is possible users will hit capacity errors for short periods of time while we spin up burst capacity.
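
For readers who want to see what "retry the request, or fall back to another region" could look like in practice, here is a minimal sketch. It makes no real API calls: `create` is a placeholder for whatever actually creates the Machine, and `CapacityError`, `create_with_fallback`, and the parameter names are hypothetical, chosen only for illustration.

```python
import time
from typing import Callable, List, Sequence


class CapacityError(Exception):
    """Hypothetical error standing in for a 'no capacity available' response."""


def create_with_fallback(
    create: Callable[[str], str],
    regions: Sequence[str],
    attempts_per_region: int = 2,
    delay_seconds: float = 0.0,
) -> str:
    """Try each region in order, retrying a few times before moving on.

    `create` takes a region code and returns a machine ID, or raises
    CapacityError. It is an assumption of this sketch, not a real client.
    """
    failed: List[str] = []
    for region in regions:
        for attempt in range(attempts_per_region):
            try:
                return create(region)
            except CapacityError:
                # Pause before retrying the same region; skip the pause
                # after the last attempt so we move on promptly.
                if attempt < attempts_per_region - 1:
                    time.sleep(delay_seconds)
        failed.append(region)
    raise CapacityError("no capacity available in " + ", ".join(failed))


def fake_create(region: str) -> str:
    # Stand-in that pretends sjc is full and every other region works.
    if region == "sjc":
        raise CapacityError(region)
    return "machine-in-" + region


print(create_with_fallback(fake_create, ["sjc", "sea", "lax"]))
```

The order of the region list encodes preference, so the closest or cheapest region goes first and the error only surfaces when every listed region has failed.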

Super, thanks. I know we’ve talked about awareness campaigns here, and I wonder if this is a similar thing. I wonder if too few users are aware that a capacity error is something they can (and should) handle themselves. I think we’ve had cases where that knowledge gap has cost them unnecessary downtime, while they’re waiting for a forum/support reply.

@lillian how long is “an extended period of time”? From what I can tell, the capacity shortage lasted at least an hour.

A few things that would have helped me solve this myself:

  • Instead of pausing at “Waiting for depot builder...” indefinitely, report an error about the capacity shortage and point to docs on how to be prepared with fallback options.
  • Add a warning in the legacy builder output when it’s building in sjc but the deploy command and/or fly.toml are requesting a different region.
  • It seemed like specifying a different primary region on the CLI didn’t override fly.toml, but that may have been confusion caused by --depot=false.
  • It wasn’t clear if the primary region could be changed without destroying and recreating the app

It can be a few hours, depending on provider availability (and staff availability, especially on weekends).

sorry, I’m not as familiar with the PaaS side of things so I can’t comment on the other issues/suggestions.

I like Ian’s suggestion. Could this error:

no capacity available in sjc

be rendered like so?

no capacity available in sjc (please configure additional fallback regions)

It would work for multiple region failures too:

no capacity available in sjc, sea (please configure additional fallback regions)

Basically, the deploying engineer should never see this, unless many regions are down simultaneously.
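
As a rough illustration of the proposed wording (not how flyctl actually builds its errors), a formatter that covers both the single-region and multi-region case might look like the sketch below. The function name `capacity_error_message` is made up for this example.

```python
from typing import Sequence


def capacity_error_message(regions: Sequence[str]) -> str:
    """Render the suggested error text for one or more exhausted regions."""
    if not regions:
        raise ValueError("at least one region is required")
    hint = "(please configure additional fallback regions)"
    return "no capacity available in " + ", ".join(regions) + " " + hint


print(capacity_error_message(["sjc"]))
print(capacity_error_message(["sjc", "sea"]))
```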

good idea! I’ve thrown it up internally in case anyone has time to look at it.

It would also be good to point to docs. I couldn’t find any docs around configuring fallback regions.

“fallback regions” aren’t a Fly concept. I assume halfer refers to creating machines in other regions - fly scale count N --region lax for instance.

Yep, exactly that command @lillian; I’m using general CS terms, rather than aiming for Fly-specific terminology. I believe that if multiple regions are configured in that way for an app, this error would only fire if all regions are unavailable, and that the deploy command would try all of them, in order.

Please correct me if I’m wrong, as I don’t wish to misadvise.

I don’t believe it does that, no, it only tries the single region you specify in the command. the Machines API supports geographical regions such as us, eu, any, but afaik flyctl is not set up to accept those yet (and they would work only on the fly scale count --region X command).

Ah, gotcha. Perhaps @ianwremmel you could deploy via the API? I wonder if deploying something to a nearby region is better than not deploying at all!

Yes, but for clarity for the general reader, I believe this command can also take a list of regions, like so: --region yyz,ewr.
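
If flyctl really does try only the single region it is given, one workaround is a small wrapper that runs the fly scale count command from this thread once per region and stops at the first success. This is a sketch under assumptions: that the command exits non-zero on a capacity failure, and that any confirmation prompt is handled separately. The helper names `build_scale_command` and `scale_first_available` are hypothetical.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def build_scale_command(count: int, region: str) -> List[str]:
    """Build the argv for `fly scale count N --region R`, as quoted in the thread."""
    return ["fly", "scale", "count", str(count), "--region", region]


def scale_first_available(
    count: int,
    regions: Sequence[str],
    run: Callable[[List[str]], int] = subprocess.call,
) -> Optional[str]:
    """Try each region in order; return the first one whose command exits 0.

    `run` takes an argv list and returns an exit code. It defaults to
    subprocess.call and can be swapped for a fake in tests.
    Assumption: a capacity failure produces a non-zero exit code.
    """
    for region in regions:
        if run(build_scale_command(count, region)) == 0:
            return region
    return None


# Example call (needs flyctl installed and an app context, so it is not run here):
# scale_first_available(1, ["sjc", "sea", "lax"])
```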

you could deploy via the API?

maybe? I’m not quite sure what you mean by that :)

I’m still pretty new to using Fly (my account has been around a while, but I only started dev in the last two weeks), so I’m still figuring out the intricacies. Is fly scale the only way to bring up machines in other regions? Or is there a way to do it via fly.toml?

Righto. There is a RESTful API that controls the whole thing. Anything you can do in flyctl you can do in the API, and per Lillian’s note, you can do more in the API.

Based on this conversation, I’d experiment with “update” next. If there’s a failure then you can retry in the same region, or retry in a different region, as your fallback design dictates. If you have N running machines then you would make at least N update calls, with more if you have any failures. (Designing one’s own deployment patterns is part of the fun of a platform like Fly).
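
To make the "at least N update calls" idea concrete, here is a minimal sketch of the loop. It does not talk to the real API: `update` and `fallback` are placeholder callables you would supply yourself, and what the fallback does (for example, bringing up a replacement in another region) is deliberately left open. The name `rolling_update` is also hypothetical.

```python
from typing import Callable, Dict, Sequence


def rolling_update(
    machine_ids: Sequence[str],
    update: Callable[[str], bool],
    fallback: Callable[[str], None],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Update machines one at a time, retrying each before falling back.

    `update` returns True on success and False on failure. `fallback` is
    called once for a machine whose update failed max_attempts times.
    Both are assumptions of this sketch, not part of any real client.
    """
    results: Dict[str, str] = {}
    for machine_id in machine_ids:
        for _ in range(max_attempts):
            if update(machine_id):
                results[machine_id] = "updated"
                break
        else:
            # Every attempt failed: hand over to the fallback design.
            fallback(machine_id)
            results[machine_id] = "fallback"
    return results
```

With N machines and no failures this makes exactly N update calls. Each failure adds retries, up to max_attempts per machine.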