I’ve been unable to deploy for the last hour or so. A normal deploy hangs indefinitely at “Waiting for depot builder...”. If I add --depot=false, I get:
✓ Configuration is valid
--> Verified app config
==> Building image
WARN Failed to start remote builder heartbeat: failed to create volume: no capacity available in sjc
==> Building image
WARN Failed to start remote builder heartbeat: failed to create volume: no capacity available in sjc
Error: failed to fetch an image or build from source: failed to create volume: no capacity available in sjc (Request ID: 01JYC9JYMASKS183ZGKK264DRB-sjc) (Trace ID: c256e997d7450d478c367bf930999fd1)
I’ve also tried sea, and that likewise seems to hang forever at “Waiting for depot builder...”.
Even after configuring it to deploy to sea, if I set --depot=false, I get the sjc capacity error: WARN Failed to start remote builder heartbeat: failed to create volume: no capacity available in sjc
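For anyone scripting around this, here is a minimal Python sketch of how the capacity failure could be told apart from other deploy errors. It only matches the message wording that appears in the output pasted above; the function name `capacity_error_region` is made up for illustration, and the wording may differ between flyctl versions.

```python
import re
from typing import Optional

# Matches the message text seen in the pasted output; this wording is an
# assumption taken from that log, not a documented, stable format.
_CAPACITY_RE = re.compile(r"no capacity available in (?P<region>[a-z]{3})\b")


def capacity_error_region(output: str) -> Optional[str]:
    """Return the region code named in a capacity error, or None if absent."""
    match = _CAPACITY_RE.search(output)
    return match.group("region") if match else None


if __name__ == "__main__":
    line = (
        "WARN Failed to start remote builder heartbeat: "
        "failed to create volume: no capacity available in sjc"
    )
    print(capacity_error_region(line))  # prints: sjc
```

A wrapper script could use the returned region to decide whether to retry or to try elsewhere, rather than treating every failure the same way.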
even if you choose a different region, the legacy builder will try to deploy to your closest region. iirc there is a way to choose a different builder region but I don’t recall what it is off the top of my head
@lillian This topic comes up every week or so. The advice I have given a few times is that a capacity failure should be handled by the user, by falling back to a series of acceptable alternatives. I infer that from this text in the API docs:
[Machine creation] can fail, and you’re responsible for handling that failure. If you ask for a large Machine, or a Machine in a region we happen to be at capacity for, you might need to retry the request, or to fall back to another region. If you’re working directly with the Machines API, you’re taking some responsibility for your own orchestration!
Is this still correct advice, in your view? (I appreciate commands like scale can do this on the user’s behalf, but to apply the same advice, I wonder if readers should advocate that all Fly users configure additional regions. Based on the OP’s error, I assume they are set up to only deploy in sjc, and thus need to expand that list in order to be resilient.)
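To make "retry the request, or fall back to another region" concrete, here is a rough Python sketch of the pattern. It deliberately does not call any real Fly API: `create` is a placeholder callable you would supply yourself, and `CapacityError` is a hypothetical exception type standing in for however your client reports a capacity failure.

```python
import time
from typing import Callable, Optional, Sequence


class CapacityError(Exception):
    """Hypothetical: raised by the caller-supplied create function on a capacity failure."""


def create_with_fallback(
    create: Callable[[str], str],
    regions: Sequence[str],
    retries_per_region: int = 2,
    delay_seconds: float = 0.0,
) -> str:
    """Try each region in order, retrying a few times before moving to the next."""
    last_error: Optional[Exception] = None
    for region in regions:
        for attempt in range(retries_per_region):
            try:
                return create(region)
            except CapacityError as err:
                last_error = err
                # Linearly growing pause between attempts (no pause when delay is 0).
                time.sleep(delay_seconds * (attempt + 1))
    raise RuntimeError(f"no capacity in any of {list(regions)}") from last_error


if __name__ == "__main__":
    def fake_create(region: str) -> str:
        # Stand-in for a real create call: pretend sjc is full.
        if region == "sjc":
            raise CapacityError("no capacity available in sjc")
        return f"machine-in-{region}"

    print(create_with_fallback(fake_create, ["sjc", "sea"]))
```

The order of `regions` is the fallback preference, so the primary region goes first and the acceptable alternatives follow.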
that is correct, yes. we are generally not at capacity in a region for an extended period of time (we’d put up a status page if that were the case); but it is possible users will hit capacity errors for short periods of time while we spin up burst capacity.
Super, thanks. I know we’ve talked about awareness campaigns here, and I wonder if this is a similar thing. I wonder if too few users are aware that a capacity error is something they can (and should) handle themselves. I think we’ve had cases where that knowledge gap has cost them unnecessary downtime, while they’re waiting for a forum/support reply.
@lillian how long is “an extended period of time”? From what I can tell, the capacity shortage lasted at least an hour.
A few things that would have helped me solve this myself:
Instead of pausing indefinitely at “Waiting for depot builder...”, send an error about the capacity shortage and refer to docs on how to be prepared with fallback options.
Add a warning in the legacy builder output when it’s building in sjc but the deploy command and/or fly.toml are requesting a different region.
It seemed like specifying a different primary region on the CLI didn’t override fly.toml, but that may have been confusion caused by --depot=false.
It wasn’t clear whether the primary region could be changed without destroying and recreating the app.
Yep, exactly that command @lillian; I’m using general CS terms, rather than aiming for Fly-specific terminology. I believe that if multiple regions are configured in that way for an app, this error would only fire if all regions are unavailable, and that the deploy command would try all of them, in order.
Please correct me if I’m wrong, as I don’t wish to misadvise.
I don’t believe it does that, no, it only tries the single region you specify in the command. the Machines API supports geographical regions such as us, eu, any, but afaik flyctl is not set up to accept those yet (and they would work only on the fly scale count --region X command).
I’m still pretty new to using fly (my account has been around a while, but only started dev in the last two weeks), so i’m still figuring out the intricacies. is fly scale the only way to bring up machines in other regions? or is there a way to do it via fly.toml?
Righto. There is a RESTful API that controls the whole thing. Anything you can do in flyctl you can do in the API, and per Lillian’s note, you can do more in the API.
Based on this conversation, I’d experiment with “update” next. If there’s a failure, you can retry in the same region or retry in a different region, as your fallback design dictates. If you have N running machines, you would make at least N update calls, with more if you have any failures. (Designing one’s own deployment patterns is part of the fun of a platform like Fly.)
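As a sketch of the "at least N calls, more on failure" idea, the Python below loops over machines and counts update attempts. Everything here is an assumption for illustration: `update` is a caller-supplied callable returning True on success, not a real Fly client function, and what "retry in a different region" means in practice (for example, creating a replacement machine there) is left to that callable.

```python
from typing import Callable, Dict, Sequence, Tuple


def update_all(
    machines: Dict[str, str],            # machine id -> its current region
    update: Callable[[str, str], bool],  # hypothetical: (machine_id, region) -> success
    fallback_regions: Sequence[str],
) -> Tuple[Dict[str, str], int]:
    """Update every machine, returning (machine id -> region used, total calls made)."""
    placed: Dict[str, str] = {}
    calls = 0
    for machine_id, home in machines.items():
        # Try the home region twice (one retry), then each other fallback region once.
        candidates = [home, home] + [r for r in fallback_regions if r != home]
        for region in candidates:
            calls += 1
            if update(machine_id, region):
                placed[machine_id] = region
                break
        else:
            raise RuntimeError(f"could not update {machine_id} in {candidates}")
    return placed, calls


if __name__ == "__main__":
    # Stand-in updater: pretend every attempt in sjc fails.
    result = update_all(
        {"m1": "sjc", "m2": "sea"},
        lambda machine_id, region: region != "sjc",
        ["sea", "ord"],
    )
    print(result)
```

With no failures the call count equals the number of machines; each failed attempt adds one more call.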