Depot remote builders becoming the default

Hey, this was a bug in our provisioner code that interacts with the Fly machines API - we’ve pushed a fix to prevent that issue going forward.

If you’re curious about the tech details, this bug happened when launching a new Fly machine for your builder instance. Our provisioner performs a reconciliation loop of the following steps:

  1. List all machines and volumes to capture the “state of the world”
  2. Create / update / delete volumes as needed
  3. Create / update / delete machines as needed
  4. Loop
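To make that concrete, here's a very rough sketch of the shape of that loop. The package, type, and function names below are made up for illustration - this is not our actual provisioner code or the real Fly Machines API client:

```go
// Rough sketch only: all of these names are hypothetical and are not our
// real provisioner code or the Fly Machines API client.
package provisioner

import (
	"context"
	"time"
)

type Volume struct {
	ID   string
	Name string
}

type Machine struct {
	ID       string
	VolumeID string
}

// Stubs standing in for calls to the machines / volumes APIs.
func listVolumes(ctx context.Context) ([]Volume, error)   { return nil, nil }
func listMachines(ctx context.Context) ([]Machine, error) { return nil, nil }
func reconcileVolumes(ctx context.Context, volumes []Volume)                      {}
func reconcileMachines(ctx context.Context, machines []Machine, volumes []Volume) {}

func reconcileOnce(ctx context.Context) error {
	// 1. List all machines and volumes to capture the "state of the world".
	volumes, err := listVolumes(ctx)
	if err != nil {
		return err
	}
	machines, err := listMachines(ctx)
	if err != nil {
		return err
	}

	// 2. Create / update / delete volumes as needed.
	reconcileVolumes(ctx, volumes)

	// 3. Create / update / delete machines as needed, attaching a cache
	//    volume to each newly launched machine.
	reconcileMachines(ctx, machines, volumes)
	return nil
}

func runReconciler(ctx context.Context) {
	// 4. Loop.
	for {
		_ = reconcileOnce(ctx) // errors just get retried on the next pass
		time.Sleep(5 * time.Second)
	}
}
```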

Sometimes, though, there’s a brief delay between a volume being created and that volume appearing in the volume list. When this happens, it’s possible for us to create two volumes with the same name. Our provisioner detects this case and cleans up the unnecessary second volume.

However, we had a bug where the provisioner could incorrectly assign the volume it was about to clean up to the new machine it was about to launch. This would cause an error when it attempted to delete that volume, and the provisioner would destroy the Fly machine and restart the process from the top to recover.

What that looked like to you was the build starting, then being canceled a few seconds later (CANCELED in the logs) as the machine terminated. Subsequent builds would succeed, since by then the cache data volume already existed.

The fix was making sure that our provisioner chooses the correct volume to mount on the new machine and doesn’t select the one that’s being cleaned up.
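In terms of the made-up sketch above, the fix amounts to skipping any volume that the current pass has already queued for deletion when picking the volume to attach. Again, hypothetical names only:

```go
// Continuing the hypothetical sketch above: pick the cache volume to attach
// to a new machine, skipping any volume this reconciliation pass has already
// queued for cleanup.
func chooseVolume(volumes []Volume, name string, pendingDelete map[string]bool) (Volume, bool) {
	for _, v := range volumes {
		if v.Name != name {
			continue
		}
		if pendingDelete[v.ID] {
			// This is the duplicate about to be cleaned up. Attaching it is
			// what used to break the delete and force the machine to be
			// destroyed so the loop could start over.
			continue
		}
		return v, true
	}
	return Volume{}, false
}
```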


I upgraded from 0.3.6 to 0.3.7, and after that I was able to simply run fly deploy to push up my code. I can confirm the fix worked for me.

I’m confused - the status page says it’s resolved? I’m on v0.3.8 but still getting timeouts with “Waiting for Depot builder”. I finally found this thread and re-ran with --depot false to get my hotfix out.

Hey, that’s frustrating given that we’ve said things are resolved! Very few builds are hitting this intermittent timeout. I see yours hit it - we’re looking into what caused it.


I also only specify a builder and can confirm it’s working for me. Thanks!

I can see our organisation is using the depot builders, but we’ve hit an out-of-space issue. All prior community documentation on this issue said to destroy the builder app and start again, but now there appears to be nowhere to manage these builders, and I can’t see them using the CLI either?

Extracting tar content of undefined failed, the file appears to be corrupt: "ENOSPC: no space left on device [...]"

I signed up for Depot with the same email as Fly, thinking maybe I could see the machines over there, but no (which makes sense given the comment that they’re separate and only run within Fly). Even though the builder app links me over there, I have no visibility into these machines?

(had found a relevant comment here, Depot Remote Builder Support in Flyctl Is Now In Beta! - #34 by xHomu)

Depot machines don’t run inside your org - they’re fully managed by Depot. However, we’ll be introducing a CLI flag to allow you to clear your cache. Larger cache sizes will be available soon as well.

That said, it looks like your Dockerfile is using COPY --link. Using --link triggers an upstream bug in BuildKit that can fill up your cache quickly. If you remove --link from your Dockerfile, that will likely help.

Also, I’ve gone ahead and reset your cache on both of your organizations.


Yes, it works now. Thank you for the quick support and resolution.

You can now clear your build cache from the Fly dashboard: Clearing your builder cache from the Fly dashboard

Keep an eye out for more improvements to the builder system.


The deployment step of our build pipeline started to time out after 1 hour without completing. It appears to have been making progress, but very slowly. This step previously took about 20 minutes. We disabled depot and we’re back to 20 minutes.

My builds are consistently about 1.5-2x slower than they were before - every time, for the past few days.

@stephen1 and @neil, can you contact support so we can take a closer look?

I already did, and responded to a request for more information. Currently waiting for another response. My account is under Enaia Inc., FYI.

It looks like you had a few builds running simultaneously, using up all the builder CPU. If you’re deploying multiple apps, you can use fly deploy --depot-scope app to ensure each app’s builds are isolated to their own builder. Also, it looks like on some of your more recent builds, your build step is ~20m.

Honestly this seems pretty inadequate. The situation is:

  • With depot: our deploy step times out at 1 hour, 100% of the time [Update: not quite true - with depot enabled, many of our deployments failed, and in the best case a deploy succeeds but takes 4X as long. See my next post for details].
  • Without depot: it always succeeds

FYI we do deployments in two situations:

  • We create staging apps when a PR is opened, and update them whenever updates are pushed to the branch associated with the PR
  • When a PR is merged to our main branch, we kick off parallel deployments to our production server and to another server with test data whose code mirrors production.

The screenshots I sent via email to support happened to be from the second scenario, but we also saw timeouts in the first case, when we only had a single build running.

Maybe --depot-scope would help in the second situation, but as I said, we were seeing timeouts even with a single build running. But also: shouldn’t --depot-scope be the default? As it is, you’ve made a behind-the-scenes change such that things that previously worked are now broken. We never had any issues with builds running simultaneously in the past.

The parallel-build issue aside, the slowness and timeouts make depot unusable for us. I guess we’ll leave depot disabled and hope you don’t ever deprecate your own builders; however, this line from your original post indicates that isn’t the plan:

Around that time, we’ll likely stop supporting the standard Fly remote builder.

So it sounds like we’ll need to spend time convincing you that there’s a real issue here.


As a workaround, you can opt to run builds on your own CI/build machine (with flyctl deploy ... --local-only).

Hey, do you by chance have an example repo that reproduces the timeout? From what we can see on our end, the majority of the slow build time is attributed to RUN mix deps.compile, which uses all 8 of the machine’s CPUs during that time. Generally that step has taken 20-30 min, and your recent builds from Tuesday seem to have succeeded.

If you are able to make a repo with the same dependencies as your app, I’m happy to profile the compile on my end.

With depot disabled, mix deps.compile took about 5:34 in our most recent run, and the whole deploy step took 9:12.

So, what you said about the time to run the deps.compile step is not accurate for our builds with depot disabled:

generally it’s taken 20-30 min to process that step

Here’s a run from when depot was enabled: mix deps.compile took 20:47 and the whole thing took 34:43.

This deployment did succeed, since it completed in under 60 minutes. So what I said earlier about it always failing was inaccurate. But even in this case where it worked, a 4X slowdown (4X more GitHub Actions minutes used, and 4X longer before we can deploy fixes to production) is still not great.

The mix deps.compile step is the single longest step, but even factoring that out, the other steps took about 14 minutes on the depot run (34:43 - 20:47) vs. about 3:40 on our runs without depot (9:12 - 5:34), so I don’t think profiling or optimizing our compile is going to help. The whole thing is just much, much slower with depot. To be clear: we don’t have a goal of getting onto depot, and honestly I don’t want to spend more time on this. I’d prefer to keep things the way they are; I’m just concerned since you’ve stated that you’re going to force everyone onto depot.


Is there a reason that the builder says “Depot” when deploying a pre-built image directly?

Screenshot from https://fly.io/apps/{app-name}/releases/{release-number}

Hey, just a general reply on what we’d like to see working with depot builders and the build process itself:

  1. We currently have to build locally (or use the legacy builders) because we can’t increase the memory allocation of depot builders.
  2. We’d like to split up our Dockerfile into multiple files (a frontend and a backend Dockerfile for development, with a build target in the backend Dockerfile that copies the compiled frontend over for production). This doesn’t seem to be possible at the moment, so we have to keep a copy of the frontend build steps in the backend Dockerfile.