Hey, this was a bug in our provisioner code that interacts with the Fly machines API - we’ve pushed a fix to prevent that issue going forward.
If you’re curious about the tech details, this bug happened when launching a new Fly machine for your builder instance. Our provisioner runs a reconciliation loop with the following steps (rough sketch below):
List all machines and volumes to capture the “state of the world”
Create / update / delete volumes as needed
Create / update / delete machines as needed
Loop
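In case it helps to picture it, here’s a minimal sketch of that loop in Go. Everything in it (the machinesAPI interface, Volume, Machine, and the method names) is a hypothetical stand-in rather than the real Fly machines API or our actual provisioner code:

```go
package provisioner

import (
	"context"
	"log"
	"time"
)

// Volume and Machine are simplified stand-ins for the real API objects.
type Volume struct {
	ID   string
	Name string
}

type Machine struct {
	ID       string
	VolumeID string
}

// machinesAPI is a hypothetical client interface; the real provisioner talks
// to the Fly machines API over HTTP.
type machinesAPI interface {
	ListVolumes(ctx context.Context) ([]Volume, error)
	ListMachines(ctx context.Context) ([]Machine, error)
	ReconcileVolumes(ctx context.Context, current []Volume) error
	ReconcileMachines(ctx context.Context, current []Machine, volumes []Volume) error
}

// reconcile runs one pass: list machines and volumes to capture the "state
// of the world", then create/update/delete volumes and machines as needed.
func reconcile(ctx context.Context, api machinesAPI) error {
	volumes, err := api.ListVolumes(ctx)
	if err != nil {
		return err
	}
	machines, err := api.ListMachines(ctx)
	if err != nil {
		return err
	}
	if err := api.ReconcileVolumes(ctx, volumes); err != nil {
		return err
	}
	return api.ReconcileMachines(ctx, machines, volumes)
}

// runLoop repeats the pass on an interval - the "Loop" step above.
func runLoop(ctx context.Context, api machinesAPI, interval time.Duration) {
	for {
		if err := reconcile(ctx, api); err != nil {
			log.Printf("reconcile: %v", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}
```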
Sometimes, though, there’s a brief delay between a volume being created and that volume appearing in the volume list. When this happens, it’s possible for us to create two volumes with the same name. Our provisioner detects this case and cleans up the unnecessary second volume.
However, we had a bug where the provisioner could incorrectly assign the volume it was about to clean up to the new machine it was about to launch. This caused an error when it then attempted to delete that volume, and the provisioner would destroy the Fly machine and restart the process from the top to recover.
What that would look like to you is the build starting, then a few seconds later being canceled (CANCELED in the logs) as the machine terminated. Future builds would succeed, since now the cache data volume already existed.
The fix was making sure that our provisioner chooses the correct volume to mount on the new machine and never selects the one that’s being cleaned up.
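For illustration, the selection logic now looks roughly like this (again hypothetical names, reusing the Volume type from the sketch above - not our actual code):

```go
// pickVolumeForMachine returns the volume a new machine should mount,
// skipping any duplicate that the provisioner has already marked for cleanup.
// Attaching one of those doomed duplicates is exactly the bug described above.
func pickVolumeForMachine(name string, volumes []Volume, markedForCleanup map[string]bool) (Volume, bool) {
	for _, v := range volumes {
		if v.Name != name {
			continue
		}
		if markedForCleanup[v.ID] {
			// About to be deleted; never hand this one to the new machine.
			continue
		}
		return v, true
	}
	return Volume{}, false
}
```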
I’m confused, status page says it’s resolved? I’m on v0.3.8 but still getting timeouts with “Waiting for Depot builder”. Finally found this thread and re-ran with --depot false to get my hotfix out.
Hey, that’s frustrating given that we’ve said things are resolved! Very few builds are hitting this intermittent timeout. I see yours hit it - we’re looking into what caused it.
I can see our organisation is using the depot builders, but we’re hitting an out-of-space issue. All prior community documentation on this issue said to destroy the builder app and start again, but there is nowhere to manage these now, it appears, and I can’t see the builders using the CLI either?
Extracting tar content of undefined failed, the file appears to be corrupt: "ENOSPC: no space left on device [...]"
I signed up for Depot with the same email as Fly, thinking maybe I could see the machines over there, but no (which makes sense given the comment that they are separate and only run within Fly). Even though the builder app links me over there, I have no visibility into these machines?
Depot machines don’t run inside your org - they’re fully managed by Depot. However, we’ll be introducing a CLI flag to allow you to clear your cache. Larger cache sizes will be available soon as well.
That said, it looks like your Dockerfile is using COPY --link. Using --link triggers an upstream bug in BuildKit that can fill up your cache quickly. If you remove --link from your Dockerfile, that will likely help.
Also, I’ve gone ahead and reset your cache on both of your organizations.
The deployment step of our build pipeline started to time out after 1 hour without completing. It appears to have been making progress, but very slowly. This step previously took about 20 minutes. We disabled depot and we’re back to 20 minutes.
It looks like you had a few builds running simultaneously, using up all the builder CPU. If you’re deploying multiple apps, you can use fly deploy --depot-scope app to ensure each app’s builds are isolated to their own builder. Also, it looks like on some of your more recent builds, your build step is ~20m.
Honestly this seems pretty inadequate. The situation is:
With depot: our deploy step times out at 1 hour, 100% of the time [Update: not true - with depot enabled many of our deployments failed, and in the best case it succeeds but takes 4X as long for the deploy step. See my next post for details].
Without: always succeeds
FYI we do deployments in two situations:
We create staging apps when a PR is opened, and update them whenever updates are pushed to the branch associated with the PR
When a PR is merged to our main branch we kick off parallel deployments to our production server and to another server w/ test data with code that mirrors production.
The screenshots I sent via email to support happened to be from the second scenario, but we also saw timeouts in the first case, when we only had a single build running.
Maybe --depot-scope would help in the second situation, but again, those first-case timeouts happened with only a single build running. But also: shouldn’t --depot-scope be the default? As it is, you’ve made a behind-the-scenes change such that things that previously worked are now broken. We never had any issues with builds running simultaneously in the past.
The parallel build issue aside, the slowness and timeouts make depot unusable for us. I guess we’ll leave depot disabled and hope you don’t ever deprecate your own builders; however, this line from your original post indicates that isn’t the plan:
Around that time, we’ll likely stop supporting the standard Fly remote builder.
So it sounds like we’ll need to spend time convincing you that there’s a real issue here.
Hey, do you by chance have an example repo that reproduces the timeout? From what we can see on our end, the majority of the slow build time is attributable to RUN mix deps.compile, which uses all 8 of the machine’s CPUs while it runs. That step has generally taken 20-30 minutes to process, and your recent builds from Tuesday seem to have succeeded.
If you are able to make a repo with the same dependencies as your app, I’m happy to profile the compile on my end.
This deployment did succeed since it completed in under 60 minutes. So what I said earlier about it always failing was inaccurate - but even in this case where it worked, a 4X slowdown, a 4X increase in GitHub Actions minutes used, and a 4X longer wait before we can deploy fixes to production is still not great.
The mix deps.compile step is the single longest step, but even factoring that out, the other steps took about 14 minutes on the depot run (34:43 - 20:47) vs. about 3:40 on our runs without depot (9:12 - 5:34), so I don’t think profiling or optimizing our compile is going to help. The whole thing is just much, much slower with depot. To be clear: we don’t have a goal of getting onto depot, and to be honest I don’t want to spend more time on this. I’d prefer to keep things the way they are; I’m just concerned since you’ve stated that you’re going to force everyone onto depot.