"no space left on device" when remote building

I’ve been using a remote builder successfully for many builds all day. Suddenly, my build starts failing with:

Error error building: error rendering build status stream: open /data/docker/...: no space left on device

Do you have some kind of space leak maybe?

We have logic in place that should clear space in the event that too much has been allocated. It seems like it was unable to reclaim space in this case? Hmm.

Every organization has its own builder. They come with a 50GB volume.

The easiest way to fix the issue is to destroy your builder. Find the name of your builder for the organization that owns the app you’re building by looking for an app that starts with fly-builder- and destroy it with flyctl apps destroy fly-builder-xyz. The next build will create a new builder with a fresh volume.
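
If you're not sure which app is the builder, listing the org's apps and picking out the one named fly-builder-* works. Roughly (fly-builder-xyz is just a placeholder, as above):

flyctl apps list
flyctl apps destroy fly-builder-xyz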

The harder way is to start a build, let it fail, then flyctl ssh console -a fly-builder-xyz and run docker system prune --all --volumes --force to delete everything. To be honest, it’s probably not worth doing that. I’d be curious to see what’s taking up so much space, though.
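
If you do go the SSH route, one way to see what’s actually eating the space before pruning is roughly this (same placeholder builder name; /data is where the build volume appears to be mounted, going by the error above):

flyctl ssh console -a fly-builder-xyz
df -h /data
docker system df
docker system prune --all --volumes --force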

Builds have been failing consistently or just the one?

I tried to build several times and it failed with out-of-space errors at a different point in the build every time. I deleted the builder and now it’s working.

Just hit this. Will destroy builder and see if it helps.

My new builder got created but it has no volume and it’s the minimum spec machine possible.

Error failed to fetch an image or build from source: error connecting to docker: Mounts source volume "vol_" does not exist

Ok ignore me, the volume eventually appeared :slight_smile:

I just ran into this, and I’m wondering if maybe the cleanup doesn’t happen if a build is cancelled mid-build?

I’ve set my app to auto-deploy when there’s a new commit to the main branch, which happens whenever Dependabot updates a dependency.

To avoid actually deploying multiple times, I’ve made GitHub Actions cancel an existing run if there’s one already in progress.
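
For anyone curious, the usual way to get that cancellation behaviour is GitHub Actions’ workflow-level concurrency setting, something like this (the group name is just an example):

concurrency:
  group: fly-deploy
  cancel-in-progress: true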

Could that be causing an issue with the cleanup in the volume? (just thinking out loud, might be completely wrong here!)

@jerome
I have been using Fly for nearly 8 months now, and this has happened to me consistently. It’s almost at the point where I can predict when I need to destroy my remote builder.

Whatever mechanism you guys have in place to clean up the space is not working, I know that much :smiley:

It’s one of those issues that is only a minor inconvenience, so it will probably not be a priority to fix, but fingers crossed.

Good news! We made some changes to the purge logic yesterday. Delete your current builder app and you’ll get a replacement with fixes.

Wow what great timing! Thanks!

Okay, so this issue has since happened to me 3 times with fresh remote builders.

I am going to be leaving Fly for the time being; I just don’t see this service as production-ready. I’m sticking with Kubernetes, where at least I can fix infrastructure issues myself.

@michael any chance you could shed light on what the current purge logic is?
We likewise see our remote builder instance storage just monotonically increasing indefinitely and never reclaiming stale build volumes.

We increased the storage limit of our builder instance, but that seemingly just pushes out the inevitable forced destroy, because the auto-purge logic never seems to kick in for us.
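
For anyone trying the same stopgap, growing the volume and keeping an eye on usage looks roughly like this (volume ID, size and builder name are placeholders; double-check the flags against your flyctl version):

flyctl volumes list -a fly-builder-xyz
flyctl volumes extend <volume-id> -s 100
flyctl ssh console -a fly-builder-xyz -C "df -h /data"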

Here’s the block of code that decides when to prune: rchab/main.go at main · superfly/rchab · GitHub

Got it, thanks for sharing.

So by default the current thresholds are:

  • more than 80% of the volume used, OR less than ~15GB of free space remaining (15 * 1000 * 1000 * 1000 bytes)
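
Not the actual rchab code, but here’s a rough sketch in Go of a check against those thresholds, assuming "remaining" means free space and that the build volume is mounted at /data (the names and mount path are my assumptions):

package main

import (
    "fmt"
    "syscall"
)

const (
    dataDir        = "/data"                 // assumed mount point of the build volume
    usageThreshold = 0.80                    // prune when more than 80% of the volume is used...
    minFreeBytes   = 15 * 1000 * 1000 * 1000 // ...or when less than ~15GB remains free
)

// shouldPrune reports whether disk usage on the build volume has crossed
// either threshold. syscall.Statfs is Unix-only, which is fine for the
// Linux-based remote builders.
func shouldPrune() (bool, error) {
    var st syscall.Statfs_t
    if err := syscall.Statfs(dataDir, &st); err != nil {
        return false, err
    }
    total := st.Blocks * uint64(st.Bsize)
    free := st.Bavail * uint64(st.Bsize)
    usedFraction := 1 - float64(free)/float64(total)
    return usedFraction > usageThreshold || free < uint64(minFreeBytes), nil
}

func main() {
    prune, err := shouldPrune()
    if err != nil {
        panic(err)
    }
    fmt.Println("prune needed:", prune)
}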

An unfortunate caveat to note (maybe not too surprising, but disruptive to daily build operations, e.g. with a CI setup, once the cleanup threshold is reached): while the builder volume is undergoing cleanup, any build attempt that gets queued up during that window will fail. For example, here are error logs from cleanup happening mid-build-attempt:

2023-03-14T08:14:55.180 app[app_id] dfw [info] Error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

2023-03-14T08:14:55.190 app[app_id] dfw [info] time="2023-03-14T08:14:55.190361280Z" level=debug msg="Checking dockerd healthyness"

2023-03-14T08:14:55.191 app[app_id] dfw [info] time="2023-03-14T08:14:55.190449151Z" level=error msg="error waiting on docker: signal: killed"

2023-03-14T08:14:55.191 app[app_id] dfw [info] time="2023-03-14T08:14:55.190898231Z" level=info msg="dockerd has exited"

2023-03-14T08:14:55.191 app[app_id] dfw [info] time="2023-03-14T08:14:55.191110991Z" level=fatal msg="dockerd exited before we could ascertain its healthyness"

2023-03-14T08:14:55.707 app[app_id] dfw [info] Starting clean up.

2023-03-14T08:14:55.707 app[app_id] dfw [info] Umounting /dev/vdb from /data

2023-03-14T08:14:55.707 app[app_id] dfw [info] error umounting /data: EBUSY: Device or resource busy, retrying in a bit

2023-03-14T08:14:56.458 app[app_id] dfw [info] error umounting /data: EBUSY: Device or resource busy, retrying in a bit

2023-03-14T08:14:57.210 app[app_id] dfw [info] error umounting /data: EBUSY: Device or resource busy, retrying in a bit

2023-03-14T08:14:57.961 app[app_id] dfw [info] error umounting /data: EBUSY: Device or resource busy, retrying in a bit

2023-03-14T08:14:59.719 app[app_id] dfw [info] [ 24.362316] reboot: Restarting system

2023-03-14T08:15:01.323 runner[app_id] dfw [info] machine has reached its max restart count (10)

Unfortunately it looks like we are also witnessing some sort of strange OOM crash loop when the automatic cleanup process does ultimately kick in (on a builder with 8GB of memory, and the OOM happens while the instance is restarting as part of the cleanup process, i.e. not while actually processing a build, which is very curious). It just crash loops constantly, so even when the cleanup succeeds we seemingly have to destroy the instance manually anyway, because it fails to restart gracefully after cleanup.