Docker image size limit raised from 2GB to 8GB

Previously, when you deployed a Docker image larger than 2GB, fly.io would throw an error. You can now deploy images up to roughly 8GB.

We expect this to be particularly useful to those of you doing machine learning work or running developer environments on fly.io. We’re interested in hearing about your use cases! Does this increase let you run something you couldn’t run before? Should we raise the limit even higher?


We’ve been keeping an eye on using Fly Machines to augment our game server orchestration. The larger image size limit is helpful even just to know that if our server images grow in the future, we won’t hit a bottleneck at that layer. It also gets closer to parity with PlayFab servers’ 10GB asset limit.

The last time I took a stab at adding Fly Machines to our matchmaker-based orchestration, I believe I hit a snag configuring machine sizes. That was last year, before the Machines improvements and the public API addition, so I’m excited to give it another go!


I still run into issues because a layer push fails at approx 4GB. I’m doing some ML stuff.

Can you re-run it with LOG_LEVEL=debug and share the output?

I agree: it isn’t possible to push a layer above 4.3GB or so. The push simply fails repeatedly around that point. A layer produced by pip install autogluon.tabular[all] pulls in enough cursed dependencies to trigger this issue.
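If it helps reproduce, a minimal Dockerfile along these lines triggers the oversized layer for me; the base image tag is arbitrary:

    FROM python:3.11

    # A single RUN step yields one enormous layer, since all of autogluon's
    # heavyweight dependencies land in the same image layer
    RUN pip install "autogluon.tabular[all]"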

We updated our registry last week, and large layer uploads should work more reliably now. They will likely still be slow, but they should now be able to complete.

Previously, we had an HTTP request body read timeout of 5 minutes. Streaming the full content of a layer larger than a couple of GB would often take more than 5 minutes, which hit that timeout. Last week we raised the read timeout to 60 minutes, which should allow even 8GB layers to succeed.

We have some ideas about how to get resumable pushes and concurrent pushes working for individual layers. Resumable pushes will let the client resume uploading a single layer where a previous push failed; the earlier issue was exacerbated by requiring the pusher to start over each time. Concurrent pushes will help speed up pushes for large layers. We’ll make a Fresh Produce post when those are implemented.
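For the curious: resumable pushes build on the chunked upload flow from the OCI distribution spec. Roughly sketched with curl below; the app name, byte offsets, and file names are illustrative, not our exact endpoints:

    # Open an upload session; the Location response header carries the upload URL
    curl -X POST https://registry.fly.io/v2/myapp/blobs/uploads/

    # Push one chunk. Content-Range records the committed byte range, so after
    # a failure the client can ask the registry for the last accepted offset
    # and resume from there instead of starting the whole layer over.
    curl -X PATCH "$UPLOAD_URL" \
      -H "Content-Type: application/octet-stream" \
      -H "Content-Range: 0-10485759" \
      --data-binary @layer.chunk0

    # Close the session by supplying the blob's digest
    # (append with & instead if the upload URL already carries query parameters)
    curl -X PUT "$UPLOAD_URL?digest=sha256:<layer-digest>"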


I’m trying to deploy a Docker container that uses torch, but the resulting container is unimaginably large at 11 GB. When I build the Docker container locally, my Rancher Desktop console says the container is 1.35 GB.

In case it’s useful, here is a link to the app: registry.fly.io/elocator

Any tips?

I found a solution for CPU-only model inference. I’m using Poetry for dependency management, and I couldn’t get Poetry to play nicely with the CPU version of torch (though you may have more success; it seems others have).

Instead, I removed torch from Poetry and added this to my Dockerfile: RUN pip3 install torch --index-url https://download.pytorch.org/whl/cpu

That reduced the size from 11 GB to 2 GB, and now I’m up and running.
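In case it helps anyone else, here’s roughly what my Dockerfile looks like now. Treat it as a sketch: the Python version, file names, and entrypoint are placeholders for my app.

    FROM python:3.11-slim
    WORKDIR /app

    # Install the CPU-only torch wheel first, so Poetry never resolves the
    # multi-GB CUDA build as a transitive dependency
    RUN pip3 install torch --index-url https://download.pytorch.org/whl/cpu

    # Install the remaining dependencies with Poetry
    # (torch removed from pyproject.toml)
    COPY pyproject.toml poetry.lock ./
    RUN pip3 install poetry && \
        poetry config virtualenvs.create false && \
        poetry install --no-interaction --no-root

    COPY . .
    CMD ["python3", "main.py"]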

Hey there!

Are the model files being built into the container image (vs. being downloaded after it runs)? Usually we would attach a volume and have the models download to that location. Then you can reuse that volume on new Machines so the models already exist there.
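As a rough sketch (the volume name, size, and mount path are placeholders), you’d create a volume and mount it in fly.toml, then point the app’s model cache at the mount path:

    # Create a 20GB volume once, up front:
    #   fly volumes create models --size 20

    [mounts]
      source = "models"
      destination = "/models"

Only the first boot of a Machine with a fresh volume pays the download cost; subsequent Machines reusing the volume start with the models already in place.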

For example, if you try to reuse a container image from https://replicate.com, it will easily exceed the 8GB limit. I also think the best thing to do would be to make external registries available.

I have a tenuous sense that Cog often increases the size of an image, but I don’t really know, and its ease is still appealing.

Presumably there still has to be a copy of the image on the hardware where the Firecracker VM gets started, and that might have a limit separate from the one the Fly registry imposes.

Yes, the Fly team will probably have to increase the default size allowed for the rootfs, or we’ll have to consider using bottomless storage on our own.