Faster, more reliable remote image builds + deploys

We’ve been working on some big updates to the build+deploy experience! We’ve added a local mirror for the registry.fly.io registry service that manages Docker images used to deploy Machines for your app, and also released a Buildkit-based rewrite of our semi-automated, flyctl-managed builder app that builds an image from source code. These two updates combined bring you faster, more reliable builds and deploys.

Try it out with the new --buildkit deploy flag, and please let us know how it goes!

fly deploy --buildkit

How much faster?

It depends on the region where the builder and machine are located, and the type of app being built. The greatest improvements will be for remote builders and machine deploys in regions far away from iad, and for builds spending most of their time waiting on pushing large images to the registry (with mostly-cached, or lightweight build steps).

For an extreme example, we timed a fully-cached build of a FROM ubuntu image, on idle (stopped) remote builders in the nrt region, and ran identical builds across the existing Depot and legacy (rchab-based) remote builders for comparison:

So at the extreme, something on the order of a ~20x speedup should be possible, for some (certainly not all!) use cases.

Technical Details

Fly Registry, with less Registry

The Fly Registry (registry.fly.io) is a distributed Fly app with instances covering the globe, all using a shared object-storage bucket near the iad region to store image data in a central, consistent location. For many cases, especially when the image is large and/or the builder is in a region far from iad, this results in very long image transfer times over slow, globe-spanning public Internet routes, and possibly timeouts or other failures if packet loss or network partitions occur. To improve this, we came up with a solution that removes the Registry from the deploy process entirely, turning image push+pull steps into fast and reliable local operations.

We built a registry-mirror service, available to Machines through the internal Machines API (_api.internal:5000), that allows a properly-configured builder Machine to push registry.fly.io-namespaced images directly into its parent host’s containerd service, where it’s immediately ready to deploy Machines. (A background process still pushes the image to the upstream server where it will be written to durable object storage.)

Pushing images to a proxy mirror is not currently supported with existing container-image tooling, so to make it work we wrote a patch for Buildkit that we’ve submitted upstream, and in the meantime we’re using this patch in a refresh of our Fly Builder app.

We’re also working on some further architectural changes so the registry.fly.io server itself can handle this host-local mirroring, so this improvement can be used with unpatched Buildkit, or existing, unmodified Depot builders.

Fly Builder refresh

We also did a pass through our legacy Fly Builder app (rchab, est. Jan 2021), with an eye towards improving its performance and reliability by leveraging modern platform features. The result is a rewrite of the builder app that’s basically a very thin wrapper around Buildkit. This refresh brings several improvements:

  • Running Buildkit directly is more lightweight than Docker Engine, and the builder boots several seconds faster;

  • Securely connects over your org’s Flycast private IPv6 network, instead of over public IPv4 with a separate auth layer stacked on top;

  • Proxy-managed autostart and autostop keep the app simple, reliable, and resource-efficient between builds.

Try it out!

We’ve added tooling to the latest release of flyctl so you can opt in to the new Fly Builder setup through a simple command-line flag (fly deploy --buildkit). We’re also working on adding transparent support for the registry mirror in existing Depot builders and any other tools that use registry.fly.io directly, so you should soon be able to use these registry improvements with whichever builder service works best for your setup.

Please try it out and give us feedback on these updates!

Awesome! Does this mean Fly deployments no longer depend on Depot when using this flag?

When trying --buildkit I get this error, what does it mean?

Error: failed to fetch an image or build from source: failed to create buildkit client: failed to launch VM: Deploying over the remote builder is not allowed. (Request ID: 01K4FRAHCPBVQ5GKHJ9K4XZ229-fra) (Trace ID: 431043a8ad33c8356f26e0b828805059)

Ah, thanks for the feedback. It looks like we needed to deploy one more change to make the remote builder available to everyone. Could you try it again?

Correct, the --buildkit remote builder is an update to the legacy Fly Builder, which is separate from Depot.

Just tried and it’s much faster than before, new deployments take a few seconds to build. Thank you!

Curiously, we had to revert to --depot builds.
With Depot: 2-3 minutes from build to deploy.
With Buildkit: 20 minutes, and it was still building what looked like the first stage.

Any idea what to look into?

Yes, especially for larger and more compute-intensive builds, you’ll likely see a slower build when resetting or switching remote builders, for a couple possible reasons:

  • builders cache image layers on local disk. This speeds up rebuilds where many layers are unchanged, but it also means the first, uncached build on a new builder is much slower;
  • builder Machines use a shared CPU kind with burstable CPU performance and an initial burst balance of 50 seconds. For CPU-heavy builds, once this initial balance is exhausted, the builder runs at its baseline (6.25%) for the remainder of the build. Over time, with ongoing use, a builder can accumulate up to 8 minutes of burst balance that it can spend on future builds in 100% bursts, but new builders are more likely to get stuck at the baseline. (We’re aware this isn’t ideal and not very visible/controllable, and are looking at some ways to improve this moving forward.)
  • there may be builder configuration differences (CPU/RAM sizes or region) that might help explain other discrepancies.
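
To put rough numbers on the burst-balance point above, here is a back-of-the-envelope sketch. It assumes a simplified model that is not Fly's exact accounting: the build needs a fixed number of CPU-seconds at 100%, the balance is spent first at full speed, and the remainder runs at the 6.25% baseline (1/16 speed) with no refill during the build. The function name and the 120-second workload are made up for illustration.

```shell
# Simplified model (an assumption, not Fly's real accounting):
#   work_s    = CPU-seconds the build needs when running at 100%
#   balance_s = burst balance available when the build starts
# After the balance is spent, the builder runs at the 6.25% baseline,
# which is 1/16 speed, so each remaining CPU-second costs 16 wall seconds.
estimate_build_seconds() {
  work_s=$1
  balance_s=$2
  if [ "$work_s" -le "$balance_s" ]; then
    echo "$work_s"
  else
    echo $(( balance_s + (work_s - balance_s) * 16 ))
  fi
}

estimate_build_seconds 120 50    # new builder: prints 1170 (about 19.5 minutes)
estimate_build_seconds 120 480   # builder with a full 8-minute balance: prints 120
```

Under these assumptions, a build with only two minutes of real CPU work stretches to roughly 20 minutes on a fresh builder, while the same build fits entirely inside the burst window on a builder that has had time to accumulate balance.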

Build times went from 3m 39s to 4m 15s and from 2m 7s to 4m 15s on Buildkit, deployed from GitHub Actions. Perhaps they will be faster on subsequent builds? They do seem to start quicker, though.
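
For anyone wanting to compare first and subsequent builds, a small wall-clock timer along these lines may help. This is only a sketch: time_cmd is a hypothetical helper name, and the deploy lines are left commented out because each one triggers a real deploy.

```shell
# Hypothetical helper: run any command, report elapsed wall-clock seconds
# on stderr, and pass through the command's exit status.
time_cmd() {
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  echo "elapsed: $(( end - start ))s ($*)" >&2
  return "$rc"
}

# Example comparison (uncomment to run; each line performs a real deploy):
# time_cmd fly deploy --buildkit
# time_cmd fly deploy --buildkit   # second run, with warm layer cache
# time_cmd fly deploy --depot
```

Running the same builder twice in a row separates the one-off cost of a cold cache from the steady-state build time.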

Just to pop another idea in this thread; if anyone has slow build times, but happens to be using a beefy CI server, I think you can do Docker builds locally, and then push the image to Fly. Depending on the hardware you have, this may be faster.
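
A rough sketch of that local-build flow, assuming Docker and flyctl are installed on the CI machine. The app name my-app, the tag v1, and the run / build_and_deploy helper names are placeholders, and DRY_RUN=1 only prints the commands instead of running them.

```shell
# APP and TAG are placeholder values; override them in the environment.
APP="${APP:-my-app}"
TAG="${TAG:-v1}"
IMAGE="registry.fly.io/${APP}:${TAG}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps, each stopping the chain if it fails:
#   1. build the image on the CI machine
#   2. configure Docker credentials for registry.fly.io
#   3. push the finished image
#   4. deploy that image, so no remote build is needed
build_and_deploy() {
  run docker build -t "$IMAGE" . &&
    run fly auth docker &&
    run docker push "$IMAGE" &&
    run fly deploy --app "$APP" --image "$IMAGE"
}
```

Try DRY_RUN=1 build_and_deploy first to check the image name. Note that the push still goes to registry.fly.io over the public Internet, so very large images may be slow to upload from CI.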

Of course, the other option is to optimise your builds. Making them small is a good first step, e.g. use Alpine where possible, instead of Ubuntu. You can also build an intermediate image that rarely changes, so that your builds go on top of that, rather than building everything from scratch.