Exploring Faster Machine Creates

tl;dr

We are working on making machine creates faster. You probably care about this, but probably not yet. Nothing is changing right now for consumers of the machines API. But changes are coming and we want to let you know what we’re working on. Feel free to ignore the details below and just know that faster machine creates are coming; they are just not here yet.

Context

We want machine “creates” to be fast. There are a lot of moving parts to go from a Docker image to a running machine instance. Machine instances run as Firecracker VMs and many of the moving parts are involved in transmogrifying the Docker image into a mountable “rootfs” filesystem for the Firecracker VM.

It takes time to pull an image from the registry and it takes time to unpack successive layers of the image into a filesystem. We are exploring fundamental changes to the way we make the rootfs filesystem available and we want to highlight some of the early work around this. This has the potential to make machine creates significantly more responsive and we’re excited about what this means for how machines are used. We are not yet at the point where you will notice any changes to machine create times, but we are laying the groundwork for this.

What is involved in creating a machine?

We run machines as Firecracker VMs, with a rootfs filesystem prepared from a Docker image. This is why you can easily launch an app on Fly.io from a simple Dockerfile. We leverage containerd to manage Docker images on our servers and to prepare the necessary rootfs filesystem image for each machine instance.

Firecracker requires the rootfs to be provided as a filesystem image. We use containerd and the containerd devmapper-snapshotter to easily prepare a rootfs filesystem image from a Docker image [1]. Each layer of the Docker image is effectively a tar file containing new and updated files, and each layer needs to be “unpacked” onto the filesystem, applying that layer's changes on top of the previous ones to construct the final filesystem. We then pass the final resulting snapshot device directly to Firecracker as the rootfs.
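To make the “unpacking” step concrete, here is a minimal sketch of applying layers in order. It is not how containerd or the devmapper snapshotter is implemented: a filesystem is modeled as a plain dict of path to content, and the function names (`apply_layer`, `unpack`) are made up for illustration. The one real-world detail it borrows is the OCI image convention that an entry named `.wh.<name>` in a layer deletes `<name>` from the layers beneath it.

```python
WHITEOUT_PREFIX = ".wh."  # OCI convention: ".wh.<name>" deletes <name> from lower layers


def apply_layer(rootfs: dict, layer: dict) -> dict:
    """Return a new filesystem with one layer's changes applied on top of rootfs."""
    result = dict(rootfs)

    # First pass: whiteouts remove paths that came from lower layers.
    for path in layer:
        directory, _, name = path.rpartition("/")
        if name.startswith(WHITEOUT_PREFIX):
            target = name[len(WHITEOUT_PREFIX):]
            removed = f"{directory}/{target}" if directory else target
            for existing in list(result):
                if existing == removed or existing.startswith(removed + "/"):
                    del result[existing]

    # Second pass: new and updated files from this layer.
    for path, content in layer.items():
        name = path.rpartition("/")[2]
        if not name.startswith(WHITEOUT_PREFIX):
            result[path] = content

    return result


def unpack(layers: list) -> dict:
    """Apply every layer in order, starting from an empty filesystem."""
    rootfs = {}
    for layer in layers:
        rootfs = apply_layer(rootfs, layer)
    return rootfs


# Illustrative layers (hypothetical contents).
example_layers = [
    {"etc/motd": b"hello", "bin/app": b"v1"},   # base layer
    {"bin/app": b"v2", "etc/.wh.motd": b""},    # updates bin/app, deletes etc/motd
]
final_rootfs = unpack(example_layers)
```

Every machine create that starts from a cold host pays for this loop once per layer, which is the cost the rest of this post is trying to avoid.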

There are multiple things that impact the time it takes to prepare the rootfs:

  • We likely do not have the Docker image available on the host
  • One or more image layers may be missing even if a subset of shared layers is available locally
  • Multiple roundtrips between local containerd and the registry are required to retrieve the data for each layer (index, manifest, config, layer content etc.)
  • The content of each layer then needs to be unpacked into successive device mapper snapshots
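The factors above boil down to two questions per machine create: which layers still need to be pulled, and which still need to be unpacked. The sketch below illustrates that bookkeeping under a simplifying assumption: an unpacked snapshot can only be reused if the whole chain of layers beneath it matches, so reuse is limited to a prefix of the layer list. The function names and the `sha256:...` digests are placeholders, not real containerd calls or real image digests.

```python
def layers_to_pull(image_layers: list, local_content: set) -> list:
    """Layers whose content is not on this host and must come from the registry."""
    return [digest for digest in image_layers if digest not in local_content]


def layers_to_unpack(image_layers: list, unpacked_chains: set) -> list:
    """Layers that still need unpacking after reusing the longest unpacked prefix.

    unpacked_chains holds tuples of layer digests, one per snapshot chain
    that has already been unpacked on this host.
    """
    for depth in range(len(image_layers), 0, -1):
        if tuple(image_layers[:depth]) in unpacked_chains:
            return image_layers[depth:]
    return list(image_layers)


# Placeholder digests for a three-layer image.
image = ["sha256:base", "sha256:deps", "sha256:app"]

# A host that has seen another image sharing the first two layers.
warm_content = {"sha256:base", "sha256:deps"}
warm_chains = {("sha256:base",), ("sha256:base", "sha256:deps")}

warm_pull = layers_to_pull(image, warm_content)
warm_unpack = layers_to_unpack(image, warm_chains)

# A host that has never seen any of these layers.
cold_pull = layers_to_pull(image, set())
cold_unpack = layers_to_unpack(image, set())
```

Even in the warm case at least one layer has to be pulled and unpacked, and in the cold case all of them do.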

Possible Approaches

We want to explore ways to reuse existing rootfs images to avoid incurring the cost of pulling image layers and unpacking them for each machine create. Ideally we can build a specific rootfs image once and (re)use it multiple times, across multiple machine instances and across multiple hosts. There are many approaches we can take to achieve this:

  • We could build and serve them from somewhere more centralized, distributing pre-prepared rootfs images on demand.
  • We could pre-emptively load them onto hosts in preparation for machines being created.
  • We could share them peer-to-peer between hosts, taking advantage of fast intra-region connections.
  • We could “lazily hydrate” them, accessing the rootfs as a remote block device and allowing a machine VM to launch before the rootfs data is fully available locally.

But there is some preparatory work that needs to be done before we can experiment with these different approaches. We need to break some existing dependencies and the assumption that the rootfs is always built from an image retrieved from the registry.
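Of these ideas, “lazy hydration” is probably the least familiar, so here is a toy sketch of the principle. It is purely illustrative and says nothing about how we would actually build it: `LazyBlockDevice` and its `fetch_block` callback are hypothetical names, and the “remote” side is just an in-memory byte string. The point is that a read only fetches the blocks it touches, so a VM could start reading before the whole rootfs is local.

```python
from typing import Callable


class LazyBlockDevice:
    """Read-only block device that fetches blocks from a remote source on first use."""

    def __init__(self, fetch_block: Callable[[int], bytes], block_size: int, num_blocks: int):
        self._fetch_block = fetch_block  # hypothetical callback: block index -> block bytes
        self.block_size = block_size
        self.num_blocks = num_blocks
        self._cache = {}                 # block index -> bytes already hydrated locally

    def read(self, offset: int, length: int) -> bytes:
        size = self.block_size * self.num_blocks
        if offset < 0 or length < 0 or offset + length > size:
            raise ValueError("read out of range")
        if length == 0:
            return b""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        chunks = []
        for index in range(first, last + 1):
            if index not in self._cache:
                # Only blocks that are actually read get fetched.
                self._cache[index] = self._fetch_block(index)
            chunks.append(self._cache[index])
        start = offset - first * self.block_size
        return b"".join(chunks)[start:start + length]

    @property
    def hydrated_blocks(self) -> int:
        return len(self._cache)


# Stand-in for a remote rootfs image: 4 blocks of 256 bytes.
remote_image = bytes(range(256)) * 4
fetched = []


def fetch_from_remote(index: int) -> bytes:
    fetched.append(index)  # record which blocks were requested
    return remote_image[index * 256:(index + 1) * 256]


device = LazyBlockDevice(fetch_from_remote, block_size=256, num_blocks=4)
first_read = device.read(250, 10)   # spans blocks 0 and 1
second_read = device.read(250, 10)  # served entirely from the local cache
```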

We made machine creates more flexible

To experiment with reusing existing images we made the machine VM lifecycle more flexible. The rootfs is now “pluggable” with machine creation no longer tied to a specific image “pull and prepare” implementation. This allows us to remove the assumption that a machine always interacts with containerd to obtain the image.

Technical details

The lifecycle of a machine VM was tightly coupled to the lifecycle of a containerd image snapshot. We introduced an abstraction at the point where the rootfs filesystem image is provided to Firecracker. We can still create a Firecracker VM instance with a rootfs prepared via containerd. But we can now optionally “override” this with an alternative, previously prepared image.
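As a rough illustration of what such an abstraction could look like, here is a minimal sketch. None of these names come from our codebase: `RootfsProvider`, `ContainerdRootfs`, `PreparedRootfs` and `rootfs_for_machine` are all hypothetical, and the returned strings are opaque stand-ins for whatever device handle is actually passed to Firecracker.

```python
from typing import Optional, Protocol


class RootfsProvider(Protocol):
    """Anything that can hand back a rootfs for a given image reference."""

    def provide(self, image_ref: str) -> str:
        ...


class ContainerdRootfs:
    """Default path: stands in for pulling and unpacking via containerd."""

    def provide(self, image_ref: str) -> str:
        # A real implementation would pull missing layers and unpack snapshots here.
        return f"containerd-snapshot:{image_ref}"


class PreparedRootfs:
    """Override path: reuse a rootfs image that was prepared earlier."""

    def __init__(self, device: str):
        self.device = device

    def provide(self, image_ref: str) -> str:
        return self.device


def rootfs_for_machine(
    image_ref: str,
    default: RootfsProvider,
    override: Optional[RootfsProvider] = None,
) -> str:
    """Use the override when one is configured, otherwise the containerd path."""
    provider = override if override is not None else default
    return provider.provide(image_ref)


default_rootfs = rootfs_for_machine("registry.example/app:v1", ContainerdRootfs())
reused_rootfs = rootfs_for_machine(
    "registry.example/app:v1",
    ContainerdRootfs(),
    override=PreparedRootfs("prepared-rootfs:app-v1"),
)
```

The VM lifecycle only depends on the `provide` call, so new ways of sourcing a rootfs can be tried without touching the code that starts the VM.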

So where we used to do something roughly along these lines:

  1. create machine referencing a specific image
  2. missing layers are pulled from the registry (relatively slow)
  3. rootfs is prepared by unpacking successive layer snapshots (relatively slow)
  4. provide this final rootfs filesystem snapshot to Firecracker

We can now optionally configure a machine instance to reuse an existing image:

  1. create machine referencing a specific image
  2. locate an instance of an existing rootfs image to be reused (this is where we can now experiment with alternative approaches)
  3. provide a copy of this existing rootfs instance to Firecracker
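The two flows above can be sketched side by side as a single function that records the steps it takes. This is a toy trace rather than real machine-creation code: `create_machine` and `locate_existing` are hypothetical names, the catalog of prepared images is a plain dict, and falling back to the containerd path when nothing reusable is found is an assumption made for the example.

```python
from typing import Callable, Optional


def create_machine(image_ref: str, locate_existing: Callable[[str], Optional[str]]) -> list:
    """Return the list of steps taken to get a rootfs to Firecracker."""
    steps = [f"create machine referencing {image_ref}"]
    existing = locate_existing(image_ref)
    if existing is not None:
        # Reuse flow: no registry round trips, no layer unpacking.
        steps.append(f"locate existing rootfs image {existing}")
        steps.append("provide a copy of the existing rootfs to Firecracker")
    else:
        # Original flow: pull and prepare via containerd (relatively slow).
        steps.append("pull missing layers from the registry")
        steps.append("prepare rootfs by unpacking successive layer snapshots")
        steps.append("provide the final rootfs snapshot to Firecracker")
    return steps


# Hypothetical catalog of previously prepared rootfs images.
prepared = {"registry.example/app:v1": "prepared-rootfs:app-v1"}

reuse_steps = create_machine("registry.example/app:v1", prepared.get)
fallback_steps = create_machine("registry.example/app:v2", prepared.get)
```

How `locate_existing` finds an image is exactly the part that is now open to experimentation.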

Summary

We have changed how we provide a rootfs filesystem to the Firecracker VM of a machine instance. The rootfs is now “pluggable”, which makes machine creation much more flexible. This lets us explore alternative approaches that reuse existing rootfs filesystem images, with the aim of making machine creates faster and more responsive. We will have some interesting updates to share on this, so watch this space.


  [1] The terminology gets a little confusing here as we use “image” to refer to two different but overlapping concepts: the final filesystem image and the Docker image retrieved from the registry (consisting of one or more layers, each containing a set of files and directories that together make up the contents of the filesystem).

Does this also apply to GPU machines?

It will, yes. One of the primary drivers for “faster machine creates” is to reduce the time it takes to create GPU machines.

Just note that you won’t see any performance improvements yet as this is still preparatory work that we need to tackle before we can speed machine creates up.

This seems like a performant workaround for what is (for me at least) another misfeature in Docker. The ‘sandwich filesystem layers using btrfs’ trick works fine on a single host, but for deployment to 2-N instances I’d rather send over the assembled rootfs sandwich vs a list of sandwich ingredients.
