I need to create and destroy multiple machines on demand, and I'm using the Machines API for this purpose.
These are worker machines used to distribute a long-running process.
I find that the time to create machines varies significantly from one run to the next. In the best case, the machine is created and started within 5 seconds. On average, it takes around 20 seconds. In the worst cases, a machine takes over a minute, and I've seen some take 5 minutes to be created.
All machines are created from the same image the creator machine is running on (taken from the FLY_IMAGE_REF environment variable).
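For anyone who wants to reproduce these timings, here is a minimal sketch of how creation time can be measured. It assumes the public Machines API base URL https://api.machines.dev/v1, a bearer token supplied by the caller, and that the create response carries an `id` field. The helper names `build_create_request` and `timed_create` are hypothetical, and a wait that times out simply surfaces as an HTTP error here rather than being retried.

```python
import json
import time
import urllib.request

# Assumption: public Machines API base URL.
API_BASE = "https://api.machines.dev/v1"


def build_create_request(app: str, image: str, token: str) -> urllib.request.Request:
    """Build the POST request that creates one machine running `image`."""
    body = json.dumps({"config": {"image": image}}).encode()
    return urllib.request.Request(
        f"{API_BASE}/apps/{app}/machines",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def timed_create(app: str, image: str, token: str) -> tuple[str, float]:
    """Create a machine, wait for the `started` state, return (id, seconds)."""
    t0 = time.monotonic()
    with urllib.request.urlopen(build_create_request(app, image, token)) as resp:
        machine_id = json.load(resp)["id"]  # assumes the response has an "id"
    wait_request = urllib.request.Request(
        f"{API_BASE}/apps/{app}/machines/{machine_id}/wait?state=started&timeout=60",
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises HTTPError if the wait does not succeed within the timeout.
    with urllib.request.urlopen(wait_request):
        pass
    return machine_id, time.monotonic() - t0


# Usage inside a running machine (token handling is an assumption):
#   machine_id, seconds = timed_create(
#       os.environ["FLY_APP_NAME"],
#       os.environ["FLY_IMAGE_REF"],
#       os.environ["FLY_API_TOKEN"],
#   )
```

Logging the returned duration together with the machine id makes it easier to hand specific slow machines to support.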
All it takes is a single machine that is slow to be created for the whole process to be delayed, which defeats the purpose of distributed/parallel processing.
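One possible way to soften this straggler effect, which is my own assumption and not something suggested in this thread, is to request a few spare machines and proceed as soon as the required number are ready. The sketch below shows the idea with a caller-supplied `create_one` function (hypothetical: it would create one machine and return its id once started). `create_first_ready` is a hypothetical helper name as well.

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed
from typing import Callable


def create_first_ready(
    create_one: Callable[[], str], needed: int, spare: int = 1
) -> tuple[list[str], list[Future]]:
    """Start `needed + spare` creations in parallel and return as soon as
    `needed` of them have succeeded.

    Returns (ready_ids, leftovers). `leftovers` are futures for creations that
    had not been collected yet; the caller should destroy those machines once
    the futures resolve.
    """
    pool = ThreadPoolExecutor(max_workers=needed + spare)
    futures = [pool.submit(create_one) for _ in range(needed + spare)]
    ready: list[str] = []
    consumed: set[Future] = set()
    for fut in as_completed(futures):
        consumed.add(fut)
        if fut.exception() is None:  # skip creations that failed
            ready.append(fut.result())
        if len(ready) == needed:
            break
    pool.shutdown(wait=False)  # do not block on the stragglers
    if len(ready) < needed:
        raise RuntimeError(f"only {len(ready)} of {needed} machines were created")
    leftovers = [f for f in futures if f not in consumed]
    return ready, leftovers
```

The trade-off is paying for a few extra machine creations per batch in exchange for not waiting on the slowest one.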
The image size is reported as 436 MB in the Fly registry. This is not exactly a small image, but it's also not a huge one. I'd assume downloading images from the Fly registry to spin up new machines should be fast. Unfortunately, I cannot reduce its size further.
I know you’re going to tell me to create machines beforehand and start them on demand. But at some point, I need to cover for spikes and have to create machines on the fly.
Is there anything I can do to improve creation time? Is there anything Fly.io can do on the infrastructure side to improve these times?
We reviewed our internal build machine traces for some of the ones you provided, and the slowdown can occur when unpacking individual layers from images. What we've observed over the years is that the time to pull and prepare an image depends on two factors: overall image size and total number of layers. From what I could tell, the image you're using has a good number of layers, each of which has to be downloaded and unpacked on top of the previous one to produce the final filesystem that we turn into a device for Firecracker to use.
We continue to look for better ways to make this step faster. As of right now, the only suggestion we can provide is to adjust your image build step to use zstd compression. We've also rolled out some changes to further instrument the image pull and prepare process, which we hope will reveal more insight into when (and ideally why) some hosts occasionally take longer to pull and unpack layers.
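To check an image against the factors mentioned above, the sketch below summarizes an image manifest: layer count, total compressed size, and the compression each layer uses. It assumes the manifest has already been saved as JSON (for example, the raw output of `docker buildx imagetools inspect --raw` for a single-platform image) and follows the OCI/Docker layout with a `layers` array whose entries carry `mediaType` and `size`. `summarize_manifest` is a hypothetical helper name.

```python
from collections import Counter


def summarize_manifest(manifest: dict) -> dict:
    """Return layer count, total compressed bytes and per-compression counts."""
    layers = manifest.get("layers", [])
    compression: Counter = Counter()
    for layer in layers:
        media_type = layer.get("mediaType", "")
        # OCI media types end in "+gzip" / "+zstd"; Docker ones in ".gzip".
        if media_type.endswith(("+zstd", ".zstd")):
            compression["zstd"] += 1
        elif media_type.endswith(("+gzip", ".gzip")):
            compression["gzip"] += 1
        else:
            compression["other"] += 1
    return {
        "layer_count": len(layers),
        "compressed_bytes": sum(layer.get("size", 0) for layer in layers),
        "compression": dict(compression),
    }


# Usage, assuming the manifest was saved to manifest.json:
#   import json
#   with open("manifest.json") as fh:
#       print(summarize_manifest(json.load(fh)))
```

A high layer count with mostly gzip layers would suggest trying both suggestions together: fewer layers where the build allows it, and zstd for the ones that remain.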