Why do machines sometimes take a long time to be created?

I need to create and destroy multiple machines on-demand. I’m using the Machines API for this purpose.
These are worker machines used to distribute a long-running process.

I find that the time to create machines varies significantly from one run to the next. In the best case, the machine is created and started within 5 seconds. On average, it takes around 20 seconds. In the worst cases, a machine takes over a minute (I’ve seen machines take 5 minutes to be created).

All machines are created from the same image the creator machine is running on (taken from the FLY_IMAGE_REF environment variable).
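To make the setup concrete, a create call of this kind could be sketched as follows. This is an illustration, not the actual worker code. It assumes the public Machines API base URL `https://api.machines.dev/v1`, a token supplied in a `FLY_API_TOKEN` secret (an assumption), and the `FLY_APP_NAME` variable. The names `build_create_request`, `timed`, and `create_worker` are hypothetical. It only times how long the create call takes to return, not how long the machine takes to reach the started state.

```python
import json
import os
import time
import urllib.request

API_BASE = "https://api.machines.dev/v1"  # public Machines API base URL


def build_create_request(app_name: str, image: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates one machine."""
    body = json.dumps({"config": {"image": image}}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/apps/{app_name}/machines",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def timed(send, request):
    """Call send(request) and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = send(request)
    return result, time.monotonic() - start


def create_worker() -> dict:
    """Create one worker from the image this machine runs and log the timing."""
    request = build_create_request(
        os.environ["FLY_APP_NAME"],
        os.environ["FLY_IMAGE_REF"],   # same image as the creator machine
        os.environ["FLY_API_TOKEN"],   # assumption: token provided as a secret
    )

    def send(req):
        with urllib.request.urlopen(req, timeout=300) as response:
            return json.load(response)

    machine, seconds = timed(send, request)
    print(f"machine {machine.get('id')} created in {seconds:.1f}s")
    return machine
```

Logging the elapsed time per machine makes it easier to report which machines were slow and by how much.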

All it takes is a single slow machine to delay the whole process, which defeats the purpose of distributed/parallel processing.

The Fly registry reports the image size as 436 MB. This is not exactly a small image, but it’s also not a huge one. I’d assume downloading images from the Fly registry to spin up new machines should be fast. Unfortunately, I cannot reduce its size further.

I know you’re going to tell me to create machines beforehand and start them on demand. But at some point, I need to cover for spikes and have to create machines on the fly.

Is there anything I can do to improve creation time? Is there anything Fly.io can do on the infrastructure side to improve these times?

Thanks

Hi @empz,

We reviewed our internal machine build traces for some of the machines you provided, and the slowdown can occur when unpacking individual layers from images. What we’ve observed over the years is that the time to pull and prepare an image is a combination of two factors: overall image size and total number of layers. From what I could tell, the image you’re using has a good number of layers, each of which has to be downloaded and unpacked on top of the previous one to produce the final filesystem we turn into a device for Firecracker to use.
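As a rough way to check both factors for your own image, here is a sketch that summarizes an OCI/Docker image manifest. It assumes you already have the manifest JSON for a single-platform image (for example, the raw output of `docker buildx imagetools inspect --raw`). The function name `summarize_layers` and the example manifest are hypothetical and only for illustration.

```python
import json


def summarize_layers(manifest: dict) -> dict:
    """Count layers, total compressed size, and zstd-compressed layers."""
    layers = manifest.get("layers", [])
    return {
        "layer_count": len(layers),
        "compressed_bytes": sum(layer.get("size", 0) for layer in layers),
        # OCI zstd layers use a media type ending in "+zstd"
        "zstd_layers": sum(
            1 for layer in layers if layer.get("mediaType", "").endswith("+zstd")
        ),
    }


# Hypothetical manifest: sizes and media types are made up for illustration.
EXAMPLE_MANIFEST = {
    "schemaVersion": 2,
    "layers": [
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 100},
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 200},
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+zstd", "size": 300},
    ],
}

if __name__ == "__main__":
    print(json.dumps(summarize_layers(EXAMPLE_MANIFEST), indent=2))
```

A high `layer_count` relative to the image size suggests that merging build steps may be worth trying, though the actual effect on pull time is not something this sketch can measure.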

We continue to look for better ways to make this step faster. As of right now, the only suggestion we can provide is to adjust your image build step to use zstd compression. We’ve also rolled out some changes to further instrument the image pull and prepare process, which we hope will reveal more insight into when (and ideally why) some hosts occasionally take longer to pull and unpack layers.

Thanks for the reply. I will try zstd.

Are you also saying that reducing the number of layers in my image should improve the situation?

At first glance, zstd does improve things significantly.

Are there any plans to support this natively via a setting in fly.toml with Depot builders? Or even to make it the default?

The flyctl pull request “add experimental feature to enable zstd for depot builds” (Pull Request #4065, superfly/flyctl, by jipperinbham) should hopefully allow you to start using Depot builders with zstd enabled once it is merged and a new release is cut.

That’s awesome!

How will I know when this has been released?

Sorry for not responding last week!

flyctl has a defined release schedule so a new release should go out later today.

Amazing.

I should be able to simply add use_zstd = true within the experimental section, right?

Yep, that’s all you’ll need to do.
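In fly.toml, that would presumably look like the fragment below. The key name and section come from the exchange above; it assumes a flyctl release that includes the change and a build that uses Depot builders.

```toml
[experimental]
  use_zstd = true
```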

It looks like the GH action for this release failed.

Thanks for the heads up. We’re looking into it.

v0.3.38 has been published.
