Why do machines sometimes take a long time to be created?

empz · November 12, 2024, 1:44pm

I need to create and destroy multiple machines on demand. I’m using the Machines API for this purpose.
These are worker machines used to distribute a long-running process.

I find that the time to create a machine varies significantly from one request to the next. In the best case, the machine is created and started within 5 seconds. On average, it takes around 20 seconds. In the worst cases, a machine takes over a minute (I’ve seen machines take 5 minutes to be created).

All machines are created from the same image the creator machine is running (taken from the FLY_IMAGE_REF environment variable).
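For readers following along, here is a minimal sketch of what such a create request might look like. It only builds the request and does not send it. The host `https://api.machines.dev/v1`, the `FLY_APP_NAME` and `FLY_API_TOKEN` variable names, and the placeholder values are assumptions for illustration; they are not taken from this thread. A real request would normally also carry guest and region settings.

```python
import json
import os
import urllib.request

# Assumption: public Machines API base URL (not stated in the thread).
API_BASE = "https://api.machines.dev/v1"


def build_create_request(app_name, image_ref, token):
    """Build (but do not send) a POST request that creates one machine."""
    body = {"config": {"image": image_ref}}
    return urllib.request.Request(
        url=f"{API_BASE}/apps/{app_name}/machines",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Reuse the image the current machine runs on; fall back to
    # hypothetical placeholders when run outside a Fly machine.
    req = build_create_request(
        app_name=os.environ.get("FLY_APP_NAME", "example-app"),
        image_ref=os.environ.get("FLY_IMAGE_REF", "registry.fly.io/example-app:EXAMPLE"),
        token=os.environ.get("FLY_API_TOKEN", "placeholder-token"),
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually create the machine, so the sketch stops short of that.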

If even a single machine takes too long to be created, the whole process is delayed, which defeats the purpose of distributed/parallel processing.
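To make the arithmetic concrete: when a batch of machines is created in parallel and the work needs all of them, the batch is ready only when the slowest machine is. A small sketch, using a hypothetical batch loosely based on the timings above:

```python
def fan_out_ready_time(creation_seconds):
    """Time until every machine in a parallel batch is ready.

    The batch waits for its slowest member, so this is the maximum,
    not the average, of the individual creation times.
    """
    return max(creation_seconds)


# Hypothetical batch: one fast create, three typical ones, one straggler.
batch = [5, 20, 20, 20, 300]

typical = sorted(batch)[len(batch) // 2]  # median creation time
print("median create time (s):", typical)
print("batch ready after (s):", fan_out_ready_time(batch))
```

In this made-up batch the median is 20 seconds, yet the batch as a whole is not ready until the 300-second straggler finishes.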

The Fly registry reports the image size as 436 MB. This is not exactly a small image, but it’s also not a huge one. I’d assume downloading images from the Fly registry to spin up new machines should be fast. Unfortunately, I cannot reduce its size further.

I know you’re going to tell me to create machines beforehand and start them on demand. But at some point, I need to cover for spikes and have to create machines on the fly.

Is there anything I can do to improve creation time? Is there anything Fly.io can do on the infrastructure to improve these times?

Thanks

JP_Phillips · November 12, 2024, 10:04pm

Hi @empz,

We reviewed our internal build machine traces for some of the machines you provided, and the slowdown can occur when unpacking individual layers from images. What we’ve observed over the years is that the time to pull and prepare an image is a combination of two factors: overall image size and total number of layers. From what I could tell, the image you’re using has a good number of layers, each of which has to be downloaded and unpacked on top of the previous one to produce the final filesystem we turn into a device for Firecracker to use.

We continue to look for better ways to make this step faster. As of right now, the only suggestion we can provide is to adjust your image build step to use zstd compression. We’ve also rolled out some changes to further instrument the image pull and prepare process, which we hope will give more insight into when (and ideally why) some hosts occasionally take longer to pull and unpack layers.
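For anyone wanting to check these two factors on their own image, here is a rough sketch that summarizes an OCI image manifest: layer count, total compressed size, and the compression used per layer. The sample manifest is hypothetical. A real one could be obtained with something like `docker manifest inspect`, and a multi-platform image returns an index rather than a single manifest, which this sketch does not handle.

```python
def compression_of(media_type):
    """Guess a layer's compression from the end of its media type."""
    for name in ("zstd", "gzip"):
        if media_type.endswith(name):
            return name
    return "uncompressed or unknown"


def summarize_layers(manifest):
    """Return layer count, total compressed bytes, and compressions used."""
    layers = manifest.get("layers", [])
    return {
        "layer_count": len(layers),
        "compressed_bytes": sum(layer["size"] for layer in layers),
        "compression": sorted({compression_of(layer["mediaType"]) for layer in layers}),
    }


# Hypothetical manifest with three gzip-compressed layers (digests omitted).
SAMPLE_MANIFEST = {
    "schemaVersion": 2,
    "layers": [
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 30_000_000},
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 120_000_000},
        {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "size": 5_000_000},
    ],
}

print(summarize_layers(SAMPLE_MANIFEST))
```

After rebuilding with zstd, the layer media types would be expected to end in `+zstd` instead of `+gzip`.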

empz · November 13, 2024, 4:23am

Thanks for the reply. I will try zstd.

Are you also saying that reducing the number of layers in my image should improve the situation?

empz · November 13, 2024, 5:28am

At first glance, zstd does improve things significantly.

Are there any plans to support this natively via a setting in fly.toml with Depot builders? Or even to make it the default?

JP_Phillips · November 15, 2024, 5:22pm

The pull request "add experimental feature to enable zstd for depot builds" (#4065 in superfly/flyctl, by jipperinbham) should hopefully allow you to start using Depot builders with zstd enabled once it is merged and a new release is cut.

empz · November 15, 2024, 5:34pm

That’s awesome!

How will I know when this has been released?

JP_Phillips · November 18, 2024, 7:37pm

Sorry for not responding last week!

flyctl has a defined release schedule, so a new release should go out later today.

empz · November 18, 2024, 7:50pm

Amazing.

I should be able to simply add use_zstd = true within the experimental section, right?

JP_Phillips · November 18, 2024, 7:52pm

Yep, that’s all you’ll need to do.

empz · November 18, 2024, 9:58pm

It looks like the GH action for this release failed.

JP_Phillips · November 18, 2024, 10:01pm

Thanks for the heads up. We’re looking into it.

JP_Phillips · November 18, 2024, 10:46pm

v0.3.38 has been published

