We made GPU machines boot faster

dangra · April 8, 2024, 8:08pm

As of today, starting a GPU machine is 1.6s faster per attached device.

That means less time waiting for your machine to be ready when scaling up or starting from cold. Shaving seconds off start times not only reduces compute costs for our users, it also makes autostarting machines based on incoming request patterns practical for more applications.

For the nerdy details, read on.

Previously, the machine’s init process called nvidia-smi, a tool provided by NVIDIA that, among other things, creates the character devices at /dev/nvidia*.

It is important to create the device files before handing control over to the application’s process, because apps that drop root privileges (those with a USER statement in their Dockerfile) won’t be able to create them on their own.

nvidia-smi served us fine as a first version, but it spends considerable extra time on tasks we don’t need at this phase. Scanning the bus and creating the devices shouldn’t take that long.

Could we do better? nvidia-smi is closed source, and reverse engineering it is not a pretty job. The answer came from NVIDIA itself and its open-source nvidia-container-toolkit (yay OSS!). Truth be told, the open-source implementation isn’t complete, but a search on the NVIDIA forums revealed the rest of the details. With all the details in place, we were able to reimplement device creation in Rust, and now it takes less than a hundred milliseconds.
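To give a rough idea of what "device creation" boils down to, here is a minimal sketch. It is not our actual init code. It assumes the conventional NVIDIA character device numbers (major 195, one minor per GPU, minor 255 for nvidiactl), and the names DeviceNode, plan_nodes and makedev are hypothetical helpers made up for this illustration. It only computes which nodes to create; other nodes such as /dev/nvidia-uvm, whose major number is assigned dynamically, are left out.

```rust
use std::path::{Path, PathBuf};

// Assumption: conventional major number for NVIDIA GPU character devices.
const NVIDIA_MAJOR: u32 = 195;
// Assumption: minor number conventionally used by /dev/nvidiactl.
const NVIDIACTL_MINOR: u32 = 255;

/// One character device that should exist before the app starts.
#[derive(Debug, PartialEq)]
struct DeviceNode {
    path: PathBuf,
    major: u32,
    minor: u32,
}

/// Pack major/minor into a Linux dev_t, using the same bit layout as
/// glibc's makedev().
fn makedev(major: u32, minor: u32) -> u64 {
    let (major, minor) = (major as u64, minor as u64);
    ((major & 0xffff_f000) << 32)
        | ((major & 0x0000_0fff) << 8)
        | ((minor & 0xffff_ff00) << 12)
        | (minor & 0x0000_00ff)
}

impl DeviceNode {
    /// Device number that would be passed to mknod(2).
    fn dev(&self) -> u64 {
        makedev(self.major, self.minor)
    }
}

/// List the nodes to create for `gpu_count` attached GPUs:
/// the control device first, then one node per GPU.
fn plan_nodes(dev_dir: &Path, gpu_count: u32) -> Vec<DeviceNode> {
    let mut nodes = vec![DeviceNode {
        path: dev_dir.join("nvidiactl"),
        major: NVIDIA_MAJOR,
        minor: NVIDIACTL_MINOR,
    }];
    for i in 0..gpu_count {
        nodes.push(DeviceNode {
            path: dev_dir.join(format!("nvidia{}", i)),
            major: NVIDIA_MAJOR,
            minor: i,
        });
    }
    nodes
}
```

Each planned entry would then be handed to mknod(2) as a character device while init still runs as root, for example through the libc crate's mknod binding, with permissions loose enough that a process that later drops privileges can still open it.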

Happy GPU hacking, y’all!
