GPU warm up period?

I’m working on a custom inference server and I’d like the server to be ready to serve requests as quickly as possible. I’m observing behavior I don’t see on local GPUs during testing: the first request that hits the GPU takes quite a while to complete.

Is there something that happens at the hypervisor level when the GPU is used for the first time? If so, is there a way to avoid it/cache it?
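For context, here’s a sketch of the kind of warm-up I’m considering as a workaround: run a few throwaway inferences before marking the server ready. `warm_up` and the dummy model call are placeholders of my own, not a real framework API:

```python
import time

def warm_up(infer, dummy_input, n_iters=3):
    """Run a few throwaway inferences before marking the server ready,
    so one-time costs (driver init, kernel compilation, allocations)
    are paid before the first real request arrives."""
    latencies = []
    for _ in range(n_iters):
        start = time.perf_counter()
        infer(dummy_input)  # discard the result; only timing matters here
        latencies.append(time.perf_counter() - start)
    return latencies  # the first entry is typically the slowest

# Stand-in for a real model call, just to show the shape of the API:
lat = warm_up(lambda x: sum(i * i for i in x), list(range(10_000)))
print(len(lat))  # 3
```

That handles framework-level warm-up fine, but it wouldn’t help if the slowness comes from the hypervisor side, which is why I’m asking.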

Hmm, that’s odd. Do you have logs of the behavior? What GPU type and region are you using?

Are you launching a machine and waiting for it to be created and serve a request, or are you starting an existing machine? Starts should be very fast.