I’m working on a custom inference server and I’d like the server to be ready to serve requests as quickly as possible. I’m seeing behavior I don’t see on local GPUs during testing: the first request that hits the GPU takes noticeably longer to complete than subsequent ones.
Does something happen at the hypervisor level the first time the GPU is touched? If so, is there a way to avoid it, or to warm/cache it ahead of time?
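To make the symptom concrete, here is a minimal timing sketch of what I mean (this assumes a PyTorch-based stack purely for illustration; my actual server may differ, and `time_first_vs_warm` is just a name I made up for this repro):

```python
import time

import torch


def time_first_vs_warm(device: str = "cuda") -> None:
    """Compare the first GPU operation, which pays one-time costs
    (CUDA context creation, driver/module loading, allocator warm-up),
    against a subsequent warm call."""
    if not torch.cuda.is_available():
        print("no GPU available; skipping")
        return

    x = torch.randn(1024, 1024)

    # First touch of the GPU: includes context init and kernel loading.
    t0 = time.perf_counter()
    y = x.to(device) @ x.to(device)
    torch.cuda.synchronize()
    first = time.perf_counter() - t0

    # Warm call: same work, one-time costs already paid.
    t0 = time.perf_counter()
    y = y @ y
    torch.cuda.synchronize()
    warm = time.perf_counter() - t0

    print(f"first call: {first:.3f}s, warm call: {warm:.4f}s")


if __name__ == "__main__":
    time_first_vs_warm()
```

On my local GPUs the gap between the first and warm call is small, but on the virtualized machines it is much larger, which is why I suspect something at the hypervisor level.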