We use Fly GPUs for a few reasons.
They let us keep our AI workloads on servers we control, minimizing the privacy impact.
They let us reuse infrastructure we already run on Fly: the GPU servers are private-only and reachable through a private Flycast address, so we retain scaling and autostart/autostop while staying easily accessible from our API servers.
We can do all of the above while keeping costs down, since autostart/autostop means idle GPU machines are stopped instead of billing around the clock.
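As a rough sketch of what this looks like in fly.toml, here is a hypothetical config for a private GPU inference app; the app name, port, and VM size are illustrative placeholders, not our actual values:

```toml
# Hypothetical fly.toml for a private GPU inference app.
app = "inference"

[http_service]
  internal_port = 8000        # port the model server listens on
  auto_stop_machines = true   # stop idle GPU machines to cut costs
  auto_start_machines = true  # wake a machine when a request arrives
  min_machines_running = 0    # scale all the way to zero when idle

[[vm]]
  size = "a100-40gb"          # one of Fly's GPU machine presets
```

With no public IPs allocated, the app is only reachable on the private network; allocating a Flycast IP (`fly ips allocate-v6 --private`) gives other apps in the organization a stable internal address that still routes through Fly's proxy, which is what lets autostart/autostop keep working.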
We run 7B-parameter models without volumes, which gives request times of around 30s from a cold start and around 3s once a machine is warm.
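On the client side, the main thing to account for is that the first request after idle may take the full cold-start time. A minimal sketch, assuming a Flycast hostname of `inference.flycast` and a JSON `/generate` endpoint (both hypothetical placeholders):

```python
import requests

# Hypothetical private Flycast address; fly-proxy forwards to the
# app's internal_port, so no explicit port is needed here.
BASE_URL = "http://inference.flycast"

def generate(prompt: str) -> str:
    # A cold start (machine boot + model load) can take ~30s, so the
    # timeout must cover it; warm requests return in ~3s.
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": prompt},
        timeout=60,  # comfortably above the ~30s worst case
    )
    resp.raise_for_status()
    return resp.json()["text"]

if __name__ == "__main__":
    print(generate("Hello"))
```

The generous timeout is the whole trick: the proxy holds the request while a stopped machine boots, so callers that tolerate the cold-start delay need no special retry logic.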