I’m seeing a lot of Fly.io videos and blog posts about combining GPUs and Ollama. It’s nice to be able to deploy GPUs here, but Ollama works with open-source models, and these are very cheap to use through a lot of API vendors. If so, what’s the point of paying for expensive GPU hours when you can pay for cheap API requests?
Can you provide more details on the pricing of these LLM API vendors? To my knowledge, most (if not all) of them charge per input token, which makes sense b/c the compute is based on the input. It looks like things end up costing as low as $0.0005/call.
Fly’s most expensive GPU is $3.50/hour, which is about 6c/minute… autostop usually kills the machine in ~10 minutes, so it costs about 60 cents to run a single query.
Pure paper-napkin math: running your own model on Fly would be cheaper at scale vs paying these vendors. But if you’re doing smaller requests, then yea, it makes more sense to use ChatGPT, etc…
With $0.60 I can use 10M tokens of Llama 3.1 8B at $0.06/M input tokens on Deepinfra, and without the 10-minute window. I simply want to understand the case for Ollama here.
Sure, then Deepinfra is cheaper. But what happens when you need to use, for w/e use case, 1B tokens? Running your own model on Fly would be cheaper, would it not? The price would still be $3.50 + the standard bandwidth vs $60.
To serve 1B tokens you will need to pay for a lot of bandwidth, which is not cheap, and I doubt one GPU will cover it. Am I wrong?
Yea, you’re right. It’ll be closer to $50 vs $60
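The comparison above depends on how many tokens per second a single GPU can actually sustain, which nobody in the thread measured. A rough sketch of the break-even arithmetic, using only the prices quoted above ($0.06/M tokens, $3.50/hour); the throughput values in the loop are hypothetical placeholders, and bandwidth is left out:

```python
API_PRICE_PER_M_TOKENS = 0.06   # USD per million tokens, as quoted in the thread
GPU_HOURLY_RATE = 3.50          # USD per GPU-hour, as quoted in the thread
TOTAL_TOKENS = 1_000_000_000    # the 1B-token workload being discussed


def api_cost(tokens, price_per_m):
    """Cost of pushing `tokens` through a per-token API."""
    return tokens / 1_000_000 * price_per_m


def gpu_cost(tokens, tokens_per_second, hourly_rate):
    """Cost of one GPU processing `tokens` at a sustained throughput."""
    hours = tokens / tokens_per_second / 3600
    return hours * hourly_rate


def break_even_tokens_per_second(price_per_m, hourly_rate):
    """Throughput at which a GPU-hour costs the same as the API."""
    tokens_per_hour = hourly_rate / price_per_m * 1_000_000
    return tokens_per_hour / 3600


print(f"API: ${api_cost(TOTAL_TOKENS, API_PRICE_PER_M_TOKENS):.2f}")  # API: $60.00
print(f"break-even: {break_even_tokens_per_second(API_PRICE_PER_M_TOKENS, GPU_HOURLY_RATE):.0f} tokens/s")  # 16204

# Hypothetical sustained throughputs, NOT measured numbers.
for tps in (1_000, 5_000, 20_000):
    print(f"{tps} tokens/s -> ${gpu_cost(TOTAL_TOKENS, tps, GPU_HOURLY_RATE):.2f}")
# 1000 tokens/s -> $972.22
# 5000 tokens/s -> $194.44
# 20000 tokens/s -> $48.61
```

Under these assumptions the GPU only wins if it sustains roughly 16,000 tokens per second around the clock; at lower throughput the per-token API stays cheaper for this workload.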
API vendors are doing the equivalent of running one machine and sending multiple users’ requests to it. It’s very different from running dedicated GPUs.
Sending 10 requests per minute to a Fly GPU is going to be much cheaper than the API vendor equivalents. Sending one request every few hours is not. It’s just a trade-off in how we implement these things.
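To make the volume effect concrete, here is a minimal sketch of GPU cost per request as a function of how far apart requests arrive. It assumes the $3.50/hour rate and the ~10-minute autostop window mentioned earlier in the thread, evenly spaced requests, and it ignores processing and cold-start time; the function name is made up for illustration.

```python
GPU_HOURLY_RATE = 3.50   # USD per hour, from earlier in the thread
IDLE_WINDOW_MIN = 10.0   # assumed autostop delay, from earlier in the thread


def gpu_cost_per_request(gap_minutes,
                         hourly_rate=GPU_HOURLY_RATE,
                         idle_window_min=IDLE_WINDOW_MIN):
    """Approximate GPU cost attributable to one request.

    If the next request arrives before the idle window elapses, the
    machine never stops and each request pays for the gap between
    requests. Otherwise each request pays for one full idle window.
    """
    billed_minutes = min(gap_minutes, idle_window_min)
    return hourly_rate / 60 * billed_minutes


busy = gpu_cost_per_request(0.1)    # 10 requests per minute
quiet = gpu_cost_per_request(180)   # one request every 3 hours
print(f"busy:  ${busy:.4f} per request")   # busy:  $0.0058 per request
print(f"quiet: ${quiet:.4f} per request")  # quiet: $0.5833 per request
```

In this simplified model the same GPU is about 100 times more expensive per request at very low traffic, which is the gap the per-request API pricing hides by sharing machines across users.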
In theory, if we could make Machines with GPUs start and stop fast enough, you could get similar expense at very low volumes. That turns out to be hard, though.
So for what you’re doing, the API vendors are probably a good choice. When you have an app with traffic, using the GPUs directly is likely to be better.
I think most businesses will be better off paying for the big vendors’ APIs than paying for self-deployed GPU time. But I would still like to see some real business cases using Ollama on Fly GPUs, with the costs compared against APIs.
By the way, how does it compare with Replicate? They state:
You only pay for what you use on Replicate, billed by the second. When you don’t run anything, it scales to zero and you don’t pay a thing.
Btw, Ollama or not, the videos from the woman on the Fly.io team are great; I watch them all.
Replicate sounds like Fly GPUs. If you’re already on Fly, why not use Fly’s GPUs to run the LLM?
Looking at Replicate’s pricing: Nvidia A100 (80GB) is about $5/hour vs $3.50/hour on Fly.
I personally wouldn’t want to add another 3rd-party dependency to my stack; that’s why I’m running llama on Fly. But you do you.
Because Replicate scales to zero immediately after use.
You can pretty much do the same thing on Fly GPUs by exiting the process w/ code 0. You just have to make sure there’s no work being done.
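A minimal sketch of that idea: track in-flight work, and exit with code 0 only once nothing is running and the process has been idle for a while. `IdleTracker` and `watch_and_exit` are hypothetical names, not part of any Fly or Ollama API, and the sketch assumes request handlers run in daemon threads so that exiting the main thread ends the process.

```python
import sys
import threading
import time


class IdleTracker:
    """Tracks in-flight work and decides when it is safe to exit."""

    def __init__(self, idle_seconds, clock=time.monotonic):
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._in_flight = 0
        self._last_activity = clock()

    def start_request(self):
        # Call when a request begins.
        with self._lock:
            self._in_flight += 1
            self._last_activity = self._clock()

    def finish_request(self):
        # Call when a request completes (successfully or not).
        with self._lock:
            self._in_flight -= 1
            self._last_activity = self._clock()

    def should_exit(self):
        # Safe to exit only if nothing is running and we have been idle long enough.
        with self._lock:
            idle_for = self._clock() - self._last_activity
            return self._in_flight == 0 and idle_for >= self._idle_seconds


def watch_and_exit(tracker, poll_seconds=1.0):
    """Block until the tracker reports idle, then exit with code 0.

    Assumes it runs in the main thread and that worker threads are daemons.
    """
    while not tracker.should_exit():
        time.sleep(poll_seconds)
    sys.exit(0)
```

Wrapping each request handler in `start_request()` / `finish_request()` (ideally in a `try`/`finally`) is what guarantees no work is cut off when the process exits.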
I see. I think I’ll try it and then compare with Replicate. The only thing is, since Fly.io is postpaid by credit card, I’m afraid of doing something wrong, like forgetting to exit, and getting a huge GPU bill.
Let us know how it goes. I thought Fly lets you buy credits with prepaid cards?
Yes, you can, but the credits are only applied against the postpaid credit card bill. So, for example, if I buy $50 of credits and forget to turn the GPU off, I could get a surprise bill of $400 minus the $50 of credits. The credits going to zero won’t stop the GPUs. It works like that for the whole platform, not only for the GPUs.
Ah I see. If you set auto_stop_machines = "stop", it should kill idle machines w/in 10 minutes.
But if you have something that keeps the machine up, then yea you’ll have a big bill.
For what it’s worth, we’ll refund large, unintentional burst charges. Our business works better if people grow consistently over time; collecting $500 you didn’t mean to spend does not benefit us.
You should probably get a plan with paid email support if you’re worried about this though. Otherwise it might take many days for us to reply to your email.
We use Fly GPUs for a few reasons.
It allows us to keep our AI workloads within our servers, minimizing the privacy impact.
It allows us to reuse infrastructure we already have running on Fly, e.g. the servers are private-only and accessible through a private Flycast, so we retain scaling and autostart/autostop while still being easily accessible from our API servers.
We can do the above while keeping costs down thanks to autostart/autostop.
We use 7B size models without volumes which gives us request times from cold start of around 30s and request times from warm of around 3s.
I wonder what’s more cost effective: having the 30 second cold start or paying for the extra storage for the model. There are bandwidth costs when pulling down the models each time too.
@khuezy the 30s cold start isn’t downloading the model. The model is baked into the Docker image. The cold start time is simply the amount of time it takes for the machine to start, Ollama to start, the model to be copied into VRAM, and the inference to be processed.