Fly.io, GPUs and Ollama

I’m seeing a lot of Fly.io videos and blog posts about the GPU + Ollama combo. It’s nice to be able to deploy GPUs here, but Ollama works with open-source models, and those are very cheap to use through a lot of API vendors. So what’s the point of paying for expensive GPU hours when you can pay for cheap API requests?

Can you provide more details on the pricing of these LLM API vendors? To my knowledge, most (if not all) of them charge based on input tokens, which makes sense b/c the compute scales with the input. It looks like things end up costing as little as $0.0005/call.

Fly’s most expensive GPU is $3.50/hour, which is ~6¢/minute… autostop usually kills the machine in ~10 minutes, so it costs about 60 cents to run a query.

Pure paper-napkin math: running your own model on Fly would be cheaper at scale vs paying these vendors. But if you’re doing smaller volumes, then yea, it makes more sense to use ChatGPT, etc…

With $0.60 I can use 10M input tokens of Llama 3.1 8B at $0.06/M on Deepinfra, and without the 10-minute window. I simply want to understand the case for Ollama here.

Sure, then Deepinfra is cheaper. But what happens when you need to use - for w/e use case - 1B tokens? Running your own model on Fly would be cheaper, would it not? The price would still be $3.50 + the standard bandwidth vs $60.
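For reference, the arithmetic behind the dollar figures in this exchange, as a quick sketch using the Deepinfra rate and Fly GPU price quoted above:

```python
# Where the dollar figures in this thread come from.

DEEPINFRA_PER_M_INPUT = 0.06   # $/1M input tokens, Llama 3.1 8B
FLY_GPU_PER_HOUR = 3.50        # $/hour, Fly's most expensive GPU

# API side: pay per token.
print(f"10M tokens via API: ${10 * DEEPINFRA_PER_M_INPUT:.2f}")     # $0.60
print(f"1B tokens via API:  ${1_000 * DEEPINFRA_PER_M_INPUT:.2f}")  # $60.00

# Fly side: pay for the time the machine is running.
per_minute = FLY_GPU_PER_HOUR / 60
print(f"Fly GPU per minute: ${per_minute:.3f}")              # ~$0.058
print(f"10-minute autostop window: ${10 * per_minute:.2f}")  # ~$0.58
```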

To serve 1B tokens you’ll need to pay for a lot of bandwidth, which is not cheap, and I doubt one GPU will cover it. Am I wrong?

Yea, you’re right. It’ll be closer to $50 vs $60

API vendors are doing the equivalent of running one machine and sending multiple users’ requests to it. It’s very different from running dedicated GPUs.

Sending 10 requests per minute to a Fly GPU is going to be much cheaper than the API vendor equivalents. Sending one request every few hours is not. It’s just a trade-off in how we implement these things.

In theory, if we could make Machines with GPUs start and stop fast enough, you could get similar expense at very low volumes. That turns out to be hard, though.

So for what you’re doing, the API vendors are probably a good choice. When you have an app with traffic, using the GPUs directly is likely to be better.
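To put a rough number on that trade-off, here’s a break-even sketch; the per-million rates are illustrative, not any particular vendor’s price list:

```python
# Break-even: how many input tokens per hour a dedicated GPU has to serve
# before it undercuts a per-token API at a given rate. Rates are illustrative.

FLY_GPU_PER_HOUR = 3.50  # $/hour while the machine is running

for api_per_m in (0.06, 0.50, 5.00):  # cheap OSS host ... pricier hosted-model API
    breakeven_m_tokens = FLY_GPU_PER_HOUR / api_per_m
    print(f"API at ${api_per_m:.2f}/M tokens: GPU wins above "
          f"{breakeven_m_tokens:,.1f}M tokens/hour "
          f"(~{breakeven_m_tokens * 1e6 / 3600:,.0f} tokens/s sustained)")
```

Whether you clear that bar depends entirely on your traffic and on which API you’re comparing against.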


I think most businesses will be better off paying for the big vendors’ APIs than for self-deployed GPU time. But I would still like to see some real business cases using Ollama on Fly GPUs and then comparing the costs with the APIs.
By the way, how does it compare with Replicate? They state:

You only pay for what you use on Replicate, billed by the second. When you don’t run anything, it scales to zero and you don’t pay a thing.

Btw, Ollama or not, the Fly.io team’s videos are great, I watch them all.

Replicate sounds like Fly GPUs. If you’re already on Fly, why not use Fly’s GPUs to run the LLM?
Looking at Replicate’s pricing: an Nvidia A100 (80GB) is about $5/hour vs Fly’s $3.50/hour.

I personally wouldn’t want to add another 3rd-party dependency to my stack, that’s why I’m running Llama on Fly. But you do you.

Because Replicate scales to zero immediately after use.

You can pretty much do the same thing on Fly GPUs by exiting the process w/ code 0. You just have to make sure there’s no work being done.
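Here’s a minimal sketch of that idea: a wrapper that runs ollama serve as the machine’s main process and exits 0 once nothing has been loaded for a while. It assumes Ollama’s /api/ps endpoint (which lists models currently held in memory) as the idle signal, a restart policy that doesn’t restart on a clean exit, and arbitrary timings:

```python
# Idle watchdog sketch: run "ollama serve" and exit 0 when idle so Fly stops the machine.
import json
import subprocess
import sys
import time
import urllib.request

IDLE_LIMIT_S = 300  # exit after 5 idle minutes (arbitrary)
OLLAMA_URL = "http://127.0.0.1:11434"

def loaded_models() -> int:
    """Number of models Ollama currently has in memory; 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps", timeout=5) as resp:
            return len(json.load(resp).get("models", []))
    except OSError:
        return 0

def main() -> None:
    server = subprocess.Popen(["ollama", "serve"])
    last_busy = time.time()
    while True:
        time.sleep(15)
        if server.poll() is not None:      # ollama died: propagate its exit code
            sys.exit(server.returncode)
        if loaded_models() > 0:
            last_busy = time.time()
        elif time.time() - last_busy > IDLE_LIMIT_S:
            server.terminate()
            try:
                server.wait(timeout=30)
            except subprocess.TimeoutExpired:
                server.kill()
            sys.exit(0)                    # clean exit -> machine stops

if __name__ == "__main__":
    main()
```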

I see. I think I’ll try it and then compare with Replicate. The only thing is, since Fly.io bills a credit card postpaid, I’m afraid of doing something wrong, forgetting to stop it, and getting a huge GPU bill.

Let us know how it goes. I thought Fly lets you buy credits with prepaid cards?

Yes you can, but those credits just get applied against the postpaid credit card bill. So, for example, if I buy $50 and forget the GPU on, I could get a surprise $400 bill minus the $50 of credits. The credits hitting zero won’t stop the GPUs. It works like that not only for GPUs, but for the whole platform.

Ah I see. If you set auto_stop_machines = "stop", it should kill idle machines w/in 10 minutes.
But if you have something that keeps the machine up, then yea you’ll have a big bill.

For what it’s worth, we’ll refund large, unintentional burst charges. Our business works better if people grow consistently over time; collecting $500 you didn’t mean to spend does not benefit us.

You should probably get a plan with paid email support if you’re worried about this though. Otherwise it might take many days for us to reply to your email.


We use fly GPUs for a few reasons.

It allows us to keep our AI workloads within our servers, minimizing the privacy impact.

It allows us to reuse infrastructure we already have running on Fly, e.g. the servers are private-only and accessible through a private flycast, so we retain scaling and autostart/autostop while still being easily accessible from our API servers.

We can do the above while keeping costs down thanks to autostart/autostop.

We use 7B-size models without volumes, which gives us request times of around 30s from a cold start and around 3s from warm.


I wonder what’s more cost-effective: having the 30-second cold start or paying for the extra storage for the model. There are bandwidth costs when pulling down the models each time, too.

@khuezy the 30s cold start isn’t downloading the model. The model is baked into the Docker image. The cold start time is simply the amount of time it takes for the machine to start, Ollama to start, the model to be copied into VRAM, and for the inference to be processed.
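If anyone wants to reproduce that cold-vs-warm comparison, here’s a quick timing sketch against Ollama’s generate endpoint; the flycast hostname and model name are placeholders for whatever you deploy:

```python
# Time two back-to-back requests: the first pays the cold start
# (machine boot + Ollama start + model into VRAM), the second is served warm.
import json
import time
import urllib.request

URL = "http://my-ollama-app.flycast:11434/api/generate"  # placeholder flycast address
PAYLOAD = json.dumps({
    "model": "llama3.1:8b",                              # placeholder model
    "prompt": "Say hello in one short sentence.",
    "stream": False,
}).encode()

def timed_request(label: str) -> None:
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    start = time.time()
    with urllib.request.urlopen(req, timeout=120) as resp:
        json.load(resp)
    print(f"{label}: {time.time() - start:.1f}s")

timed_request("cold")  # machine has to start first
timed_request("warm")  # model already loaded
```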