Hi! I’m doing some tests here with Ollama.
I’ve set up an L40S machine with 32GB RAM and a 50GB volume to pull a 72B model.
From the logs below, I see it took about 2 minutes from cold start to finish the chat completion, and about 4 minutes for the machine to stop.
I’m using Flycast, with auto_stop_machines = 'stop' and min_machines_running = 0.
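Roughly the relevant part of my fly.toml looks like this (retyped here, so treat it as a sketch rather than the exact file):

```toml
[http_service]
  internal_port = 11434          # Ollama's default port
  auto_stop_machines = 'stop'
  auto_start_machines = true
  min_machines_running = 0
```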
Another thing I’ve realized is that it forces me onto a performance-8x CPU.
How can I get more performance at lower cost running Ollama?
When the bill comes, I want to see the details in the app billing to check how all of this was charged.
It depends on your trade-off of UX vs. cost.
I’m assuming you don’t have a giant VC-backed vault of cash to burn, so you don’t mind having the user wait a bit.
Is the 72B model really needed? If not, try a smaller one, since loading that much into memory at boot adds to the cold start time.
Instead of paying for 2-3 minutes of idle time, wrap your Ollama app in a server that proxies requests to Ollama; that way you have control over when to kill the process.
E.g., after each /api/chat completion, set a timeout that exits the process with status 0. If a new chat request comes in, clear and reset that timeout. Over time that will save you a decent amount of money.
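Here’s a rough sketch of what I mean, assuming Ollama listens on localhost:11434 inside the machine and your Fly service routes to the proxy instead; the port, idle timeout, and header handling are all placeholders to adapt:

```python
import os
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

OLLAMA_URL = "http://127.0.0.1:11434"   # assumption: Ollama is listening here inside the machine
IDLE_SECONDS = 120                      # assumption: tune to your UX vs cost trade-off

_timer = None
_lock = threading.Lock()

def _reset_idle_timer():
    """(Re)start the idle countdown; when it fires, exit 0 so the Machine can stop."""
    global _timer
    with _lock:
        if _timer is not None:
            _timer.cancel()
        _timer = threading.Timer(IDLE_SECONDS, lambda: os._exit(0))
        _timer.daemon = True
        _timer.start()

class OllamaProxy(BaseHTTPRequestHandler):
    def _forward(self):
        _reset_idle_timer()  # a request arrived: clear and restart the countdown
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(
            OLLAMA_URL + self.path,
            data=body,
            method=self.command,
            headers={k: v for k, v in self.headers.items()
                     if k.lower() not in ("host", "content-length")},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                self.send_response(resp.status)
                for k, v in resp.getheaders():
                    if k.lower() not in ("transfer-encoding", "connection"):
                        self.send_header(k, v)
                self.end_headers()
                # Ollama streams responses; pass chunks through as they arrive.
                while True:
                    chunk = resp.read(8192)
                    if not chunk:
                        break
                    self.wfile.write(chunk)
        except urllib.error.HTTPError as e:
            self.send_response(e.code)
            self.end_headers()
            self.wfile.write(e.read())
        _reset_idle_timer()  # count idle time from the end of the response, not the start

    do_GET = do_POST = do_DELETE = _forward

if __name__ == "__main__":
    _reset_idle_timer()
    # assumption: the service's internal_port points at 8080 (this proxy), not 11434
    ThreadingHTTPServer(("0.0.0.0", 8080), OllamaProxy).serve_forever()
```

The point of exiting cleanly is that the Machine stops as soon as you decide it’s idle, instead of waiting out the proxy’s own idle window, and with min_machines_running = 0 it gets started again on the next request.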
I’m using a chat UI in another app, calling the completion through Flycast. I also tried hitting the port through .internal:11434; I don’t know why, but the machine was stopped faster that way than when going through Flycast.
Yes, I’m looking to set that up.
On top of the L40S costs, I assume I’m also paying for the default performance-8x CPU and the 32GB of VM RAM… it’s not that clear in the docs.