Testing Ollama and L40S

Hi! I’m doing some tests here with Ollama.
I’ve set up an L40S machine with 32GB RAM and a 50GB volume to pull a 72B model.
As the logs below show, it took about 2 minutes from cold start to finish the chat completion, and about 4 minutes to stop.
I’m using flycast, auto_stop_machines = 'stop' and min_machines_running = 0.
Another thing I’ve noticed is that it forces me to use a performance 8x CPU.
How can I get more performance at a lower cost running Ollama?
When the bill comes, I want to see the details in the app billing to check how it was billed.

2024-09-06 11:49:50.584 Starting machine

2024-09-06 11:51:53.234
[GIN] 2024/09/06 - 14:51:53 | 200 | 1.330684722s | POST "/v1/chat/completions"

2024-09-06T14:55:12.643 proxy ord [info] App has excess capacity, autostopping machine
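
As a sanity check on those timings, here is a small Python sketch that computes the two gaps from the log lines above. It assumes the first two timestamps come from one clock and the GIN and proxy lines from another (they differ by a 3-hour offset), so each pair is only compared within its own clock. The variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log lines above.
machine_start = datetime.fromisoformat("2024-09-06 11:49:50.584")
completion_logged = datetime.fromisoformat("2024-09-06 11:51:53.234")

# Assumption: the GIN line and the proxy line share the same clock.
gin_completion = datetime.strptime("2024/09/06 - 14:51:53", "%Y/%m/%d - %H:%M:%S")
autostop_logged = datetime.fromisoformat("2024-09-06T14:55:12.643")

# Gap from "Starting machine" to the completed chat request.
cold_start_seconds = (completion_logged - machine_start).total_seconds()

# Gap from the completed chat request to the autostop message.
idle_seconds = (autostop_logged - gin_completion).total_seconds()

print(f"cold start to completion: {cold_start_seconds:.0f}s")
print(f"completion to autostop message: {idle_seconds:.0f}s")
```

This only measures up to the autostop message; the machine itself may take a little longer to actually stop.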

It depends on your trade-off between UX and :money_with_wings:
I’m assuming you don’t have a giant VC-backed vault of cash to burn, so you don’t mind having the user wait a bit.

Is the 72B model needed? If not, try a smaller one, since loading that much into memory at boot increases the cold start time.

Instead of paying for 2-3 minutes of idle time, wrap your Ollama app in a server that proxies requests to Ollama. That way you control when the process is killed.

E.g., after each /api/chat/completions request, set a timeout that calls exit 0. If a new chat request comes in, clear and reset that timeout. Over time that will save you a decent amount of :moneybag:
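
A minimal sketch of that idea in Python, using only the standard library. It assumes Ollama is already running on the same machine at its default port 11434; `IDLE_SECONDS`, `LISTEN_PORT`, `IdleTimer` and `make_handler` are hypothetical names and values chosen for illustration, not anything Fly or Ollama provides.

```python
import os
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default port on the same machine
IDLE_SECONDS = 30.0                    # hypothetical idle window, tune as needed
LISTEN_PORT = 8080                     # hypothetical port for the wrapper


class IdleTimer:
    """Runs on_idle once `seconds` pass without a reset()."""

    def __init__(self, seconds, on_idle):
        self.seconds = seconds
        self.on_idle = on_idle
        self._timer = None
        self._lock = threading.Lock()

    def cancel(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None

    def reset(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.seconds, self.on_idle)
            self._timer.daemon = True
            self._timer.start()


def make_handler(timer):
    class ProxyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            timer.cancel()  # do not exit while this request is in flight
            try:
                length = int(self.headers.get("Content-Length", 0))
                upstream = urllib.request.Request(
                    OLLAMA_URL + self.path,
                    data=self.rfile.read(length),
                    headers={"Content-Type": "application/json"},
                    method="POST",
                )
                try:
                    with urllib.request.urlopen(upstream) as resp:
                        status, payload = resp.status, resp.read()
                except urllib.error.HTTPError as err:
                    status, payload = err.code, err.read()
                except urllib.error.URLError:
                    status, payload = 502, b'{"error": "ollama unreachable"}'
                self.send_response(status)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(payload)))
                self.end_headers()
                self.wfile.write(payload)
            finally:
                timer.reset()  # restart the idle countdown after each request

    return ProxyHandler


def main():
    # os._exit(0) ends the whole process with exit code 0 from the timer thread.
    timer = IdleTimer(IDLE_SECONDS, lambda: os._exit(0))
    timer.reset()
    server = ThreadingHTTPServer(("0.0.0.0", LISTEN_PORT), make_handler(timer))
    server.serve_forever()
```

Calling `main()` starts the wrapper; clients would then talk to the wrapper's port instead of :11434. The sketch buffers the full response (no streaming) and does not count concurrent requests, so a second request still in flight could be cut off when the timer fires. Treat it as a starting point only.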

I’m using a chat UI in another app that calls the completion endpoint through flycast. I also tried hitting the port directly through .internal:11434. I don’t know why, but the machine was stopped faster that way than when using flycast.

Yes, I’m looking to set that up.

Besides the L40S cost, I assume I’m also paying for the default performance 8x CPU and the 32GB of VM RAM… It’s not that clear in the docs.

Because .internal doesn’t go through the Fly proxy, which is what keeps the machine awake. The downside is that it won’t wake up your app if it’s asleep.

Again… don’t hit the Ollama server itself (:11434), hit a wrapper.

Yes, I’m trying it.
Just an L40S + performance 8x and 64GB RAM is $2.0974/h in ORD.
Need to optimize and cut costs.
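
To put that rate in perspective, here is a rough Python sketch of what one cold start plus idle tail costs. It assumes usage is billed in proportion to running seconds at the quoted hourly rate, and it reuses the approximate 2-minute and 4-minute figures from the first post; neither assumption is confirmed by the billing docs.

```python
HOURLY_RATE_USD = 2.0974   # L40S + performance 8x + RAM in ORD, as quoted above
COLD_START_SECONDS = 120   # ~2 minutes from cold start to completion
IDLE_SECONDS = 240         # ~4 minutes until the machine stops


def cost_usd(seconds, hourly_rate=HOURLY_RATE_USD):
    """Cost of running for `seconds`, assuming per-second proportional billing."""
    return hourly_rate * seconds / 3600


cold_start_cost = cost_usd(COLD_START_SECONDS)
idle_cost = cost_usd(IDLE_SECONDS)

print(f"cold start: ${cold_start_cost:.4f}")
print(f"idle tail:  ${idle_cost:.4f}")
print(f"per wake-up: ${cold_start_cost + idle_cost:.4f}")
```

Under these assumptions the idle tail costs about twice as much as the cold start itself, which is why shortening it is the first thing to try.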
