Hi! I’m doing some tests here with Ollama.
I’ve set up an L40S machine with 32GB RAM and a 50GB volume to pull a 72B model.
From the logs below, I see it took about 2 minutes from cold start to finish the chat completion, and about 4 minutes for the machine to stop.
I’m using Flycast, with auto_stop_machines = 'stop' and min_machines_running = 0.
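Roughly the relevant part of my fly.toml looks like this (retyped here, so treat it as a sketch rather than the exact file):

```toml
[http_service]
  internal_port = 11434          # Ollama's default port
  auto_stop_machines = 'stop'
  auto_start_machines = true
  min_machines_running = 0
```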
Another thing I’ve realized is that it forces me onto a performance-8x CPU.
How can I get more performance at lower cost running Ollama?
When the bill comes, I want to see the details in the app billing to check how all of this was charged.
It depends on your trade-off of UX vs. cost.
I’m assuming you don’t have a giant VC-backed vault of cash to burn, so you don’t mind having the user wait a bit.
Is the 72B model really needed? If not, try a smaller one, since loading that much into memory at boot adds to the cold start time.
Instead of paying for 2-3 minutes of idle time, wrap your Ollama app in a server that proxies requests to Ollama; that way you have control over when to kill the process.
E.g., after each /api/chat completion, set a timeout that exits the process with status 0. If a new chat request comes in, clear and reset that timeout. Over time that will save you a decent amount of money.
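Here’s a rough sketch of what I mean, assuming Ollama listens on localhost:11434 inside the machine and your Fly service routes to the proxy instead; the port, idle timeout, and header handling are all placeholders to adapt:

```python
import os
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

OLLAMA_URL = "http://127.0.0.1:11434"   # assumption: Ollama is listening here inside the machine
IDLE_SECONDS = 120                      # assumption: tune to your UX vs cost trade-off

_timer = None
_lock = threading.Lock()

def _reset_idle_timer():
    """(Re)start the idle countdown; when it fires, exit 0 so the Machine can stop."""
    global _timer
    with _lock:
        if _timer is not None:
            _timer.cancel()
        _timer = threading.Timer(IDLE_SECONDS, lambda: os._exit(0))
        _timer.daemon = True
        _timer.start()

class OllamaProxy(BaseHTTPRequestHandler):
    def _forward(self):
        _reset_idle_timer()  # a request arrived: clear and restart the countdown
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(
            OLLAMA_URL + self.path,
            data=body,
            method=self.command,
            headers={k: v for k, v in self.headers.items()
                     if k.lower() not in ("host", "content-length")},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                self.send_response(resp.status)
                for k, v in resp.getheaders():
                    if k.lower() not in ("transfer-encoding", "connection"):
                        self.send_header(k, v)
                self.end_headers()
                # Ollama streams responses; pass chunks through as they arrive.
                while True:
                    chunk = resp.read(8192)
                    if not chunk:
                        break
                    self.wfile.write(chunk)
        except urllib.error.HTTPError as e:
            self.send_response(e.code)
            self.end_headers()
            self.wfile.write(e.read())
        _reset_idle_timer()  # count idle time from the end of the response, not the start

    do_GET = do_POST = do_DELETE = _forward

if __name__ == "__main__":
    _reset_idle_timer()
    # assumption: the service's internal_port points at 8080 (this proxy), not 11434
    ThreadingHTTPServer(("0.0.0.0", 8080), OllamaProxy).serve_forever()
```

The point of exiting cleanly is that the Machine stops as soon as you decide it’s idle, instead of waiting out the proxy’s own idle window, and with min_machines_running = 0 it gets started again on the next request.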
I’m using a chat UI in another app, calling the completion through Flycast. I also tried hitting the port through .internal:11434; I don’t know why, but the machine was stopped faster that way than when going through Flycast.
Yes, I’m looking to set that up.
On top of the L40S costs, I assume I’m also paying for the default performance-8x CPU and the 32GB of VM RAM… it’s not that clear in the docs.