RAG Chatbot w/ GPU - high cost per month? able to scale/elasticity?

madsciai · November 22, 2024, 2:42pm

TL;DR: Is it expensive to run GPUs on Fly?

Hi, I’m hoping to launch my RAG AI chatbot app on Fly soon, and I’m looking at possible costs. I’ve heard from other devs, Reddit, etc. that Fly is very reasonably priced with low monthly bills, which is great, especially at the beginning.

However, in my case I’m looking at GPUs and, I guess, Volumes for storing chat histories etc. Can Volumes store vector embeddings too?
I might have been misusing the calculator, but my guesstimates with the L40S put me at nearly $1k/month. The A10 came out a bit worse, really.

It’s very possible my site won’t see much traffic at all; it would still help me aggregate job matches, and I’d get what I want. It isn’t an AI tool people have seen much of before, though there are other AI applications being built in the same hiring/HR sector.

I very much do want to be able to (auto?)scale for high or elastic traffic if it somehow comes to that (goes viral, gets hyped, etc.). I want to be very prepared (though mindful of Murphy’s Law…).

The app’s entire purpose is to get me a new job, so if it goes viral I’ll be choosing amongst the highest bidders… so it’s okay if I get a high Fly bill at that point. I’m willing to bet on myself.

But I read that GPUs are billed per second of usage within the quota, so my calculator fears may be overblown.
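For anyone wanting to sanity-check the calculator, here is a rough sketch of the arithmetic behind per-second billing. The hourly rate is a placeholder I picked for illustration, not a quoted Fly price, and `monthly_gpu_cost` is a made-up helper; plug in the rate from the pricing page for the GPU you care about.

```python
def monthly_gpu_cost(hourly_rate_usd: float,
                     active_seconds_per_day: float,
                     days: int = 30) -> float:
    """Estimate a month's GPU bill when you only pay for seconds the machine runs."""
    active_hours_per_day = active_seconds_per_day / 3600
    return hourly_rate_usd * active_hours_per_day * days


# ASSUMPTION: placeholder rate in USD/hour, for illustration only.
PLACEHOLDER_RATE = 1.25

always_on = monthly_gpu_cost(PLACEHOLDER_RATE, 24 * 3600)
two_hours = monthly_gpu_cost(PLACEHOLDER_RATE, 2 * 3600)

print(f"always on: ${always_on:,.2f}/month")
print(f"2 h/day:   ${two_hours:,.2f}/month")
```

With that placeholder rate, an always-on machine comes to $900 for a 30-day month, while two hours of actual runtime per day comes to $75. The gap is the whole point: a calculator estimate that assumes 24/7 uptime can overstate the bill for a low-traffic app by an order of magnitude.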

Any input is appreciated! I’m brand new to Fly, though I’ve heard y’all on Changelog when I listen to them! <3

khuezy · November 23, 2024, 2:17am

You’ll need to implement your own metrics scaler; see “Autoscale based on metrics” in the Fly Docs.

To minimize costs, you’ll need to manage when machines auto-stop. By default Fly will suspend machines after 7-15 minutes, but you can use the Fly Machines API to shut them down earlier. Depending on the model size, it can take 30+ seconds to load your models into memory, so you’ll need to balance the trade-off between UX and cost.
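One way to handle the early shutdown is an idle timer inside the app itself. This is only a minimal sketch: `IdleWatchdog` and its `stop` callback are hypothetical names, and how `stop` actually halts the machine (a Machines API call, exiting the process, etc.) is left for you to fill in.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Track the last request time and fire a stop callback once idle too long."""

    def __init__(self,
                 idle_timeout_s: float,
                 stop: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._stop = stop            # ASSUMPTION: caller supplies the real shutdown logic
        self._clock = clock          # injectable so the timer can be tested without sleeping
        self._last_activity = clock()
        self.stopped = False

    def touch(self) -> None:
        """Call on every incoming chat request to reset the idle timer."""
        self._last_activity = self._clock()

    def idle_for(self) -> float:
        """Seconds elapsed since the last recorded request."""
        return self._clock() - self._last_activity

    def check(self) -> bool:
        """Call periodically; invokes stop exactly once after the timeout passes."""
        if not self.stopped and self.idle_for() >= self.idle_timeout_s:
            self.stopped = True
            self._stop()
        return self.stopped
```

You would call `touch()` from your request handler and `check()` from a background loop every few seconds. A shorter `idle_timeout_s` saves GPU seconds, but it means more users hit the model-loading delay, which is the UX/cost trade-off described above.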

madsciai · November 23, 2024, 3:48am

Gotcha. I’ve been loading models on my local machine with LM Studio, and it’s handy that it shows the model size, so I know how long they take to load, at least on my machine.

And thank you for the pointer on auto-stop timing. I think I could take advantage of the Fly Machines API for the first release.

As for the metrics scaler, I’d be curious whether you know of any metrics that have worked well for AI workloads: what’s best to measure, and how? I’ve seen lots of interesting evaluators.

system · November 30, 2024, 3:48am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
