RAG Chatbot w/ GPU - high cost per month? able to scale/elasticity?

TLDR: is it expensive to run GPUs on Fly?

Hi, I’m hoping to launch my RAG AI chatbot app on Fly soon, and I’m looking at possible costs. I’ve heard from other devs/Reddit etc. that Fly is super reasonably priced, low monthly bills, which is super great especially at the beginning.

However, in my case I’m looking at GPUs, and I guess Volumes for storing chat histories etc. Can Volumes also store vector embeddings?
I might have been messing up the calculator, but my guesstimates with the L40S put me at nearly $1k/month. The A10 came out a bit worse, really.

It’s very possible my site won’t see much traffic at all. It will still help me aggregate job matches, and I’ll get what I want. It isn’t an AI tool people have seen much of before, though other AI applications are being built in the same hiring/HR sector.

I very much want to be able to (auto?)scale elastically if traffic somehow spikes, e.g. if it goes viral or gets hyped. I want to be very prepared (though mindful of Murphy’s Law…).

The app’s entire purpose is to get me a new job, so if it goes viral I’ll be choosing amongst the highest bidders… so it’s okay if I get a high Fly bill at that point. I’m willing to bet on myself.

But I read that GPUs are billed per second of usage, within the quota, so my calculator fears may be overblown.
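For a rough sense of how per-second billing changes the math, here is a minimal sketch. The hourly rate is a placeholder assumption, not Fly's actual price, and `monthly_gpu_cost` is a made-up helper name; plug in the numbers from the pricing calculator. It also ignores anything billed separately from GPU run time, such as storage.

```python
SECONDS_PER_HOUR = 3600

def monthly_gpu_cost(hourly_rate_usd, active_seconds_per_day, days=30):
    """Estimate a month's GPU bill when only running seconds are billed."""
    per_second_rate = hourly_rate_usd / SECONDS_PER_HOUR
    return per_second_rate * active_seconds_per_day * days

# Placeholder rate (assumption): replace with the real figure for your GPU.
ASSUMED_HOURLY_RATE_USD = 1.25

# Machine running around the clock vs. running only ~2 hours a day.
always_on = monthly_gpu_cost(ASSUMED_HOURLY_RATE_USD, 24 * SECONDS_PER_HOUR)
light_use = monthly_gpu_cost(ASSUMED_HOURLY_RATE_USD, 2 * SECONDS_PER_HOUR)

print(f"always on: ${always_on:,.2f}/month")
print(f"~2h/day:   ${light_use:,.2f}/month")
```

If the calculator assumes the machine runs 24/7, the low-traffic case could come out at a small fraction of that estimate, provided the machine actually stops when idle.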

Any input is appreciated! I’m brand new to Fly, though I’ve heard y’all on Changelog when I listen to them! <3

You’ll need to implement your own metrics scaler; see “Autoscale based on metrics” in the Fly Docs.

To minimize costs you’ll need to manage when to auto stop. By default Fly will suspend machines after 7-15 minutes, but you can use the Fly Machines API to shut down earlier. Depending on the model size, it can take 30+ seconds to load your models into memory, so you’ll need to balance the trade-off between UX and cost.
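One way to decide when to stop early is a small idle tracker inside the app. This is only a sketch under assumptions: `IdleTracker` and the 120-second timeout are made up for illustration, and the actual stop step (for example, a call to the Machines API) is left out.

```python
import time

class IdleTracker:
    """Tracks time since the last request to decide when to shut down."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_request = clock()

    def touch(self):
        """Call this at the start of every chat request."""
        self._last_request = self._clock()

    def should_stop(self):
        """True once no request has arrived for idle_timeout_s seconds."""
        return self._clock() - self._last_request >= self.idle_timeout_s

# Demo with a fake clock so nothing actually has to wait.
fake_now = [0.0]
tracker = IdleTracker(idle_timeout_s=120, clock=lambda: fake_now[0])

fake_now[0] = 60.0
print(tracker.should_stop())  # still within the idle window

fake_now[0] = 180.0
print(tracker.should_stop())  # idle long enough to stop
```

A shorter timeout saves GPU seconds but makes it more likely the next visitor waits through a model load, so the timeout is the knob for the UX vs. cost trade-off.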

Gotcha. I’ve been loading models on my local machine with LM Studio, and it’s handy that it shows the model size, so I know how long a load takes, at least for me.

And thank you for the pointer on auto stop timing. I think I could take advantage of the Fly Machines API for the first release.

As for the metrics scaler, I’d be curious if you knew of any metrics that’ve worked for AI, what’s best to measure and how. I’ve seen lots of interesting evaluators.
