Help me reason about best practice for ephemeral machines with large volumes / storage requirements

I’m thinking about some upcoming architecture. I’m working on some code that involves processing large (multi-gigabyte) image files.

Currently, I’m spinning up a meaty performance container on a request, grabbing a large file from Tigris, doing a bunch of tasks, writing data back to Tigris, and then letting scale-to-zero kick in.
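
For concreteness, here is a minimal sketch of that per-request flow. The `download`/`upload` callables are hypothetical stand-ins for Tigris's S3-compatible API (e.g. boto3 `get_object`/`put_object`), and the hashing step is just a placeholder for the real image work; the chunked loop is the part that keeps memory bounded on multi-gigabyte inputs:

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK = 8 * 1024 * 1024  # 8 MiB: stream large files instead of loading them whole

def process_file(path: Path) -> str:
    """Stand-in for the real image-processing step: read the file in
    chunks so memory stays bounded regardless of file size. Here we
    just hash it; the real task would transform the data instead."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def handle_request(key: str, download, upload) -> str:
    """One request: pull the object to scratch disk, process it, push
    results back. `download` and `upload` are hypothetical callables
    wrapping Tigris's S3-compatible API."""
    with tempfile.TemporaryDirectory() as scratch:
        local = Path(scratch) / "input"
        download(key, local)          # object -> scratch disk
        result = process_file(local)  # the slow, CPU- and I/O-heavy part
        upload(f"{key}.result", result.encode())  # results back to Tigris
    return result
```

Once `handle_request` returns, there is nothing left on local disk, which is what makes scale-to-zero safe here.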

This is fine for prototyping. As I think about moving to production, the requirements become more complex. One option is to queue all these requests, and ensure the number of simultaneous queue tasks is the same as the maximum scale of these machines. But that feels limiting, and not great for scale.

I’m likely to have a queue anyway, just to limit concurrency: these tasks are slow/complex, but performed infrequently, and I really don’t want to be running these machines when I don’t have to. At the same time… I was beginning to wonder if I’m just going to need n machines and n attached volumes to work as scratch space.

Perhaps the correct approach would be to dynamically create volumes via the API, though I’m not quite sure where the place to do that would be.
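
A sketch of what that dynamic creation could look like against the Machines API. The `api.machines.dev` endpoint and the field names reflect my understanding of the API; verify them against the docs before relying on them:

```python
import json
import urllib.request

API_BASE = "https://api.machines.dev/v1"  # Fly Machines API host (assumption: verify)

def volume_create_request(app: str, name: str, size_gb: int, region: str, token: str):
    """Build (but don't send) the POST request to create a volume.
    Field names follow the Machines API as I understand it."""
    body = json.dumps({"name": name, "size_gb": size_gb, "region": region}).encode()
    return urllib.request.Request(
        f"{API_BASE}/apps/{app}/volumes",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is then just urllib.request.urlopen(req), probably from the
# same place that calls POST /apps/{app}/machines, so the new volume's ID
# can go straight into the machine's mounts config.
```

That "same place" is the natural answer to "where would I do that": whatever component already decides to start a machine can create the volume one API call earlier.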

Anyhow, I’m thinking out loud, but also asking people who are more experienced with Fly-shaped infrastructure for suggestions.

Can you speak more to this theme? What you describe sounds fine for production.

I do something similar with web crawlers. I have a web app that handles the user side, and a “distributor” app that handles crawler requests. Both of these are small and always-on. The distributor starts machines that self-terminate (and since I use `machine create --rm`, it removes the historical existence of the machine). The parallel limit is enforced in the distributor, and based on the request, it can spin up a small or a large machine.
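
To make the pattern concrete, here is a rough sketch of that distributor logic. The parallel limit and size threshold are made up for illustration, and the `fly machine run` flags are as I recall them, so double-check before using:

```python
import subprocess
import threading

MAX_PARALLEL = 4                      # enforced in the distributor, not by queue depth
slots = threading.BoundedSemaphore(MAX_PARALLEL)

def vm_size_for(request: dict) -> str:
    """Pick a machine size from the request; the 2 GiB threshold and
    size names are illustrative, not prescriptive."""
    big = request.get("bytes", 0) > 2 * 1024**3
    return "performance-4x" if big else "shared-cpu-2x"

def launch(request: dict) -> None:
    """Start a self-terminating worker machine; --rm removes it once it
    exits, as in the crawler setup described above. This shells out to
    flyctl, so it is a sketch rather than a tested invocation."""
    with slots:  # blocks until one of the MAX_PARALLEL slots frees up
        subprocess.run(
            ["fly", "machine", "run", "worker-image:latest",
             "--rm", "--vm-size", vm_size_for(request)],
            check=True,
        )
```

The semaphore is the whole "queue": callers block until a slot opens, so you never exceed the machine budget even without third-party queue infrastructure.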

I wonder if my only error is that I’ve written a fair bit of queue logic that I should have used a third-party for, but it is mostly done now.

> Perhaps the correct approach would be to dynamically create volumes via the API, though I’m not quite sure where the place to do that would be.

What I am doing here is sending the results back to a long-running machine, via an API, so the ephemeral machines do not need a volume. They only die once they have copied their results somewhere safe.
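
That “die only once the results are somewhere safe” rule can be as simple as gating the worker's exit on a successful POST. A sketch, where the collector URL and JSON payload shape are whatever your long-running app happens to expose:

```python
import json
import urllib.request

def report_results(url: str, payload: dict) -> bool:
    """POST results to the long-running collector machine. Only if this
    returns True does the ephemeral worker consider its output safe
    and allow itself to exit cleanly."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError subclasses OSError: network failure, refused, etc.
        return False

# Worker exit policy: exit non-zero (and retry / alert) if the copy failed.
# if not report_results(COLLECTOR_URL, results): sys.exit(1)
```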

I’m watching this thread because I’m curious where it goes but thought I would chime in on one point of @halfer:

> so the ephemeral machines do not need a volume

I would imagine @infovore needs a volume because without it, the root FS only has about 7 GB of free disk space, and he’s processing multi-gigabyte files.

I’m in a similar situation with something I’m working on. Processing isn’t done until the data lands in Tigris, but I still need up to 80 GB of disk to do the processing, so I need a “temporary” persistent volume.
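
One way to frame that “temporary” persistent volume: it is three API calls bracketing the job. A sketch of the sequence, where the paths, config keys, and placeholder volume ID reflect my reading of the Machines API and should be treated as assumptions to verify:

```python
def scratch_lifecycle(app: str) -> list:
    """The three Machines API calls, in order, for a temporary scratch
    volume: create it, run the worker with it mounted at /data, destroy
    it when the worker is done. Returned as (method, path, body) tuples
    for inspection; a real orchestrator would send them and thread the
    actual volume ID from the first response into the later calls."""
    vol_id = "vol_123"  # placeholder: comes from the create-volume response
    return [
        ("POST", f"/apps/{app}/volumes",
         {"name": "scratch", "size_gb": 80, "region": "ord"}),
        ("POST", f"/apps/{app}/machines",
         {"config": {
             "image": "worker:latest",
             "size": "performance-2x",          # key name per my reading of the API
             "mounts": [{"volume": vol_id, "path": "/data"}],
             "auto_destroy": True,              # machine removes itself on exit
         }}),
        ("DELETE", f"/apps/{app}/volumes/{vol_id}", None),
    ]
```

The create and destroy calls are cheap relative to an 80 GB job, so paying them per-task avoids keeping n idle volumes around just in case.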


Also, the root filesystem is a super-slow overlay contraption, :dragon:. A persistent volume will get 6×–16× more I/O bandwidth if you use the performance class on the vCPU side.

(There’s a branch in the official docs repository giving exact numbers,† but people have observed this qualitatively in the forum before…)

†May still be somewhat draft status.

This is great info. Thank you @mayailurus.

Yes, this is helpful, and you’ve hit the nail on the head: in production, one or two users is enough to overwhelm the scratch disk on a Machine, so I’m going to need a bigger volume, or more volumes, and things are going to take time. And as volumes are per-machine, I start needing quite a few of them if I’m also going to scale horizontally.

One thing I hadn’t twigged was the performance boost from using volumes: we’re moving these files around and doing a bunch of I/O on them, and improved I/O would make a noticeable difference.

Exactly this.

Looks like that commit was merged into the main docs: Fly Volumes overview · Fly Docs

That max bandwidth for the high-end instances looks extremely low, no?

Yeah, it was merged just a bit ago. (Thanks, @dusty!)

That was my first reaction too, but on the other hand, they do have these shared among multiple VMs, :thought_balloon:.

| VM size         | Max IOPS | Max bandwidth |
|-----------------|----------|---------------|
| shared-cpu-1x   | 4000     | 16 MiB/s      |
| shared-cpu-2x   | 4000     | 16 MiB/s      |
| shared-cpu-4x   | 8000     | 32 MiB/s      |
| shared-cpu-8x   | 8000     | 32 MiB/s      |
| performance-1x  | 12000    | 48 MiB/s      |
| performance-2x  | 16000    | 64 MiB/s      |
| performance-4x  | 16000    | 64 MiB/s      |
| performance-8x  | 32000    | 128 MiB/s     |
| performance-16x | 32000    | 128 MiB/s     |
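
To put those bandwidth caps in perspective, here is the time just to stream an 80 GiB scratch file once at a sustained rate (the file size is borrowed from the use case above):

```python
def seconds_to_stream(gib: float, mib_per_s: float) -> float:
    """Time to read or write `gib` GiB once at a sustained `mib_per_s`."""
    return gib * 1024 / mib_per_s

# 80 GiB at shared-cpu's 16 MiB/s versus performance-8x's 128 MiB/s:
slow = seconds_to_stream(80, 16)    # 5120 s, roughly 85 minutes
fast = seconds_to_stream(80, 128)   # 640 s, roughly 10.7 minutes
```

So even at the top of the table, a single pass over an 80 GiB file costs on the order of ten minutes, which is worth budgeting for before the processing itself starts.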

Either way, this might make a good top-level thread (“topic”) in its own right…
