Help me reason about best practice for ephemeral machines with large volumes / storage requirements

I’m thinking about some upcoming architecture. I’m working on some code that involves processing large (multi-gigabyte) image files.

Currently, I’m spinning up a meaty performance container on a request, grabbing a large file from Tigris, doing a bunch of tasks, writing data back to Tigris, and then letting scale-to-zero kick in.
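
For concreteness, here is a minimal sketch of that per-request flow. The `download`/`upload` callables are hypothetical stand-ins for Tigris's S3-compatible API (e.g. boto3 `get_object`/`put_object`), and the hashing step is just a placeholder for the real image work; the chunked loop is the part that keeps memory bounded on multi-gigabyte inputs:

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK = 8 * 1024 * 1024  # 8 MiB: stream large files instead of loading them whole

def process_file(path: Path) -> str:
    """Stand-in for the real image-processing step: read the file in
    chunks so memory stays bounded regardless of file size. Here we
    just hash it; the real task would transform the data instead."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()

def handle_request(key: str, download, upload) -> str:
    """One request: pull the object to scratch disk, process it, push
    results back. `download` and `upload` are hypothetical callables
    wrapping Tigris's S3-compatible API."""
    with tempfile.TemporaryDirectory() as scratch:
        local = Path(scratch) / "input"
        download(key, local)          # object -> scratch disk
        result = process_file(local)  # the slow, CPU- and I/O-heavy part
        upload(f"{key}.result", result.encode())  # results back to Tigris
    return result
```

Once `handle_request` returns, there is nothing left on local disk, which is what makes scale-to-zero safe here.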

This is fine for prototyping. As I think about moving to production, the requirements become more complex. One option is to queue all these requests, and ensure the number of simultaneous queue tasks is the same as the maximum scale of these machines. But that feels limiting, and not great for scale.

I’m likely to have a queue anyway, just to limit concurrency: these tasks are slow/complex, but performed infrequently, and I really don’t want to be running these machines when I don’t have to. At the same time… I was beginning to wonder if I’m just going to need n machines and n attached volumes to work as scratch space.

Perhaps the correct approach would be to dynamically create volumes via the API, though I’m not quite sure where the place to do that would be.
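
A sketch of what that dynamic creation could look like against the Machines API. The `api.machines.dev` endpoint and the field names reflect my understanding of the API; verify them against the docs before relying on them:

```python
import json
import urllib.request

API_BASE = "https://api.machines.dev/v1"  # Fly Machines API host (assumption: verify)

def volume_create_request(app: str, name: str, size_gb: int, region: str, token: str):
    """Build (but don't send) the POST request to create a volume.
    Field names follow the Machines API as I understand it."""
    body = json.dumps({"name": name, "size_gb": size_gb, "region": region}).encode()
    return urllib.request.Request(
        f"{API_BASE}/apps/{app}/volumes",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is then just urllib.request.urlopen(req), probably from the
# same place that calls POST /apps/{app}/machines, so the new volume's ID
# can go straight into the machine's mounts config.
```

That "same place" is the natural answer to "where would I do that": whatever component already decides to start a machine can create the volume one API call earlier.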

Anyhow, I’m thinking out loud, but also asking people who are more experienced with Fly-shaped infrastructure for suggestions.

Can you speak more to this theme? What you describe sounds fine for production.

I do something similar with web crawlers. I have a web app that handles the user side, and a “distributor” app that handles crawler requests. Both of these are small and always-on. The distributor starts machines that self-terminate (and since I use `machine create --rm`, it removes the historical existence of the machine). The parallel limit is enforced in the distributor, and based on the request, it can spin up a small or a large machine.
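
To make the pattern concrete, here is a rough sketch of that distributor logic. The parallel limit and size threshold are made up for illustration, and the `fly machine run` flags are as I recall them, so double-check before using:

```python
import subprocess
import threading

MAX_PARALLEL = 4                      # enforced in the distributor, not by queue depth
slots = threading.BoundedSemaphore(MAX_PARALLEL)

def vm_size_for(request: dict) -> str:
    """Pick a machine size from the request; the 2 GiB threshold and
    size names are illustrative, not prescriptive."""
    big = request.get("bytes", 0) > 2 * 1024**3
    return "performance-4x" if big else "shared-cpu-2x"

def launch(request: dict) -> None:
    """Start a self-terminating worker machine; --rm removes it once it
    exits, as in the crawler setup described above. This shells out to
    flyctl, so it is a sketch rather than a tested invocation."""
    with slots:  # blocks until one of the MAX_PARALLEL slots frees up
        subprocess.run(
            ["fly", "machine", "run", "worker-image:latest",
             "--rm", "--vm-size", vm_size_for(request)],
            check=True,
        )
```

The semaphore is the whole "queue": callers block until a slot opens, so you never exceed the machine budget even without third-party queue infrastructure.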

I wonder if my only error is that I’ve written a fair bit of queue logic that I should have used a third-party for, but it is mostly done now.

> Perhaps the correct approach would be to dynamically create volumes via the API, though I’m not quite sure where the place to do that would be.

What I am doing here is sending the results back to a long-running machine, via an API, so the ephemeral machines do not need a volume. They only die once they have copied their results somewhere safe.
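
That “die only once the results are somewhere safe” rule can be as simple as gating the worker's exit on a successful POST. A sketch, where the collector URL and JSON payload shape are whatever your long-running app happens to expose:

```python
import json
import urllib.request

def report_results(url: str, payload: dict) -> bool:
    """POST results to the long-running collector machine. Only if this
    returns True does the ephemeral worker consider its output safe
    and allow itself to exit cleanly."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError subclasses OSError: network failure, refused, etc.
        return False

# Worker exit policy: exit non-zero (and retry / alert) if the copy failed.
# if not report_results(COLLECTOR_URL, results): sys.exit(1)
```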

I’m watching this thread because I’m curious where it goes but thought I would chime in on one point of @halfer:

> so the ephemeral machines do not need a volume

I would imagine @infovore needs a volume because without it, the root FS only has about 7 GB of free disk space, and he’s processing multi-gigabyte files.

I’m in a similar situation with something I’m working on. Processing isn’t done until the data lands in Tigris, but I still need up to 80 GB of disk to do the processing, so I need a “temporary” persistent volume.
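
One way to frame that “temporary” persistent volume: it is three API calls bracketing the job. A sketch of the sequence, where the paths, config keys, and placeholder volume ID reflect my reading of the Machines API and should be treated as assumptions to verify:

```python
def scratch_lifecycle(app: str) -> list:
    """The three Machines API calls, in order, for a temporary scratch
    volume: create it, run the worker with it mounted at /data, destroy
    it when the worker is done. Returned as (method, path, body) tuples
    for inspection; a real orchestrator would send them and thread the
    actual volume ID from the first response into the later calls."""
    vol_id = "vol_123"  # placeholder: comes from the create-volume response
    return [
        ("POST", f"/apps/{app}/volumes",
         {"name": "scratch", "size_gb": 80, "region": "ord"}),
        ("POST", f"/apps/{app}/machines",
         {"config": {
             "image": "worker:latest",
             "size": "performance-2x",          # key name per my reading of the API
             "mounts": [{"volume": vol_id, "path": "/data"}],
             "auto_destroy": True,              # machine removes itself on exit
         }}),
        ("DELETE", f"/apps/{app}/volumes/{vol_id}", None),
    ]
```

The create and destroy calls are cheap relative to an 80 GB job, so paying them per-task avoids keeping n idle volumes around just in case.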


Also, the root filesystem is a super-slow overlay contraption, :dragon:. A persistent volume will get 6×–16× more I/O bandwidth if you use the performance class on the vCPU side.

(There’s a branch in the official docs repository giving exact numbers,† but people have observed this qualitatively in the forum before…)

†May still be somewhat draft status.

This is great info. Thank you @mayailurus.

Yes, this is helpful, and you’ve hit the nail on the head: in production, one or two users is enough to overwhelm the scratch disk on a Machine, so I’m going to need a bigger volume, or more volumes, and things are going to take time. And as volumes are per-machine, I start needing quite a few of them if I’m also going to scale horizontally.

One thing I hadn’t twigged was the performance boost from using volumes: we’re moving these files around and doing a bunch of I/O on them, and improved I/O would make a noticeable difference.

Exactly this.

Looks like that commit was merged into the main docs: Fly Volumes overview · Fly Docs

That max bandwidth for the high-end instances looks extremely low, no?

Yeah, it was merged just a bit ago. (Thanks, @dusty!)

That was my first reaction too, but on the other hand, they do have these shared among multiple VMs, :thought_balloon:.

| VM size         | Max IOPS | Max bandwidth |
|-----------------|----------|---------------|
| shared-cpu-1x   | 4000     | 16 MiB/s      |
| shared-cpu-2x   | 4000     | 16 MiB/s      |
| shared-cpu-4x   | 8000     | 32 MiB/s      |
| shared-cpu-8x   | 8000     | 32 MiB/s      |
| performance-1x  | 12000    | 48 MiB/s      |
| performance-2x  | 16000    | 64 MiB/s      |
| performance-4x  | 16000    | 64 MiB/s      |
| performance-8x  | 32000    | 128 MiB/s     |
| performance-16x | 32000    | 128 MiB/s     |
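
To put those bandwidth caps in perspective, here is the time just to stream an 80 GiB scratch file once at a sustained rate (the file size is borrowed from the use case above):

```python
def seconds_to_stream(gib: float, mib_per_s: float) -> float:
    """Time to read or write `gib` GiB once at a sustained `mib_per_s`."""
    return gib * 1024 / mib_per_s

# 80 GiB at shared-cpu's 16 MiB/s versus performance-8x's 128 MiB/s:
slow = seconds_to_stream(80, 16)    # 5120 s, roughly 85 minutes
fast = seconds_to_stream(80, 128)   # 640 s, roughly 10.7 minutes
```

So even at the top of the table, a single pass over an 80 GiB file costs on the order of ten minutes, which is worth budgeting for before the processing itself starts.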

Either way, this might make a good top-level thread (“topic”) in its own right…
