I’m thinking about some upcoming architecture. I’m working on some code that involves processing large (multi-gigabyte) image files.
Currently, I’m spinning up a meaty performance container on a request, grabbing a large file from Tigris, doing a bunch of tasks, writing data back to Tigris, and then letting scale-to-zero kick in.
This is fine for prototyping. As I think about moving to production, the requirements become more complex. One option is to queue all these requests, and ensure the number of simultaneous queue tasks is the same as the maximum scale of these machines. But that feels limiting, and not great for scale.
I’m likely to have a queue anyway, just to limit concurrency: these tasks are slow/complex, but performed infrequently, and I really don’t want to be running these machines when I don’t have to. At the same time… I was beginning to wonder if I’m just going to need n machines and n attached volumes to work as scratch space.
Perhaps the correct approach would be to dynamically create volumes via the API, though I’m not quite sure where in the flow that creation should happen.
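Something like this is what I have in mind, as a rough sketch against the Machines API (the app name, region, and sizes are placeholders, and the endpoint/field names are from a skim of the docs, so treat them as guesses):

```python
# Rough sketch: provision a scratch volume for one job via the Machines API,
# then tear it down once the results are back in Tigris. App name and sizes
# are placeholders; endpoint and field names are my reading of the docs.
import os
import requests

API = "https://api.machines.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FLY_API_TOKEN']}"}
APP = "image-workers"  # hypothetical app name

def create_scratch_volume(region: str, size_gb: int) -> str:
    """Create a volume for one processing job and return its id."""
    resp = requests.post(
        f"{API}/apps/{APP}/volumes",
        headers=HEADERS,
        json={"name": "scratch", "region": region, "size_gb": size_gb},
    )
    resp.raise_for_status()
    return resp.json()["id"]

def destroy_volume(volume_id: str) -> None:
    """Delete the volume once the outputs have been written back to Tigris."""
    resp = requests.delete(f"{API}/apps/{APP}/volumes/{volume_id}", headers=HEADERS)
    resp.raise_for_status()
```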
Anyhow, I’m thinking out loud, but also asking people who are more experienced with Fly-shaped infrastructure for suggestions.
Can you say a bit more about this? It sounds fine for production to me.
I do something similar with web crawlers. I have a web app that handles the user side, and a “distributor” app that handles crawler requests. Both of these are small and always-on. The distributor starts machines that self-terminate (and since I use machine create --rm, the machine is removed entirely once it exits rather than lingering in a stopped state). The parallel limit is enforced in the distributor, and based on the request, it can spin up a small or a large machine.
I wonder if my only mistake is that I’ve written a fair bit of queue logic that I should have used a third-party library for, but it’s mostly done now.
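For what it’s worth, the distributor boils down to something like this. It’s a sketch rather than my actual code, and the Machines API endpoint and config field names are from memory, so double-check them:

```python
# Sketch of the distributor's core: a semaphore caps how many worker Machines
# run at once, and the request decides whether we ask for a small or a large
# guest. App name, image, and sizes are placeholders; the Machines API
# endpoint and config fields are from memory, so double-check them.
import asyncio
import os

import httpx

API = "https://api.machines.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FLY_API_TOKEN']}"}
APP = "crawler-workers"         # hypothetical app the worker Machines live in
limiter = asyncio.Semaphore(4)  # the parallel limit the distributor enforces

GUESTS = {
    "small": {"cpu_kind": "shared", "cpus": 1, "memory_mb": 512},
    "large": {"cpu_kind": "performance", "cpus": 4, "memory_mb": 8192},
}

async def run_job(job_url: str, size: str) -> None:
    """Start a self-terminating Machine for one job, holding a slot until it is gone."""
    async with limiter, httpx.AsyncClient() as client:
        resp = await client.post(
            f"{API}/apps/{APP}/machines",
            headers=HEADERS,
            json={
                "config": {
                    "image": "registry.fly.io/crawler-workers:latest",  # placeholder image
                    "guest": GUESTS[size],
                    "env": {"JOB_URL": job_url},
                    "auto_destroy": True,  # roughly what --rm gives you on the CLI
                }
            },
        )
        resp.raise_for_status()
        machine_id = resp.json()["id"]
        # Hold the semaphore slot until the Machine is stopped or gone
        # (a 404 means auto_destroy has already removed it).
        while True:
            await asyncio.sleep(10)
            status = await client.get(f"{API}/apps/{APP}/machines/{machine_id}", headers=HEADERS)
            if status.status_code == 404 or status.json().get("state") in ("stopped", "destroyed"):
                return
```

Each incoming request then just schedules run_job with whatever size it asked for; the semaphore keeps at most four of them actually running.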
> Perhaps the correct approach would be to dynamically create volumes via the API, though I’m not quite sure where in the flow that creation should happen.
What I am doing here is sending the results back to a long-running machine, via an API, so the ephemeral machines do not need a volume. They only die once they have copied their results somewhere safe.
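To be explicit about the “only die once results are safe” bit, the worker side is roughly: post the results to the always-on app over the private network, and only exit once that call succeeds. A sketch, with a made-up internal endpoint:

```python
# Worker-side sketch: push the finished results to the always-on app over
# HTTP, and only exit (letting --rm / auto_destroy clean up) once that call
# has succeeded. The hostname and endpoint are placeholders, not a real API.
import sys
import requests

RESULTS_URL = "http://distributor.internal:8080/results"  # hypothetical internal endpoint

def ship_results(job_id: str, payload: bytes) -> None:
    resp = requests.post(RESULTS_URL, params={"job": job_id}, data=payload, timeout=60)
    resp.raise_for_status()  # don't exit until the results are somewhere safe

if __name__ == "__main__":
    job_id = sys.argv[1]
    with open("/tmp/results.json", "rb") as f:  # placeholder results file
        ship_results(job_id, f.read())
    # The process exits here; the Machine self-terminates and is removed.
```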
I’m watching this thread because I’m curious where it goes, but thought I would chime in on one point of @halfer’s:
> so the ephemeral machines do not need a volume
I would imagine @infovore needs a volume because without it, the root FS only has about 7 GB of free disk space and he’s processing multi-gigabyte files.
I’m in a similar situation with something I’m working on. Processing isn’t done until the data lands in Tigris, but I still need up to 80 GB of disk to do the processing, so I need a “temporary” persistent volume.
Also, the root FS is a super-slow overlay contraption. A persistent volume will get up to 6×–16× more I/O bandwidth, particularly if you use the performance class on the vCPU side.
(There’s a branch in the official docs repository giving exact numbers†, but people have observed this qualitatively in the forum before…)
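For concreteness, the bits of a Machine config that matter for this setup are just the volume mount and the guest class. A sketch with placeholder names; the field names are my reading of the Machines API docs rather than gospel:

```python
# Sketch of the relevant parts of a Machine config: a big scratch volume
# mounted for the working set, and a performance-class guest for the extra
# I/O bandwidth. Image name and volume id are placeholders.
scratch_machine_config = {
    "image": "registry.fly.io/image-workers:latest",  # placeholder image
    "guest": {"cpu_kind": "performance", "cpus": 8, "memory_mb": 16384},
    "mounts": [
        {
            "volume": "vol_1234567890",  # id of the dynamically created volume
            "path": "/scratch",          # where the multi-gigabyte working set lives
        }
    ],
}
```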
Yes, this is helpful, and you’ve hit the nail on the head: in production, one or two users are enough to overwhelm the scratch disk on a Machine, so I’m going to need a bigger volume, or more volumes, and these jobs take time. And as volumes are attached to a single Machine, I start needing quite a few of them if I’m also going to scale horizontally.
One thing I hadn’t twigged was the performance boost from using volumes: we’re moving these files around and doing a lot of I/O on them, so improved I/O would make a noticeable difference.