Best approach for burst storage

Hello there!

I intend to use Fly.io for large file processing in my application, but in some cases the files aren't suitable for streaming while processing, so I may need to store them on a file system accessible from the Machine the application is running on. These files can get quite large, and at peak the space consumed might be almost twice the file's size, which could mean needing 10 or even 20GB of space ready for processing in some cases.

I only realistically need the storage for a couple of minutes; even 15 minutes would be a stretch. But volumes are pro-rated per hour as far as I've seen in the documentation, which makes it feel like I'd be paying for far longer storage than I actually use. On the other hand, as far as I can tell, the rootfs of a Machine can only grow to ~8GB (and technically isn't even really meant for writing). Am I completely missing the right approach here? Or would I just have to go with persistent volumes anyway and manage them somehow to make the most of the hour of storage once it's emptied?
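(For a rough sense of what that pro-rating implies, here's how I'm reading it; the per-GB rate below is just a placeholder, not Fly.io's actual price:)

```python
import math

# Back-of-envelope only: MONTHLY_RATE_PER_GB is a placeholder,
# not the actual Fly.io volume price -- check the pricing page.
MONTHLY_RATE_PER_GB = 0.15  # hypothetical $/GB/month
HOURS_IN_MONTH = 30 * 24

def hourly_volume_cost(size_gb: float, hours_used: float) -> float:
    """Pro-rated-per-hour cost: a partial hour bills as a full hour."""
    billed_hours = max(1, math.ceil(hours_used))
    return size_gb * MONTHLY_RATE_PER_GB / HOURS_IN_MONTH * billed_hours

# A 20GB volume held for 15 minutes bills the same as a full hour:
print(hourly_volume_cost(20, 0.25) == hourly_volume_cost(20, 1.0))
```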

(Edit: Thinking about it, my current idea is to create a new volume along with each new Machine, then delete the Machine but not yet the volume (since it's already paid for the hour). Another Machine could then pick it up for the next processing job, and I can delete surplus volumes once the hour is up. Is there anything I can improve about this line of thinking?)

Any suggestions and help are appreciated, thank you for reading!

This is indeed the best approach, in my opinion. More broadly, think of the Machines themselves as being disposable. The platform overall is designed around that style, pretty much.

If your current fleet doesn’t fit the processing needs of the moment, then just destroy them and create new ones that do have the right CPU, disks, etc.
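As for the edit in the question (reuse a detached volume until its billed hour is nearly up, then delete it): the bookkeeping for that reaper is small. A sketch, where the function name and the tuple shape are entirely made up rather than anything from the Machines API:

```python
import datetime as dt

def volumes_to_delete(volumes, now):
    """Return ids of detached volumes nearing the end of their current
    billed hour, i.e. the ones worth deleting before a new hour starts.
    `volumes` is a list of (volume_id, created_at, attached) tuples --
    a made-up shape; the actual deletion call is left to your API client."""
    doomed = []
    for vol_id, created_at, attached in volumes:
        if attached:
            continue  # still in use by a Machine
        seconds_into_hour = (now - created_at).total_seconds() % 3600
        if seconds_into_hour > 55 * 60:  # last 5 minutes of the billed hour
            doomed.append(vol_id)
    return doomed

now = dt.datetime(2024, 1, 1, 12, 0)
vols = [
    ("vol_a", now - dt.timedelta(minutes=56), False),  # hour nearly up: delete
    ("vol_b", now - dt.timedelta(minutes=10), False),  # paid time left: keep
    ("vol_c", now - dt.timedelta(minutes=58), True),   # attached: keep
]
print(volumes_to_delete(vols, now))  # -> ['vol_a']
```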

The only caveat is that it’s best to stay a little ahead of instantaneous demand, since Machine creation times have a ton of variance. Generally, it’s wise to have a few already on standby.

Also, the root filesystem is super-slow: throttled to ~8MB/s.

Recent commits to flyctl make it look like Fly.io might be planning some way to make it larger, but it still wouldn’t be suitable for processing large files…

Other things that people mention in this context are TigrisFS, NBD, and (now that it's been recently enabled) 9P, but I wouldn't think those would be great for 20GB bursts.

Thank you for your detailed response, much appreciated! It does leave me wondering, though:

What would you say should be considered instantaneous demand? And how long should I expect creation times to be on the high end?

That’s a good question, actually… Qualitatively, it’s too slow to be done interactively. (Not with much reliability, in my view.)

For example, if this was a video subtitling service, with the user uploading a file and then pressing Go, and getting reassuring “currently processing this scene” thumbnails as it progresses, then when you do receive the POST request at your end… it’s probably already too late to be creating the corresponding worker Machine now.

The “instantaneous demand” would basically be the number of workers strictly needed for the users who have already pressed Go. More sophisticated capacity management would take into account signals about the shape of the near future: historical statistics for this time of day, how many users have the submission page open but haven't really done anything with it yet, and so on. But even simplistic measures like “just keep some spares around” seem to yield gigantic improvements for people.
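If it helps, that simplistic policy is tiny to express; everything here (the names, the buffer of 2) is made up for illustration:

```python
def machines_to_create(active_jobs: int, ready_workers: int,
                       spare_buffer: int = 2) -> int:
    """Workers to create right now: enough to cover everyone who has
    already pressed Go, plus a small cushion of idle spares. Never
    negative -- scaling down is a separate, less urgent decision."""
    return max(0, active_jobs + spare_buffer - ready_workers)

print(machines_to_create(active_jobs=5, ready_workers=4))  # -> 3
```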

Moreover, the Machines platform concept itself revolves around pre-creation (so to speak), with auto-start (and now auto-unsuspend) intended for immediate demand response. (Waking up from suspend in particular is very fast. I wonder sometimes whether they’re singing the praises of that insufficiently, actually, now that it’s exited(?) its lengthy and mildly murky experimental phase.)

Fly.io has mentioned in the past that creation time depends a lot on the exact structure of your Docker image, and they’ve also been doing extensive work themselves behind the scenes on things like region-optimized image operations, so latencies have been getting better. (And, indeed, there are far fewer complaints lately in the forum about slow Machine creation.)

Still, I don’t think “ton of variance” is going out on a limb. :sweat_smile: I easily see a 10× discrepancy (on the same image) with my own Machines, and those are at the smaller end of the size spectrum.


I wish I also had quantitative feedback, actual numbers, to give you at this point, but I mainly only use Fly.io for ad hoc experiments myself. In addition to doing your own measurements, it might be fruitful to post a new top-level topic asking particularly about people’s recent Machine-creation statistics. There are some avid data hounds in the forum here, so odds are good that you’ll get some…

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.