Fly machine snapshot + restore?

jaredp · November 20, 2022, 6:46am

Hi! Knowing that Machines is built on top of Firecracker, are there plans for supporting snapshotting a machine, restoring it at a later time, and/or cloning a snapshot to spawn new pre-heated machines?

It looks like Machines can currently be stopped and restarted, but restarting them means a full reboot. From the docs:

“Machines that are restarted are completely reset to their original state so that they start clean on the next run.”

Firecracker is capable of snapshotting+restoring guest memory, so programs running when the machine is paused are resumed from where they left off.
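For anyone curious what that looks like at the Firecracker level, here is a rough sketch of the pause, snapshot, and restore calls against Firecracker's HTTP API over its Unix socket. This is not how Fly does (or would do) it; the socket and file paths are placeholders, and the exact request fields (for example `mem_backend` versus the older `mem_file_path` on load) vary between Firecracker versions, so check the API spec for the version you run.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket (Firecracker's API transport)."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def pause_request():
    # The VM must be paused before a snapshot can be taken.
    return ("PATCH", "/vm", {"state": "Paused"})


def create_snapshot_request(snapshot_path, mem_path):
    # Full snapshot: VM state goes to snapshot_path, guest memory to mem_path.
    return ("PUT", "/snapshot/create", {
        "snapshot_type": "Full",
        "snapshot_path": snapshot_path,
        "mem_file_path": mem_path,
    })


def load_snapshot_request(snapshot_path, mem_path):
    # Sent to a freshly started Firecracker process; resume_vm resumes the guest
    # right after loading. Field names are assumed from recent API versions.
    return ("PUT", "/snapshot/load", {
        "snapshot_path": snapshot_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_path},
        "resume_vm": True,
    })


def send(socket_path, request):
    """Send one (method, path, body) request and return the HTTP status code."""
    method, path, body = request
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(method, path, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status
    finally:
        conn.close()
```

Usage would be something like `send("/tmp/firecracker.socket", pause_request())` followed by `send(..., create_snapshot_request("/tmp/vm.snap", "/tmp/vm.mem"))`, then later `load_snapshot_request` against a new Firecracker process. Cloning would amount to loading the same snapshot files into several processes, though networking and unique-identity concerns make that harder in practice than this sketch suggests.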

Pausing/resuming to save money when idle would be much more valuable for me if done with snapshots (rather than rebooting) because my servers have long start times, so starting from a fresh boot would be slow, whereas resuming from snapshot could be effectively transparent.

Further, being able to spawn new machines by cloning a saved snapshot would allow me to scale horizontally much more easily.

Cheers, I always look forward to seeing what you folks come up with next!

ignoramous · November 20, 2022, 9:28am

Agreed.

And yeah, that’s planned for sometime this year™️ (see “VM Snapshot vs Volume Snapshot”, post #3 by kurt).

lewis · February 21, 2023, 3:13am

FWIW, this is the last thing that’s still keeping me on Lambda for some workloads.

As far as I understand, the Lambda lifecycle goes something like this:

1. Request comes in
2. VM starts
3. My code initializes
4. My request handler runs
5. VM pauses
6. After a few minutes, VM stops

However, if another request comes in between steps 5 and 6, Lambda can resume the paused VM instead and bypass most of the cold start latency associated with steps 2 and 3.

Customers are only billed for steps 3 and 4 (and possibly 2?), so keeping a fleet of lambdas “warm” by sending a bunch of concurrent dummy requests on a schedule can hit a really nice sweet spot for latency/cost. AFAIK I can’t replicate that with Fly at the moment: stopping a Fly machine and restarting it incurs the full cold start latency penalty of steps 2 and 3 every time.
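To put rough numbers on the difference, here is a toy model of the latency a request sees depending on whether the VM is paused or fully stopped. The timings are made up for illustration (they are not measurements of Lambda or Fly), and `Timings` and `request_latency_ms` are just names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Timings:
    vm_start_ms: float   # step 2: VM boots
    code_init_ms: float  # step 3: application code initializes
    handler_ms: float    # step 4: request handler runs
    resume_ms: float     # resuming a paused VM instead of steps 2 and 3


def request_latency_ms(t: Timings, vm_state: str) -> float:
    """Latency seen by a request arriving while the VM is in vm_state."""
    if vm_state == "paused":
        # Resume skips VM start and code init.
        return t.resume_ms + t.handler_ms
    if vm_state == "stopped":
        # Full cold start: steps 2, 3, and 4.
        return t.vm_start_ms + t.code_init_ms + t.handler_ms
    raise ValueError(f"unknown vm_state: {vm_state!r}")


# Hypothetical numbers: slow app init, fast handler.
example = Timings(vm_start_ms=300, code_init_ms=2000, handler_ms=50, resume_ms=10)
warm = request_latency_ms(example, "paused")    # 60
cold = request_latency_ms(example, "stopped")   # 2350
```

With a slow `code_init_ms`, nearly all of the cold-path latency comes from the steps that a resume would skip, which is the whole appeal.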

Would really love to see this addressed so I can move even more stuff off of AWS!
