Fly machine snapshot + restore?

Hi! Knowing that Machines is built on top of Firecracker, are there plans for supporting snapshotting a machine, restoring it at a later time, and/or cloning a snapshot to spawn new pre-heated machines?

It looks like Machines can currently be stopped and restarted, but when they are, they’re rebooted. From the docs:

Machines that are restarted are completely reset to their original state so that they start clean on the next run.

Firecracker is capable of snapshotting+restoring guest memory, so programs running when the machine is paused are resumed from where they left off.
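
For reference, here is a rough sketch of what pause, snapshot, and restore look like at the Firecracker level. The endpoints and field names are as I recall them from Firecracker's snapshot API docs and may differ between versions. The file paths are made up, and none of this says anything about how Fly would expose the feature. The sketch only builds the requests and prints them; it does not talk to a Firecracker API socket.

```python
import json


def snapshot_requests(snapshot_path, mem_path):
    """Return the (method, path, body) calls for pause -> snapshot -> restore.

    Assumption: endpoint and field names follow Firecracker's snapshot API
    as I remember it; check the docs for the version you run.
    """
    return [
        # Pause the running microVM before taking a snapshot.
        ("PATCH", "/vm", {"state": "Paused"}),
        # Write the VM state and the guest memory to two files.
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,
            "mem_file_path": mem_path,
        }),
        # Sent later to a fresh Firecracker process: load and resume.
        ("PUT", "/snapshot/load", {
            "snapshot_path": snapshot_path,
            "mem_backend": {"backend_type": "File", "backend_path": mem_path},
            "resume_vm": True,
        }),
    ]


# Hypothetical paths, for illustration only.
for method, path, body in snapshot_requests("/tmp/vm.snap", "/tmp/vm.mem"):
    print(method, path, json.dumps(body))
```

Cloning would amount to sending the last request to several fresh Firecracker processes pointing at the same snapshot files, though networking and unique identity per clone need separate handling.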

Pausing/resuming to save money when idle would be much more valuable for me if done with snapshots (rather than rebooting) because my servers have long start times, so starting from a fresh boot would be slow, whereas resuming from snapshot could be effectively transparent.

Further, being able to spawn new machines by cloning a saved snapshot would allow me to scale horizontally much more easily.

Cheers, I always look forward to seeing what you folks come up with next!


Agreed.

And yeah, that’s planned for sometime this year™️: see “VM Snapshot vs Volume Snapshot”, post #3 by kurt.


FWIW, this is the last thing that’s still keeping me on Lambda for some workloads.

As far as I understand, the Lambda lifecycle goes something like this:

  1. Request comes in
  2. VM starts
  3. My code initializes
  4. My request handler runs
  5. VM pauses
  6. After a few minutes, VM stops

However, if another request comes in between steps 5 and 6, Lambda can resume the paused VM instead and bypass most of the cold start latency associated with steps 2 and 3.
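
To make the lifecycle above concrete, here is a toy model of it. It is only a sketch of my understanding as described in this post, not of Lambda's actual internals: the `Instance` class, the `idle_timeout` and `init_time` parameters, and all the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    idle_timeout: float  # seconds a paused VM is kept before it is stopped (step 6)
    init_time: float     # seconds spent on VM start + code init (steps 2 and 3)
    paused_at: Optional[float] = None  # None means no VM exists yet

    def handle(self, now: float, handler_time: float) -> str:
        """Serve one request arriving at `now`; return "cold" or "warm"."""
        warm = (
            self.paused_at is not None
            and now - self.paused_at < self.idle_timeout
        )
        # A cold start pays for steps 2 and 3; a warm resume goes straight to step 4.
        startup = 0.0 if warm else self.init_time
        # Step 5: the VM pauses once the handler has finished.
        self.paused_at = now + startup + handler_time
        return "warm" if warm else "cold"


inst = Instance(idle_timeout=300.0, init_time=5.0)
print([inst.handle(t, handler_time=0.1) for t in (0.0, 60.0, 1000.0)])
# prints ['cold', 'warm', 'cold']
```

The second request lands while the VM is still paused, so it skips steps 2 and 3. The third arrives after the idle timeout, so it starts cold again.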

Customers are only billed for steps 3 and 4 (and possibly 2?), so keeping a fleet of Lambdas “warm” by sending a bunch of concurrent dummy requests on a schedule can hit a really nice sweet spot for latency/cost. AFAIK I can’t replicate that with Fly at the moment: stopping a Fly machine and restarting it incurs the full cold start latency penalty of steps 2 and 3 every time.
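
A back-of-the-envelope version of that billing argument is below. It assumes the model described in this post (billed for step 4 always, and for step 3 only on a cold start) rather than anything taken from AWS pricing docs, and the durations are invented.

```python
def billed_seconds(starts, init_time, handler_time):
    """Total billed time: step 4 on every request, step 3 only on cold starts."""
    return sum(
        handler_time + (init_time if kind == "cold" else 0.0)
        for kind in starts
    )


# Hypothetical numbers: 5 s of init, 0.1 s per request.
kept_warm = billed_seconds(["cold", "warm", "warm"], init_time=5.0, handler_time=0.1)
always_cold = billed_seconds(["cold", "cold", "cold"], init_time=5.0, handler_time=0.1)
print(f"kept warm: {kept_warm:.1f}s, always cold: {always_cold:.1f}s")
```

With long init times, the gap between the two cases grows linearly with the number of requests, which is why resuming from a snapshot matters so much here.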

Would really love to see this addressed so I can move even more stuff off of AWS!