Hi! Knowing that Machines is built on top of Firecracker, are there plans for supporting snapshotting a machine, restoring it at a later time, and/or cloning a snapshot to spawn new pre-heated machines?
It looks like Machines can currently be stopped and restarted, but a restart is a full reboot. From the docs:
Machines that are restarted are completely reset to their original state so that they start clean on the next run.
Firecracker is capable of snapshotting and restoring guest memory, so programs that were running when the machine was paused resume where they left off.
Pausing and resuming to save money when idle would be much more valuable to me if it were done with snapshots rather than reboots. My servers have long start times, so starting from a fresh boot is slow, whereas resuming from a snapshot could be effectively transparent.
Further, being able to spawn new machines by cloning a saved snapshot would allow me to scale horizontally much more easily.
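For reference, here is a rough sketch of the pause / snapshot / restore sequence at the Firecracker level. This targets Firecracker's own API socket, not the Fly Machines API, which as far as I can tell doesn't expose any of it. The endpoints and field names are as I remember them from Firecracker's snapshot docs and may differ between versions; the file paths and helper function names are made up for illustration.

```python
import json

# Each helper returns (HTTP method, path, JSON body) for a request that would
# be sent to the Firecracker API socket. Nothing is actually sent here.

def pause_request():
    # A microVM has to be paused before a snapshot can be taken.
    return ("PATCH", "/vm", {"state": "Paused"})

def create_snapshot_request(snapshot_path, mem_file_path):
    # Writes the VM state and the guest memory to two separate files.
    return ("PUT", "/snapshot/create", {
        "snapshot_type": "Full",
        "snapshot_path": snapshot_path,
        "mem_file_path": mem_file_path,
    })

def load_snapshot_request(snapshot_path, mem_file_path, resume=True):
    # Sent to a fresh Firecracker process. Loading the same files into
    # several fresh processes is what "cloning" would amount to.
    return ("PUT", "/snapshot/load", {
        "snapshot_path": snapshot_path,
        "mem_backend": {"backend_path": mem_file_path, "backend_type": "File"},
        "resume_vm": resume,
    })

if __name__ == "__main__":
    # Hypothetical paths, purely for illustration.
    steps = [
        pause_request(),
        create_snapshot_request("/tmp/vm.snap", "/tmp/vm.mem"),
        load_snapshot_request("/tmp/vm.snap", "/tmp/vm.mem"),
    ]
    for method, path, body in steps:
        print(method, path, json.dumps(body))
```

Clones restored from the same snapshot would presumably also need distinct network identities and fresh randomness, which this sketch ignores.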
Cheers, I always look forward to seeing what you folks come up with next!
FWIW, this is the last thing that’s still keeping me on Lambda for some workloads.
As far as I understand, the Lambda lifecycle is something along the lines of:
1. Request comes in
2. VM starts
3. My code initializes
4. My request handler runs
5. VM pauses
6. After a few minutes, VM stops
However, if another request comes in between steps 5 and 6, Lambda can resume the machine instead and bypass most of the cold start latency associated with steps 2 and 3.
Customers are only billed for steps 3 and 4 (and possibly 2?), so keeping a fleet of Lambdas “warm” with a bunch of concurrent dummy requests on a schedule can be a really nice sweet spot for latency and cost. AFAIK I can’t replicate that with Fly at the moment: stopping a Fly machine and restarting it incurs the full cold start latency penalty of steps 2 and 3 every time.
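To make the difference concrete, here is a toy model of the lifecycle above for a single VM handling non-overlapping requests. All the timings are made-up placeholders, not measurements from Lambda or Fly, and the names (`Timings`, `latencies`) are just for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Timings:
    # All values are in seconds and are invented for illustration.
    vm_start: float = 0.5       # step 2
    code_init: float = 8.0      # step 3 (a server with a long start time)
    handler: float = 0.1        # step 4
    resume: float = 0.01        # resuming a paused VM
    keep_paused: float = 300.0  # how long a paused VM lingers before step 6

def latencies(arrivals, t, can_pause):
    """Per-request latency for sorted, non-overlapping arrival times."""
    out = []
    paused_at = None  # when the VM last went idle; None if it never ran
    for now in arrivals:
        # Warm only if the platform pauses VMs and this request arrives
        # between step 5 (pause) and step 6 (stop).
        warm = (can_pause and paused_at is not None
                and 0 <= now - paused_at <= t.keep_paused)
        startup = t.resume if warm else t.vm_start + t.code_init
        latency = startup + t.handler
        out.append(latency)
        paused_at = now + latency
    return out

if __name__ == "__main__":
    arrivals = [0, 60, 1000]
    t = Timings()
    print([round(x, 2) for x in latencies(arrivals, t, can_pause=True)])
    # prints [8.6, 0.11, 8.6]
    print([round(x, 2) for x in latencies(arrivals, t, can_pause=False)])
    # prints [8.6, 8.6, 8.6]
```

With pausing, only the first request and the one arriving after the idle window pay for steps 2 and 3. Without it, every request after a stop pays the full cold start.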
Would really love to see this addressed so I can move even more stuff off of AWS!