When a Fly Machine suspends, it can spend over 15 seconds in the suspending state (particularly if it has lots of memory). During this time, the machine is in limbo: it can’t handle requests, but it also can’t be started.
If your app only has a small number of machines, requests arriving while machines are suspending can suffer high latency or fail entirely because we can’t start a machine quickly enough to serve them. This is particularly painful if you use auto_stop_machines = suspend to automatically scale your app to zero.
We’ve fixed this with interruptible suspend: Fly Machines can now start directly from the suspending state. This works both with Fly Proxy’s auto_start_machines and via the Machines API.
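For the Machines API path, a start request for a machine that is still suspending looks the same as any other start request. Here is a minimal sketch that only builds the request, assuming the public Machines API base URL and its start endpoint; the helper name build_start_request, the app name, the machine ID, and the token are placeholders for illustration.

```python
import urllib.request

# Assumed base URL of the public Machines API.
API_BASE = "https://api.machines.dev/v1"


def build_start_request(app_name: str, machine_id: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks a machine to start."""
    url = f"{API_BASE}/apps/{app_name}/machines/{machine_id}/start"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_token}"},
    )
```

To actually send it, pass the returned object to urllib.request.urlopen with a real app name, machine ID, and API token.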
How does it work?
The slowest part of suspend is copying a snapshot of the machine’s memory to disk. During this process, the Firecracker VM is paused to ensure a consistent memory snapshot, but the Firecracker process itself keeps running.
Now, when a start request comes in during suspension, we cancel the memory copy operation and tell the existing Firecracker process to resume execution, as if the suspend had never happened. There's no need to wait for suspension to complete or to start a new Firecracker process.
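The behavior described above can be pictured as a small state machine. The sketch below is a toy model, not our actual implementation: the ToyMachine class, its page counter, and the returned strings are all invented for illustration. It shows only the key idea, which is that a start during suspending cancels the copy and resumes the paused VM in place.

```python
from enum import Enum


class State(Enum):
    STARTED = "started"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"


class ToyMachine:
    """Toy model of a machine's suspend lifecycle (hypothetical, for illustration)."""

    def __init__(self, memory_pages: int) -> None:
        self.memory_pages = memory_pages
        self.pages_copied = 0
        self.vm_paused = False
        self.state = State.STARTED

    def begin_suspend(self) -> None:
        # Pause the VM so the memory snapshot is consistent, then begin copying.
        self.vm_paused = True
        self.pages_copied = 0
        self.state = State.SUSPENDING

    def copy_step(self, pages: int) -> None:
        # Simulate copying one batch of memory pages to disk.
        if self.state is not State.SUSPENDING:
            return  # copy was cancelled or has already finished
        self.pages_copied = min(self.memory_pages, self.pages_copied + pages)
        if self.pages_copied == self.memory_pages:
            self.state = State.SUSPENDED

    def start(self) -> str:
        if self.state is State.SUSPENDING:
            # Interruptible suspend: cancel the copy and resume the same process.
            # (The toy model just resets its progress counter.)
            self.pages_copied = 0
            self.vm_paused = False
            self.state = State.STARTED
            return "resumed in place"
        if self.state is State.SUSPENDED:
            # Suspend completed: a new process is started from the snapshot.
            self.vm_paused = False
            self.state = State.STARTED
            return "restored from snapshot"
        return "already running"
```

For example, calling begin_suspend(), then copy_step(40) on a 100-page machine, then start() returns "resumed in place" and leaves the machine in the started state without ever reaching suspended.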