Machines can now start from the suspending state

When a Fly Machine suspends, it can spend over 15 seconds in the suspending state (particularly if it has lots of memory). During this time, the machine is in limbo: it can’t handle requests, but it also can’t be started.

If your app only has a small number of machines, requests arriving while machines are suspending can suffer high latency or fail entirely because we can’t start a machine quickly enough to serve them. This is particularly painful if you use auto_stop_machines = "suspend" to automatically scale your app to zero.

We’ve fixed this with interruptible suspend: Fly Machines can now start directly from the suspending state. This works with both Fly Proxy’s auto_start_machines and via the Machines API.
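For the Machines API path, here is a minimal sketch of building a start request in Python. The base URL, the /start endpoint path, and the bearer-token header reflect the public Machines API as we understand it and should be checked against the current docs; the helper name build_start_request and the app and machine values are purely illustrative.

```python
import urllib.request

# Assumed public Machines API base URL; verify against the current docs.
API_BASE = "https://api.machines.dev/v1"


def build_start_request(app_name: str, machine_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking a machine to start.

    Hypothetical helper for illustration only. With interruptible suspend,
    the same request should also work while the machine is still suspending.
    """
    url = f"{API_BASE}/apps/{app_name}/machines/{machine_id}/start"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
```

To actually send it, pass the returned object to urllib.request.urlopen with a real app name, machine ID, and API token.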

How does it work?

The slowest part of suspend is copying a snapshot of the machine’s memory to disk. During this process, the Firecracker VM is paused to ensure a consistent memory snapshot, but the Firecracker process itself keeps running.

Now, when a start request comes in during suspension, we cancel the memory copy operation and tell the existing Firecracker process to resume execution, as if the suspend never happened. There is no need to wait for suspension to complete or to start a new Firecracker process.
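The idea can be illustrated with a toy model in Python. This is not Fly's implementation: the ToyMachine class, its state names, and the chunked "memory copy" loop are all assumptions made up for the sketch. It only shows the general pattern of a cancellable background copy, where a start request during suspension cancels the copy and resumes the existing process in place.

```python
import threading
from enum import Enum


class State(Enum):
    STARTED = "started"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"


class ToyMachine:
    """Toy model of interruptible suspend (illustrative only)."""

    def __init__(self, memory_chunks: int, chunk_seconds: float = 0.01):
        self.state = State.STARTED
        self.memory_chunks = memory_chunks
        self.chunk_seconds = chunk_seconds
        self.chunks_copied = 0
        self._cancel = threading.Event()
        self._lock = threading.Lock()
        self._worker = None

    def suspend(self) -> None:
        with self._lock:
            if self.state is not State.STARTED:
                return
            # The VM is "paused" here, but the process stays alive.
            self.state = State.SUSPENDING
            self._cancel.clear()
            self.chunks_copied = 0
            self._worker = threading.Thread(target=self._copy_memory)
            self._worker.start()

    def _copy_memory(self) -> None:
        for _ in range(self.memory_chunks):
            # wait() returns True as soon as a cancel is requested.
            if self._cancel.wait(self.chunk_seconds):
                return
            self.chunks_copied += 1
        with self._lock:
            # Only finish the suspend if nobody cancelled in the meantime.
            if not self._cancel.is_set():
                self.state = State.SUSPENDED

    def start(self) -> str:
        with self._lock:
            if self.state is State.STARTED:
                return "already running"
            interrupted = self.state is State.SUSPENDING
            if interrupted:
                self._cancel.set()  # cancel the in-flight memory copy
        self._worker.join()
        with self._lock:
            self.state = State.STARTED
        return "resumed in place" if interrupted else "restored from snapshot"
```

Calling start() shortly after suspend() on a machine with many chunks returns "resumed in place" without waiting for the copy to finish, while calling it after the copy has completed takes the slower "restored from snapshot" path.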
