New feature in preview: suspend/resume for Machines

We know that making your Machines boot faster matters! After all, every second spent starting up your app is another second that your users have to spend waiting. Fly Machines already boot pretty quickly—fast enough to make automatic starts and stops work effectively. But still, it can easily take two seconds for a Machine running a typical Rails app to go from the stopped state to being ready to handle HTTP requests. It’s not an eternity, but there’s a lot of room for improvement.

Some of you may know that the hypervisor we use, Firecracker, allows you to “snapshot” a virtual machine. This means pausing it and dumping all of its state (including its memory) to persistent storage. Later, you can load the snapshot back into Firecracker, and your virtual machine will resume exactly where it left off, as if nothing had happened. (It’s even possible for network connections to stay intact if the other side doesn’t close them!) There’s no need to boot the Linux kernel or start up your app’s runtime, meaning that it can be much faster than rebooting.

We’re implementing this for Fly Machines! You’ll be able to suspend a Fly Machine, rather than stop it, and the next start will resume the Machine from the snapshot taken. Consequently, your app can be ready to serve new requests within a few hundred milliseconds.

We’re still iterating on this feature, but we’re excited to tell you that we’ve enabled it for you to try in a handful of regions to start:

  • Bogotá, Colombia (bog)
  • Guadalajara, Mexico (gdl)
  • Johannesburg, South Africa (jnb)
  • Bucharest, Romania (otp)
  • Phoenix, Arizona, United States (phx)

Suspending a Machine

From the CLI

There’s a new command introduced in flyctl v0.2.71 (released June 17):

fly machines suspend <ID>

Now, if you run fly machines status <ID>, you’ll either see that the Machine is in the suspended state, or perhaps that it is still in the process of suspending.

From the Machines API

Send a POST request to the new Machine suspension endpoint, which is documented in the Machines API reference.

Calling this endpoint kicks off the suspension process, but it might take a few seconds to complete. The wait-for-state endpoint now accepts suspended as a target state if you’d like to wait for it to finish.
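As a rough illustration of the suspend-then-wait flow, here is a minimal Python sketch. It assumes the public Machines API base URL `https://api.machines.dev/v1`, a suspend path of `/apps/{app}/machines/{id}/suspend`, and `state` and `timeout` query parameters on the wait endpoint; check the API reference before relying on any of these. The helper names are made up for this example.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.machines.dev/v1"  # assumption: public Machines API base URL


def _request(method, path, token, query=None):
    """Build (but do not send) an authenticated Machines API request."""
    url = f"{API_BASE}{path}"
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"}
    )


def suspend_request(app, machine_id, token):
    # Assumed path for the suspension endpoint.
    return _request("POST", f"/apps/{app}/machines/{machine_id}/suspend", token)


def wait_request(app, machine_id, token, state="suspended", timeout_s=30):
    # Assumed query parameters: target state and a timeout in seconds.
    return _request(
        "GET",
        f"/apps/{app}/machines/{machine_id}/wait",
        token,
        {"state": state, "timeout": timeout_s},
    )


def suspend_and_wait(app, machine_id, token):
    """Kick off suspension, then block until the Machine reports 'suspended'.

    urlopen raises urllib.error.HTTPError on a non-2xx response.
    """
    for req in (
        suspend_request(app, machine_id, token),
        wait_request(app, machine_id, token),
    ):
        with urllib.request.urlopen(req, timeout=60) as resp:
            resp.read()
```

Calling `suspend_and_wait("my-app", "<ID>", token)` would send both requests in order; the two builder functions can be inspected on their own without touching the network.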

Resuming a suspended Machine

This one’s easy—start it as usual with fly machines start <ID> or the Machines API’s start endpoint. Machines in the suspended state will attempt to resume from a snapshot, and will fall back to a cold start if for some reason this isn’t possible.

Additionally, if you have automatic start enabled, then Fly Proxy will resume your suspended Machines when they are needed to handle incoming requests.
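If you want to compare a resume against a cold start yourself, one option is to start the Machine and time how long it takes before your app answers. The sketch below is only an illustration: the base URL and start path are assumptions about the Machines API, and `seconds_until_ready` and `http_probe` are hypothetical helpers, not part of flyctl or any Fly library.

```python
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.machines.dev/v1"  # assumption: public Machines API base URL


def start_request(app: str, machine_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that starts or resumes a Machine."""
    url = f"{API_BASE}/apps/{app}/machines/{machine_id}/start"  # assumed path
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {token}"}
    )


def seconds_until_ready(probe: Callable[[], bool], timeout_s: float = 30.0,
                        interval_s: float = 0.05) -> float:
    """Call probe() until it returns True and return the elapsed seconds.

    Raises TimeoutError if the probe has not succeeded after timeout_s.
    """
    began = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - began
        if time.monotonic() - began >= timeout_s:
            raise TimeoutError("Machine did not become ready in time")
        time.sleep(interval_s)


def http_probe(url: str) -> Callable[[], bool]:
    """Hypothetical readiness check: True once url answers with a 2xx status."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return 200 <= resp.status < 300
        except OSError:  # covers URLError, HTTPError, and socket timeouts
            return False
    return probe
```

In practice you would send `start_request(...)` with `urllib.request.urlopen` and then call `seconds_until_ready(http_probe("https://<your-app-url>/"))`, once from the suspended state and once from the stopped state.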

Forcing a suspended Machine to do a cold boot

You can use fly machines stop <ID> or the Machines API’s stop endpoint to convert a suspended Machine into a stopped one. The Machine’s snapshot will be thrown away, and the Machine will have to do a cold start the next time that it’s started.

Updating a suspended Machine

When you deploy your app, suspended Machines are treated as if they are stopped. Their snapshots are thrown away, and they’ll be cold-started with the updates that you’ve made.

Important: snapshots are disposable

We do not guarantee that a suspended Machine will ever resume from its snapshot; it’s possible that it will perform a cold start instead. For example, this may happen when we have to migrate a Machine to a different host to find space for it to run.

We do ensure that if a Machine performs a cold start, then any existing snapshot is invalidated. Put another way, a Machine cannot “go back in time” by resuming from a snapshot made before it last did a cold start.

We also ensure that a Machine cannot resume from a single snapshot more than once. Believe it or not, this is actually a security consideration! You can read the technical details over in Firecracker’s documentation.
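To make these guarantees concrete, here is a toy state model in Python. It is purely illustrative and says nothing about how the platform is actually implemented; the class, its fields, and the `can_resume` flag are all invented for this example. It encodes three rules from above: a start may fall back to a cold start, a cold start invalidates any existing snapshot, and a snapshot is consumed by the resume that uses it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMachine:
    state: str = "started"
    snapshot: Optional[str] = None
    boots: int = 0  # number of cold starts performed so far

    def suspend(self) -> None:
        if self.state != "started":
            raise ValueError("only a started Machine can be suspended")
        self.snapshot = f"snapshot-after-boot-{self.boots}"
        self.state = "suspended"

    def stop(self) -> None:
        # Stopping throws away any snapshot, forcing a cold start next time.
        self.snapshot = None
        self.state = "stopped"

    def start(self, can_resume: bool = True) -> str:
        if self.state == "started":
            raise ValueError("Machine is already started")
        # The snapshot never survives a start: it is either consumed by the
        # resume or invalidated by the cold start.
        snapshot, self.snapshot = self.snapshot, None
        self.state = "started"
        if snapshot is not None and can_resume:
            return "resumed"
        self.boots += 1
        return "cold start"
```

Because `start` always clears `snapshot`, the model has no way to resume twice from the same snapshot or to resume from one taken before the latest cold start.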

Current limitations and caveats

There are some restrictions on what Machines can be suspended:

  • To be suspended, your Machines must have been updated since 20 June 2024 at 20:00 UTC. Don’t worry too much: the suspend request will tell you if this isn’t the case! Use fly machines update --yes <ID> or re-deploy your app if you run into this.
  • Machines must have 2 GiB of memory or less.
  • Machines must not have swap enabled.
  • Machines must not have a schedule.
  • Machines with GPUs cannot be suspended.
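The restrictions above can be summed up as a small pre-flight check. This is a sketch only: `MachineInfo` and its field names are hypothetical and do not mirror the Machines API config schema, so you would need to map your own Machine data onto it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

CUTOFF = datetime(2024, 6, 20, 20, 0, tzinfo=timezone.utc)
MAX_MEMORY_MB = 2048  # 2 GiB


@dataclass
class MachineInfo:
    # Hypothetical, flattened view of a Machine for this example.
    updated_at: datetime
    memory_mb: int
    swap_mb: int = 0
    schedule: Optional[str] = None
    gpus: int = 0


def suspend_blockers(m: MachineInfo) -> List[str]:
    """Return the reasons (if any) that this Machine cannot be suspended."""
    blockers = []
    if m.updated_at < CUTOFF:
        blockers.append("not updated since 20 June 2024 at 20:00 UTC")
    if m.memory_mb > MAX_MEMORY_MB:
        blockers.append("more than 2 GiB of memory")
    if m.swap_mb > 0:
        blockers.append("swap is enabled")
    if m.schedule is not None:
        blockers.append("has a schedule")
    if m.gpus > 0:
        blockers.append("has a GPU")
    return blockers
```

An empty list means none of the listed restrictions apply; it does not guarantee that a suspend request will succeed.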

Furthermore, there are some rough edges to be aware of:

  • There is no “auto-suspend” feature analogous to auto-stop yet.
  • You will lose some log lines after a Machine is resumed.
  • When resumed, your Machine may take a few seconds to update its clock, so for the first few seconds it will think that it’s in the past.

We hope to address these soon!

Billing

For now, suspended Machines are billed just like stopped Machines.


Let us know here if you have any questions. We’re excited to see what you’ll build with this!


How long will we have to wait for this feature to replace auto-start/stop?

As soon as we can do so safely. There are a few things we need to fix and test before the proxy can use it reliably.


If this were possible for GPU Machines, it would be amazing. Auto-stopped GPU Machines are quite slow to start because loading the models onto the GPU easily takes 30-40 seconds. If we could cut that time to a few seconds, it would be just incredible.


This feature would be a game changer for GPU Machines. Loading models in PyTorch, for example, is currently very slow.

You could even use DMA to load GPU memory directly from disk.

Noted. We use Cloud Hypervisor instead of Firecracker on our GPU Machines, so it’s not a matter of just enabling it. We’ll see what we can do.
