Autosuspend is here! (+ Machine suspension is enabled everywhere)

Three weeks ago, we announced suspend/resume support for Machines. Back then, we noted that there was no “autosuspend” feature analogous to autostop. No longer!

You can now ask our proxy to automatically suspend, rather than fully stop, your Machines when there is no traffic for them. Like autostop, you’ll save money by not running Machines when they aren’t needed, but your Machines will be able to resume from a snapshot to handle requests rather than doing a cold boot. This means that your users should get a much quicker experience when your Machines autostart.

A few caveats to be aware of:

  • If your Machines are not eligible for suspension—for example, if they exceed the memory limit for the feature—then we’ll fall back to autostopping them. (Check the original Fresh Produce for information about which Machines can be suspended right now.)
  • This is still a new feature, and we don’t have great data yet on how well this works for all apps/frameworks (e.g., how apps with persistent database connections will behave when they’re resumed). So give it a try, but with caution, and do let us know how it goes!

Additionally, we’ve now enabled Machine suspension in all regions, so you can suspend Fly Machines across the globe.


Enabling autosuspend

In fly.toml

:warning: This requires flyctl v0.2.94, which was released on 22 July 2024.

We’ve extended the auto_stop_machines field associated with services in fly.toml. In addition to the existing boolean values true and false, it now supports the following:

  • "off" (equivalent to false): your Machine will not be automatically stopped
  • "stop" (equivalent to true): your Machine will be fully stopped when idle
  • "suspend": your Machine will be suspended when idle

A quick example:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "suspend"
  auto_start_machines = true

After updating your fly.toml, redeploy your app and enjoy!

Using fly machines commands

:warning: This requires flyctl v0.2.94, which was released on 22 July 2024.

The fly machines create, fly machines run, and fly machines update commands have an --autostop flag. In addition to the existing boolean values true and false, the flag now supports the values off, stop, and suspend, with the same meanings as described above, e.g.:

fly machines run --autostop=suspend --port 80/tcp nginx

If you don’t specify a value (--autostop alone), it’s equivalent to --autostop=stop, as it was before.

Through the Machines API

We’ve extended the "autostop" field associated with services in Machine configurations. In addition to the existing boolean values true and false, it now supports the string values "off", "stop", and "suspend" with the same meanings as described above.

Here’s a full example of a typical services configuration for a Machine:

{
  "services": [
    {
      "protocol": "tcp",
      "internal_port": 8080,
      "autostop": "suspend",
      "autostart": true,
      "min_machines_running": 0,
      "ports": [
        {
          "port": 80,
          "handlers": [
            "http"
          ],
          "force_https": true
        },
        {
          "port": 443,
          "handlers": [
            "http",
            "tls"
          ]
        }
      ]
    }
  ]
}
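
For completeness, here is a sketch of how you might apply a config like this to an existing Machine through the Machines API. The api.machines.dev host, the POST-to-the-Machine-URL shape, and the {"config": ...} body are assumptions based on the Machines API docs; the app name, Machine ID, and token variable are placeholders:

```shell
# Hedged sketch: update an existing Machine's services config via the
# Machines API. App name and Machine ID below are placeholders.
APP="my-app"
MACHINE_ID="e784079b"  # placeholder
URL="https://api.machines.dev/v1/apps/${APP}/machines/${MACHINE_ID}"
BODY='{"config":{"services":[{"protocol":"tcp","internal_port":8080,"autostop":"suspend","autostart":true}]}}'
echo "POST ${URL}"
# Uncomment to actually send the request (requires a Machines API token):
# curl -X POST "$URL" \
#   -H "Authorization: Bearer ${FLY_API_TOKEN}" \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
```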

As always, let us know what you think or if you have any questions!

This is… how do the youngins say? - lit yo :fire:!

Is it fair to say that this is equivalent to AWS’s SnapStart? It works great so far for my Next.js and SvelteKit apps. I don’t have a persistent DB connection (it uses HTTP), so I can’t give feedback on that.

BTW, can we get a fly.io configuration SchemaStore update?

This is not quite like SnapStart. My understanding of SnapStart is that they boot a Firecracker VM with your code at deploy time, then snapshot it.

If you run fly deploy against your app and you have suspended Machines, we will stop them, do the update, and leave them in a stopped state. There will be cold-boot overhead the first time they start after an update.

But there might be a quick feature we can ship here. It would make sense to let you start + suspend your machines at deploy time.

Yesterday when suspend was added, I noticed the resume time was around 250ms (pretty fast) and this morning, it’s around 100ms. Coincidence or was there some optimization?

How does LiteFS behave with suspends? Will the database sync with the primary after being resumed?

I started seeing ERROR stdout to vsock zero copy err: Broken pipe (os error 32) after turning on autosuspend/resume. The app still functions, so it might be benign.

Just wanted to chime in that support for machines larger than 1GB of RAM would be huge. Often the slowest apps to boot are the ones with a larger memory footprint. Hope you all are able to unlock that in the future.

Even though the docs say that auto_start_machines defaults to true, I had to add an explicit auto_start_machines = true along with auto_stop_machines = "suspend" to make it actually restart the app on connection.

Thanks all for the feedback and questions! A few responses I can give you off the top of my head:

We didn’t release any optimizations, so this sounds like natural variation. For example, how fast a Machine can get back up and running might depend on how long it takes to configure its network interface on the host side and how much of the Machine’s memory snapshot is still cached. We are thinking about how we can reduce the resume time further or at least make it more consistent, though!

This is a known issue (see also “Current limitations and caveats” in the original Fresh Produce post)—the Machine has to reconnect to a vsock to send logs after it’s resumed. Unfortunately, this means that a few log lines may be dropped immediately after resumption, but it should otherwise be benign, and improving this is on the to-do list.

We’d definitely like to raise it! We started with a 2 GiB limit to be safe, and because we currently write the entire contents of the Machine’s memory to disk each time it is suspended. Firecracker also supports “diff” snapshots that include only the memory that has been written since the last snapshot was taken, which should help make snapshots of large Machines less expensive.

Would it be possible to configure a signal or webhook that would be sent to my app on resume? I’d like to tell it to reconnect the database and wait for time to sync (somehow).

Additionally, ensuring a passing health check before routing requests would be ideal.

BTW, this blocks the request. E.g., when I load from a suspended state, my app responds almost immediately. Then when I refresh, the app hangs for about 3 seconds, spits out the ERROR stdout to vsock zero copy err: Broken pipe (os error 32) log, then responds.

Tried out autosuspend on a couple of apps.

Seems to be working well, much quicker than autostop! Noticing an oddity, though, where a machine remains suspended according to the Fly dashboard but responds to requests just fine? (app: endfield)

A ghost in the machine?

Any idea whether a possible memory limit increase for this feature is on the near/mid-term horizon? I know y’all would like to, and I would love to test it out. Suspend/resume is killer cool, but most cases where I want to see it are >2GB RAM machines.

@MatthewIngwersen Is there a signal that Fly emits before it suspends?

@reconbot2 Since suspended Machines’ clocks get desynced when they resume, you can sort of hack detection of resuming machines.
E.g. in Node:


const INTERVAL_TIME = 1000
let NEXT_TIME = Date.now() + INTERVAL_TIME

setInterval(() => {
  const now = Date.now()
  if (now < NEXT_TIME) {
    // My clock is out of sync, I probably just got unsuspended
  }
  NEXT_TIME = now + INTERVAL_TIME
}, INTERVAL_TIME)

It works pretty well for me, but it’s pretty hacky.

@MatthewIngwersen I’m also interested in learning about any signals that are emitted before suspend.

The docs note that auto-suspend can take “several minutes” to kick in[1] (apparently 5 minutes). I’d like to suspend much sooner! Ideally, if a machine is idle for 5s/10s/30s, I wouldn’t mind it being taken down. Any plans to let users configure this?


  1. auto_stop_machines: The action, if any, that Fly Proxy should take when the app is idle for several minutes. ↩︎

It’s usually about 7 minutes for me. You can always use the Machines API and call .../suspend to force it to suspend.
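
For reference, a sketch of what that suspend call can look like. The api.machines.dev host and the /suspend path follow the Machines API docs; the app name and Machine ID are placeholders:

```shell
# Hedged sketch: force-suspend a Machine via the Machines API suspend
# endpoint. App name and Machine ID below are placeholders.
APP="my-app"
MACHINE_ID="e784079b"  # placeholder
URL="https://api.machines.dev/v1/apps/${APP}/machines/${MACHINE_ID}/suspend"
echo "POST ${URL}"
# Uncomment to actually send it (requires a Machines API token):
# curl -X POST "$URL" -H "Authorization: Bearer ${FLY_API_TOKEN}"
```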

Thanks.

Wasn’t it 60s once? At least that’s what I thought when I removed the code that put machines to sleep (exit) if idle for 60s. And now it’s 5 to 7 minutes… which seems too high for suspend specifically.

@MatthewIngwersen was last seen Aug 30, so I assume he is no longer with us. RIP :headstone: .

I’ve been using autosuspend since it became available, and unfortunately it doesn’t work as well as I had hoped.

  1. There is a bug when waking up a machine. The first request responds almost immediately, but the second gets blocked for up to 3-4 seconds. This defeats the purpose of suspend.
  2. The Linux kernel sucks :cherries: at memory management, especially when there’s no swap available (you can’t use swap_memory_mb with suspend).

Issue 1 isn’t a huge deal breaker, but issue 2:

What happens is the kernel tries to be clever, uses as much page cache as it thinks is optimal, and leaves only 2-10MB of free memory. Then, when your app gets a request and there isn’t enough memory, the kernel starts to struggle: a simple request that usually takes 50ms ends up taking 10 seconds to respond.
Even if you increase vfs_cache_pressure or manually evict memory, it is either too late or it doesn’t free up any of the page cache.

By opting out of suspend and adding only 128MB of swap, my app (under the same memory usage) runs fine, only bottlenecked by CPU.
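
For anyone wanting to replicate this, the swap setting lives in fly.toml; a minimal sketch (assuming swap_size_mb is the relevant key for guest swap):

```toml
# fly.toml - hedged sketch: give the Machine 128 MB of swap
swap_size_mb = 128
```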

Just my little update, thanks for reading :open_book:

Given the recent CPU accounting changes (ref), I’d be very wary of relying on swap on shared-x machines, as it can put a lot of the already-depleted CPU slices toward managing memory instead of doing actual work.
