Autosuspend is here! (+ Machine suspension is enabled everywhere)

Three weeks ago, we announced suspend/resume support for Machines. Back then, we noted that there was no “autosuspend” feature analogous to autostop. No longer!

You can now ask our proxy to automatically suspend, rather than fully stop, your Machines when there is no traffic for them. Like autostop, you’ll save money by not running Machines when they aren’t needed, but your Machines will be able to resume from a snapshot to handle requests rather than doing a cold boot. This means that your users should get a much quicker experience when your Machines autostart.

A few caveats to be aware of:

  • If your Machines are not eligible for suspension—for example, if they exceed the memory limit for the feature—then we’ll fall back to autostopping them. (Check the original Fresh Produce for information about which Machines can be suspended right now.)
  • This is still a new feature, and we don’t have great data yet on how well this works for all apps/frameworks (e.g., how apps with persistent database connections will behave when they’re resumed). So give it a try, but with caution, and do let us know how it goes!

Additionally, we’ve now enabled Machine suspension in all regions, so you can suspend Fly Machines across the globe.


Enabling autosuspend

In fly.toml

:warning: This requires flyctl v0.2.94, which was released on 22 July 2024.

We’ve extended the auto_stop_machines field associated with services in fly.toml. In addition to the existing boolean values true and false, it now supports the following:

  • "off" (equivalent to false): your Machine will not be automatically stopped
  • "stop" (equivalent to true): your Machine will be fully stopped when idle
  • "suspend": your Machine will be suspended when idle

A quick example:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "suspend"
  auto_start_machines = true

After updating your fly.toml, redeploy your app and enjoy!

Using fly machines commands

:warning: This requires flyctl v0.2.94, which was released on 22 July 2024.

The fly machines create, fly machines run, and fly machines update commands have an --autostop flag. In addition to the existing boolean values true and false, the flag now supports the values off, stop, and suspend, with the same meanings as described above, e.g.:

fly machines run --autostop=suspend --port 80/tcp nginx

If you don’t specify a value (--autostop alone), it’s equivalent to --autostop=stop, as it was before.

Through the Machines API

We’ve extended the "autostop" field associated with services in Machine configurations. In addition to the existing boolean values true and false, it now supports the string values "off", "stop", and "suspend" with the same meanings as described above.

Here’s a full example of a typical services configuration for a Machine:

{
  "services": [
    {
      "protocol": "tcp",
      "internal_port": 8080,
      "autostop": "suspend",
      "autostart": true,
      "min_machines_running": 0,
      "ports": [
        {
          "port": 80,
          "handlers": [
            "http"
          ],
          "force_https": true
        },
        {
          "port": 443,
          "handlers": [
            "http",
            "tls"
          ]
        }
      ]
    }
  ]
}
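
For completeness, here is a sketch of how you might apply a config like this to an existing Machine through the Machines API. The api.machines.dev host, the POST-to-the-Machine-URL shape, and the {"config": ...} body are assumptions based on the Machines API docs; the app name, Machine ID, and token variable are placeholders:

```shell
# Hedged sketch: update an existing Machine's services config via the
# Machines API. App name and Machine ID below are placeholders.
APP="my-app"
MACHINE_ID="e784079b"  # placeholder
URL="https://api.machines.dev/v1/apps/${APP}/machines/${MACHINE_ID}"
BODY='{"config":{"services":[{"protocol":"tcp","internal_port":8080,"autostop":"suspend","autostart":true}]}}'
echo "POST ${URL}"
# Uncomment to actually send the request (requires a Machines API token):
# curl -X POST "$URL" \
#   -H "Authorization: Bearer ${FLY_API_TOKEN}" \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
```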

As always, let us know what you think or if you have any questions!

This is… how do the youngins say? - lit yo :fire:!

Is it fair to say that this is equivalent to AWS’s SnapStart? It works great so far for my Next.js and SvelteKit apps. I don’t have a persistent DB connection (it uses HTTP), so I can’t give feedback on that.

BTW, can we get a fly.io configuration SchemaStore update?

This is not quite like SnapStart. My understanding of SnapStart is that they boot a Firecracker VM with your code at deploy time, then snapshot it.

If you run fly deploy against your app and you have suspended Machines, we will stop them, do the update, and leave them in a stopped state. There will be cold-boot overhead the first time they start after an update.

But there might be a quick feature we can ship here. It would make sense to let you start + suspend your machines at deploy time.

Yesterday when suspend was added, I noticed the resume time was around 250ms (pretty fast) and this morning, it’s around 100ms. Coincidence or was there some optimization?

How does LiteFS behave with suspends? Will the database sync with the primary after being resumed?

I started seeing ERROR stdout to vsock zero copy err: Broken pipe (os error 32) after turning on autosuspend/resume. The app still functions, so it might be benign.

Just wanted to chime in that support for machines larger than 1GB of RAM would be huge. Often the slowest apps to boot are the ones with a larger memory footprint. Hope you all are able to unlock that in the future.

Even though the docs say that auto_start_machines defaults to true, I had to add an explicit auto_start_machines = true along with auto_stop_machines = "suspend" to make it actually restart the app on connection.

Thanks all for the feedback and questions! A few responses I can give you off the top of my head:

We didn’t release any optimizations, so this sounds like natural variation. For example, how fast a Machine can get back up and running might depend on how long it takes to configure its network interface on the host side and how much of the Machine’s memory snapshot is still cached. We are thinking about how we can reduce the resume time further or at least make it more consistent, though!

This is a known issue (see also “Current limitations and caveats” in the original Fresh Produce post)—the Machine has to reconnect to a vsock to send logs after it’s resumed. Unfortunately, this means that a few log lines may be dropped immediately after resumption, but it should otherwise be benign, and improving this is on the to-do list.

We’d definitely like to raise it! We started with a 2 GiB limit to be safe, and because we currently write the entire contents of the Machine’s memory to disk each time it is suspended. Firecracker also supports “diff” snapshots that include only the memory that has been written since the last snapshot was taken, which should help make snapshots of large Machines less expensive.

Would it be possible to configure a signal or webhook that would be sent to my app on resume? I’d like to tell it to reconnect the database and wait for time to sync (somehow).

Additionally, ensuring a passing health check before routing requests would be ideal.

BTW, this blocks the request. E.g., when I load from a suspended state, my app responds almost immediately. Then when I refresh, the app hangs for about 3 seconds, spits out the ERROR stdout to vsock zero copy err: Broken pipe (os error 32) log, then responds.

Tried out autosuspend on a couple of apps.

Seems to be working well, much quicker than autostop! Noticing an oddity, though, where a machine remains suspended according to the Fly dashboard but responds to requests just fine? (app: endfield)

A ghost in the machine?

Any idea whether a possible memory limit increase for this feature is on the near/mid-term horizon? I know y’all would like to, and I would love to test it out. Suspend/resume is killer cool, but most cases where I want to see it are >2GB RAM machines.

@MatthewIngwersen Is there a signal that Fly emits before it suspends?

@reconbot2 Since suspended Machines’ clocks get desynced when they resume, you can sort of hack detection of resuming machines.
E.g. in Node:


const INTERVAL_TIME = 1000
let NEXT_TIME = Date.now() + INTERVAL_TIME

setInterval(() => {
  const now = Date.now()
  if (now < NEXT_TIME) {
    // My clock is out of sync, I probably just got unsuspended
  }
  NEXT_TIME = now + INTERVAL_TIME
}, INTERVAL_TIME)

It works pretty well for me, but it’s pretty hacky.

@MatthewIngwersen I’m also interested in learning about any signals that are emitted before suspend.

The docs note that auto-suspend can take “several minutes” to kick in[1] (apparently 5 minutes). I’d like to suspend much sooner! Ideally, if a machine is idle for 5s/10s/30s, I wouldn’t mind it being taken down. Any plans to let users configure this?


  1. auto_stop_machines: The action, if any, that Fly Proxy should take when the app is idle for several minutes. ↩︎

It’s usually about 7 minutes for me. You can always use the Machines API and call .../suspend to force it to suspend.
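
For reference, a sketch of what that suspend call can look like. The api.machines.dev host and the /suspend path follow the Machines API docs; the app name and Machine ID are placeholders:

```shell
# Hedged sketch: force-suspend a Machine via the Machines API suspend
# endpoint. App name and Machine ID below are placeholders.
APP="my-app"
MACHINE_ID="e784079b"  # placeholder
URL="https://api.machines.dev/v1/apps/${APP}/machines/${MACHINE_ID}/suspend"
echo "POST ${URL}"
# Uncomment to actually send it (requires a Machines API token):
# curl -X POST "$URL" -H "Authorization: Bearer ${FLY_API_TOKEN}"
```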

Thanks.

Wasn’t it 60s once? At least that’s what I thought when I removed the code that put machines to sleep (exit) if idle for 60s. And now it’s 5 to 7 minutes… which seems too high for suspend specifically.

@MatthewIngwersen was last seen Aug 30, so I assume he is no longer with us. RIP :headstone: .

I’ve been using autosuspend since it became available, and unfortunately it doesn’t work as well as I had hoped.

  1. There is a bug when waking up a machine. The first request responds almost immediately, but the second gets blocked for up to 3-4 seconds. This defeats the purpose of suspend.
  2. The Linux kernel sucks :cherries: at memory management, especially when there’s no swap available (you can’t use swap_memory_mb with suspend).

Issue 1 isn’t a huge deal breaker, but issue 2:

What happens is the kernel tries to be clever, uses as much page cache as it thinks is optimal, and leaves only 2-10MB of free memory. Then, when your app gets a request and there isn’t enough memory, the kernel starts to struggle: a simple request that usually takes 50ms ends up taking 10 seconds to respond.
Even if you increase vfs_cache_pressure or manually evict memory, it is either too late or it doesn’t free up any of the page cache.

By opting out of suspend and adding only 128MB of swap, my app (under the same memory usage) runs fine, only bottlenecked by CPU.
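
For anyone wanting to replicate this, the swap setting lives in fly.toml; a minimal sketch (assuming swap_size_mb is the relevant key for guest swap):

```toml
# fly.toml - hedged sketch: give the Machine 128 MB of swap
swap_size_mb = 128
```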

Just my little update, thanks for reading :open_book:

Given the recent CPU accounting changes (ref), I’d be very wary of relying on swap on shared-x machines, as it can put a lot of the already-depleted CPU slices toward managing memory instead of doing actual work.
