Fixed: unreasonably slow resumes of suspended Machines

We fixed a bug that caused a small number of Fly Machines to take way too long to resume from the suspended state - sometimes over 30 seconds for tiny 256MB machines. If your machine was affected, resume times could get worse with each suspend/resume cycle until you rebooted.

Machines need to be updated for the fix to take effect. If you don’t want to deploy, you can make an arbitrary metadata change with fly machine update <machine-id> --yes --metadata foo=bar to force an update.

What was happening?

When a Fly Machine resumes from suspend, our supervisor (flyd) needs to confirm the machine is ready by making a request to the init process inside the VM over vsock (a virtual socket connection between host and guest). We wait for this status check to succeed before marking your machine as “started”.
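The wait loop is conceptually simple; here's a rough Python sketch (the names, timings, and retry policy are illustrative, not flyd's actual code):

```python
import time

def wait_for_started(check, max_attempts=8, base_delay=0.1):
    """Poll a status check until it succeeds, backing off between
    failures. `check` stands in for the vsock request to init and
    returns True once the guest answers."""
    delay = base_delay
    for attempt in range(max_attempts):
        if check():
            return attempt  # failed attempts before success
        time.sleep(delay)
        delay *= 2          # exponential backoff
    raise TimeoutError("machine never reported ready")

# A fake check that fails twice before succeeding:
results = iter([False, False, True])
assert wait_for_started(lambda: next(results), base_delay=0.001) == 2
```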

Separately, flyd also opens a long-lived connection to init for CPU usage monitoring. vsock connections don’t survive being suspended - they’re reset when the machine is resumed.

Our init process had a bug: on resume, it ignored the kernel’s connection reset event for this long-lived connection, and it never called close on the now-disconnected vsock socket.
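The correct behavior is roughly: when the long-lived reader sees a reset or EOF, close the socket so its port pair is released. A hypothetical sketch, using an ordinary socket pair in place of a real vsock connection:

```python
import socket

def monitor_until_disconnect(sock):
    """Read from a long-lived connection; when the peer goes away
    (EOF or connection reset), close our end instead of leaking it."""
    try:
        while True:
            data = sock.recv(4096)
            if not data:          # EOF: the connection is gone
                break
            # ... process CPU usage samples here ...
    except ConnectionResetError:  # how a reset surfaces on read
        pass
    finally:
        sock.close()              # the fix: always release the socket

# Simulate a resume-time disconnect with an ordinary socketpair:
a, b = socket.socketpair()
b.close()                 # peer side disappears
monitor_until_disconnect(a)
assert a.fileno() == -1   # our end was closed, not leaked
```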

Leaking a socket on each suspend/resume cycle is bad, but by itself it shouldn’t be a huge problem (at least not until init hits the open file limit :scream:).

vsock connections work a little like TCP: they have source and destination port numbers that identify each connection. When a client opens a new connection, it selects a source port number for that connection. If you try to open a new vsock connection using a (source, destination) port pair that’s already in use on the destination, it’ll be rejected. A disconnected but unclosed socket is still considered in-use.
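That in-use rule can be modeled as a set of (source, destination) pairs that only an explicit close ever shrinks. A toy model, nothing like the kernel's real bookkeeping:

```python
class VsockGuest:
    """Toy model of a guest's vsock connection table."""
    def __init__(self):
        self.in_use = set()   # (src_port, dst_port) pairs

    def connect(self, src_port, dst_port):
        pair = (src_port, dst_port)
        if pair in self.in_use:
            return False      # rejected: pair already in use
        self.in_use.add(pair)
        return True

    def close(self, src_port, dst_port):
        self.in_use.discard((src_port, dst_port))

guest = VsockGuest()
assert guest.connect(1073741824, 10)      # first connection succeeds
assert not guest.connect(1073741824, 10)  # same pair: rejected
# A disconnected-but-unclosed socket still occupies the pair;
# only an explicit close frees it:
guest.close(1073741824, 10)
assert guest.connect(1073741824, 10)
```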

Firecracker allocates these source port numbers incrementally, always starting from the same fixed value (1073741824[1]), even on resume.

If the machine was unlucky (this didn’t always happen!), the leaky long-lived connection could repeatedly end up being the first successful connection after each resume, claiming the lowest free port. The machine would eventually reach a state where a large contiguous run of source port numbers, starting from 1073741824, was entirely consumed by leaked connections.
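Under a toy model of the incremental allocator, each unlucky suspend/resume cycle extends the blocked range by one port:

```python
FIRST_PORT = 1 << 30  # Firecracker's fixed starting source port

def resume_cycles(leaked, cycles):
    """Simulate suspend/resume cycles in which the first successful
    connection after each resume is the leaky long-lived one."""
    for _ in range(cycles):
        port = FIRST_PORT
        while port in leaked:   # skip ports held by leaked connections
            port += 1
        leaked.add(port)        # this connection leaks on the next suspend
    return leaked

leaked = resume_cycles(set(), cycles=5)
assert sorted(leaked) == [FIRST_PORT + i for i in range(5)]
```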

For each leaked connection in that contiguous range, the status check request would fail and need to retry (with backoff). This caused the delay before marking the machine as “started” to grow with every suspend/resume cycle.
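With made-up backoff parameters, the cost of walking past the leaked range looks something like this (one failed attempt per leaked port):

```python
def startup_delay(leaked_count, base=0.5, cap=5.0):
    """Total time (seconds) the status check spends retrying when the
    first `leaked_count` source ports are consumed by leaked
    connections. The backoff numbers are invented for illustration."""
    delay, total = base, 0.0
    for _ in range(leaked_count):  # one failed attempt per leaked port
        total += delay
        delay = min(delay * 2, cap)
    return total

assert startup_delay(0) == 0.0   # healthy machine: no extra delay
assert startup_delay(3) == 3.5   # 0.5 + 1.0 + 2.0 seconds
```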

How we fixed it

We’ve fixed this in two places:

  1. init will properly close long-lived connections after suspend/resume.
  2. Firecracker now picks a random initial vsock port on startup, so even if a guest process is badly behaved, it’s very unlikely to hit a leaked connection.
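A sketch of why the randomized start helps (toy allocator, invented details):

```python
import random

PORT_SPACE = 1 << 32  # vsock port numbers are 32-bit

def make_allocator(start):
    """Incremental source-port allocator beginning at `start`."""
    state = {"next": start}
    def alloc():
        port = state["next"] % PORT_SPACE
        state["next"] += 1
        return port
    return alloc

leaked = {(1 << 30) + i for i in range(100)}  # ports stuck from leaks

old_alloc = make_allocator(1 << 30)           # old fixed start
assert old_alloc() in leaked                  # first pick collides

new_alloc = make_allocator(random.randrange(PORT_SPACE))
# The chance of a random start landing in a 100-port leaked range is
# about 100 / 2**32, i.e. effectively zero.
```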

  1. 2^30, if you were wondering


I was affected and didn’t know what to conclude from that observation. I hadn’t done anything yet other than restarting the affected machines from time to time.

Thank you so much for taking notice and implementing a fix!

And thank you also for the very informative and interesting read about the details of that bug. I always like reading the tech details of the fresh produce posts :slight_smile:


I noticed a significant increase in resume speed! Fly is finally doing what it promised it would do :joy:

Cold resumes were previously painfully slow and unreliable.


When a client complained about a slow machine start, I checked and found they were right. Resume from suspend was taking so long that it sometimes outright failed.

Thankfully a quick search here found the solution!
