Fixing a fun vsock leak on machine resume in a Rust `init` program

This is a companion to @rian’s great writeup on the fix for slow machine resumes; it’s about the other half of the story – how we ended up leaking vsocks inside machines and how that was fixed. It’s pretty heavy on Rust and async I/O internals.

The Premise

If you haven’t read @rian’s Fresh Produce post yet – you should! – here’s a quick recap:

  1. We noticed customer reports about machine resumes getting slower and slower over time
  2. After chasing down a bunch of false leads we eventually narrowed it down to an issue with host → machine communication over a vsock that was timing out
  3. It turns out that we have long-running vsock connections due to HTTP Keep-Alive; when a machine suspends, these connections are supposed to be closed
  4. For one reason or another, the guest init (a custom Rust program we ship with all machines) does not close the socket even after resume.
  5. As time goes on these “zombie” sockets consume more and more vsock ports, and since vsock ports are searched sequentially, this means we need more and more time to search for an available port, causing resumes to be slow.

@rian worked around this bug by making the port search random instead of sequential, making it vanishingly unlikely that we’ll ever hit a port currently consumed by a leaked socket. This leaves the other half of the problem: we still need to figure out why these sockets are leaked in the first place!

Tracking Sockets Down the Rust Async I/O Stack

Our init program for machines is written in Rust and runs on the Tokio async I/O runtime. For the HTTP API endpoints we serve over vsock, we use warp, which in turn serves HTTP traffic using hyper, one of Rust’s most popular asynchronous HTTP stacks. It can be daunting to try to track down resource leakage across this many layers. However, before suspecting anything super contrived, it’s always helpful to start from the basics –

Rust’s async I/O is based on a few building blocks: Futures, and the underlying syscalls that power async event loops. A Future is just an arbitrary structure representing an action that can be polled without blocking.
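For illustration, here’s a toy hand-written Future (not from our codebase) that resolves only after being polled a few times, using the waker to ask the runtime to poll it again:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// A toy Future: resolves after being polled `remaining` more times.
struct Countdown {
    remaining: u32,
}

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.remaining == 0 {
            Poll::Ready(())
        } else {
            self.remaining -= 1;
            // Not done yet: tell the runtime we want to be polled again.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}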

Futures are useless without a mechanism to notify the runtime when to poll them. In our case, Tokio uses epoll under the hood to drive execution of Futures spawned onto its runtime. Tokio registers an Interest on READABLE and WRITABLE events for each socket it manages, which translates to EPOLLIN and EPOLLOUT events in the case of epoll. Very roughly speaking, a Future is polled if its underlying socket(s) get returned by epoll with flags EPOLLIN or EPOLLOUT, for reads and writes respectively.
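As a rough illustration of what that registration looks like, here’s the same idea done by hand with tokio’s low-level AsyncFd API over an arbitrary non-blocking fd (a sketch for this post; tokio-vsock does the equivalent internally for each VsockStream):

use std::os::fd::OwnedFd;

use tokio::io::unix::AsyncFd;
use tokio::io::Interest;

// Hand an already non-blocking fd to tokio; this registers it with the
// runtime's epoll instance with EPOLLIN | EPOLLOUT interest.
async fn wait_until_readable(fd: OwnedFd) -> std::io::Result<()> {
    let async_fd = AsyncFd::with_interest(fd, Interest::READABLE.add(Interest::WRITABLE))?;

    // Suspends this task until epoll reports the fd as readable.
    let mut guard = async_fd.readable().await?;

    // A real reader would issue the read syscall here and, on EWOULDBLOCK,
    // clear readiness and wait again; we just clear it and return.
    guard.clear_ready();
    Ok(())
}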

With this in mind, here’s one dead-simple possible cause for a leaked socket: what if for whatever reason the vsock fd never emits an event after a machine is resumed? That would mean any Futures using it are never polled again to observe the disconnected vsock. If no code observes the disconnection, nothing will close the fd, either, causing a leak.

An Unsuccessful Dirty Fix

We didn’t know at this point what might prevent the Future from being polled, but we had a very, very dirty fix in mind: we could try to just force a poll on a timer! This can be done by wrapping the original I/O object in a custom struct and polling a tokio Interval inside the poll functions that perform I/O, something like

use std::{pin::Pin, task::{Context, Poll}};

use pin_project::pin_project;
use tokio::{io::{AsyncRead, ReadBuf}, time::Interval};
use tokio_vsock::VsockStream;

#[pin_project]
struct MyIoWrapper {
    #[pin]
    inner: VsockStream,
    interval: Interval, // set to fire every couple of seconds
}

// ...

impl AsyncRead for MyIoWrapper {
    fn poll_read(
        self: Pin<&mut Self>, cx: &mut Context<'_>, buf: &mut ReadBuf<'_>,
    ) -> Poll<std::io::Result<()>> {
        let _self = self.project();
        // Polling the interval registers its timer with this task's waker,
        // so poll_read will be called again whenever the interval fires.
        _ = _self.interval.poll_tick(cx);
        _self.inner.poll_read(cx, buf)
    }
}

// ...and do this for write too

Calling interval.poll_tick() inside our poll function causes our poll to be called every time the Interval fires as well. If the underlying file descriptor of the VsockStream is going to return an error, this should cause that error to surface.

Unfortunately, this did nothing to fix our bug. Which means one of two things:

  1. read/write syscalls do not return errors for vsocks that got disconnected due to suspend/resume; so, even though poll_read and poll_write might get called, they don’t observe an error
  2. There is no epoll event of type EPOLLIN or EPOLLOUT at all for these leaked sockets. VsockStream from tokio-vsock also checks readiness flags on poll, so if these flags are never set, it never calls the read/write syscalls at all (see the sketch after this list). Usually, for TCP and UDP sockets, any error also sets EPOLLIN and EPOLLOUT (depending on the registered interest) in addition to EPOLLERR, since calling read or write on the socket will no longer block (it’ll error!).
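To make the second possibility concrete, here’s roughly the shape of a readiness-based poll_read over an AsyncFd (a simplified sketch, not tokio-vsock’s actual code; read_syscall is a hypothetical stand-in for the raw read(2) call). The syscall only runs after the readiness check passes, so if EPOLLIN never shows up, any pending error never gets a chance to surface through read:

use std::os::fd::OwnedFd;
use std::task::{Context, Poll};

use tokio::io::unix::AsyncFd;

// Simplified sketch of a readiness-based poll_read.
fn poll_read_sketch(
    fd: &AsyncFd<OwnedFd>,
    cx: &mut Context<'_>,
    buf: &mut [u8],
) -> Poll<std::io::Result<usize>> {
    loop {
        // Step 1: readiness check. If epoll never reports EPOLLIN for this
        // fd, this returns Pending forever and the syscall below is never reached.
        let mut guard = match fd.poll_read_ready(cx) {
            Poll::Ready(Ok(guard)) => guard,
            Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
            Poll::Pending => return Poll::Pending,
        };

        // Step 2: only now is the read syscall issued; this is where a
        // disconnection error would normally be observed.
        match guard.try_io(|inner| read_syscall(inner.get_ref(), &mut *buf)) {
            Ok(result) => return Poll::Ready(result),
            // Spurious wakeup: try_io cleared the readiness, so wait again.
            Err(_would_block) => continue,
        }
    }
}

// Hypothetical stand-in for the raw read(2) syscall on the fd.
fn read_syscall(_fd: &OwnedFd, _buf: &mut [u8]) -> std::io::Result<usize> {
    unimplemented!()
}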

What’s Going On?

At this point, @rian decided to dig into this a bit more using strace to observe epoll_wait returns. What he found is that when a machine is resumed with a currently-active vsock connection, epoll_wait emits an event with the following flags set: EPOLLOUT | EPOLLERR.

This looked innocent until we thought about it a bit more. For the majority of these HTTP connections under Keep-Alive, the server side (i.e. the init process inside the machine) should be waiting on a read. HTTP is mainly driven by the client, and as long as the client does not send another request, the server can only wait for more data. But if only EPOLLOUT | EPOLLERR are set in this case, the reader side never observes an event of interest, even though Interest::READABLE and Interest::WRITABLE (i.e. EPOLLIN | EPOLLOUT) are always registered for each socket.

In our opinion, this is why the sockets get stuck: we likely have a bunch of Futures all waiting for a vsock to become readable when, in fact, the socket has already errored. Even though we also tried to force a periodic poll in poll_write, that does not help if there is no writer waiting in the queue in the first place.

(Side note: we are not sure what the correct behavior of vsocks should be in this case, but this sounds like a kernel problem; userspace should be notified on both the read and write halves if the other end of the socket is completely gone. Or maybe this is something specific about how Firecracker handles suspends – we have not figured out exactly why yet.)

Wrangling Tokio for EPOLLERR

The fix sounds simple: we should just make everything also listen on EPOLLERR! Unfortunately, it’s not that easy: although EPOLLERR is always reported, tokio does not check Interest::ERROR if you are only polling for read or write readiness. This is why the read side never gets to proceed when only EPOLLERR is observed. We would need to make both the read and the write side poll on EPOLLERR as well.
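For comparison, here’s what we’d want the read path to be able to do, sketched over an AsyncFd we registered ourselves with both interests (the real stream’s registration is owned by tokio-vsock, so we can’t simply do this in place):

use std::os::fd::OwnedFd;

use tokio::io::unix::AsyncFd;
use tokio::io::{Interest, Ready};

// Wait on readable *or* error readiness, so an EPOLLERR-only wakeup still
// unblocks the reader. Assumes the AsyncFd was created with both interests.
async fn wait_read_or_error(fd: &AsyncFd<OwnedFd>) -> std::io::Result<Ready> {
    let guard = fd.ready(Interest::READABLE.add(Interest::ERROR)).await?;
    Ok(guard.ready())
}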

A wrapper like the timer hack above could work. Except tokio does not expose a polling equivalent of the AsyncFd::ready() function; ready() is itself async, meaning it returns another Future. We can’t simply poll such a Future from our poll_read and poll_write wrappers, because it holds a reference to the VsockStream’s underlying fd. As a result, if we created this Future, we couldn’t easily store it inside the same wrapper struct that holds the VsockStream without wrangling with unsafe Rust: the struct would become self-referential, which isn’t exactly compatible with Rust’s lifetime model (it can be made to work using Pin; that’s how Futures themselves are often implemented, but it’s a pain to deal with by hand).

(As a side note: This is a classic problem one might run into with Rust’s async/await; poll-like methods are low-level and harder to write, but they only require temporary access to the underlying resources. Once you wrap poll methods into Future objects, they have to hold longer-living references, and that makes them hard to deal with if you also need to own the original object at the same time.)

In general, it’s not very easy to add a new Interest onto an existing tokio I/O object. Tokio itself doesn’t expose an easy way to do that, and hacking it by wrapping poll_* methods will always run into a situation like the above. In fly-proxy, another big internal Rust project of ours, we have also had to deal with this multiple times before.

We could always make changes to tokio and tokio-vsock to fix this. Right now, though, to get a quick fix out, we’re going to have to use a hacky workaround. It’s the same workaround we’ve used in fly-proxy before: instead of trying to add a new Interest::ERROR onto an existing AsyncFd-based object, we simply dup() the underlying fd, create a dedicated AsyncFd for it (which has a 'static lifetime!), and register that with Interest::ERROR. We can then easily build a Future on top of it that runs a loop waiting on fd.ready(Interest::ERROR). This Future inherits the 'static lifetime and can easily be stored inside the wrapper struct. We drive it by calling its poll method from poll_read and poll_write, just as we did with the timer.

To wrap it up, what we ended up with is something like

use std::{future::Future, os::fd::AsFd, pin::Pin, task::{Context, Poll}};

use futures::{future::BoxFuture, FutureExt};
use pin_project::pin_project;
use tokio::io::{unix::AsyncFd, AsyncRead, Interest, ReadBuf, Ready};
use tokio_vsock::VsockStream;

#[pin_project]
struct MyIoWrapper {
    #[pin]
    inner: VsockStream,
    #[pin]
    err_fut: BoxFuture<'static, std::io::Error>,
}

impl MyIoWrapper {
    pub fn wrap(inner: VsockStream) -> std::io::Result<MyIoWrapper> {
        // Clone the fd so that we can register the ERROR interest.
        let dup = AsyncFd::with_interest(
            inner.as_fd().try_clone_to_owned()?,
            Interest::ERROR,
        )?;

        // This future resolves to a [std::io::Error] if the underlying socket emits an error event
        let err_fut = async move {
            loop {
                let mut ready_guard = match dup.ready(Interest::ERROR).await {
                    Ok(guard) => guard,
                    Err(e) => return e,
                };

                if ready_guard.ready().is_error() {
                    // get_socket_error is a simple wrapper over `getsockopt` with `SO_ERROR`
                    match get_socket_error(&dup) {
                        Some(e) => return e,
                        None => {
                            ready_guard.clear_ready_matching(Ready::ERROR);
                        }
                    }
                }
            }
        }
        .boxed();

        Ok(MyIoWrapper { inner, err_fut })
    }
}

impl AsyncRead for MyIoWrapper {
    fn poll_read(
        self: Pin<&mut Self>, cx: &mut Context<'_>, buf: &mut ReadBuf<'_>,
    ) -> Poll<std::io::Result<()>> {
        let _self = self.project();

        // If err_fut resolves (returns), it means we've seen an error
        // we need to do this because `poll_read` may not actually call `read` if there's no EPOLLIN
        // so we can't rely on it to bubble errors
        if let std::task::Poll::Ready(err) = _self.err_fut.poll(cx) {
            return std::task::Poll::Ready(Err(err));
        }

        _self.inner.poll_read(cx, buf)
    }
}

// ...and do this for write too
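As a usage sketch (the accept loop and serve_connection below are hypothetical stand-ins, not our actual server setup), each accepted connection gets wrapped before it is handed to the HTTP stack:

// Hypothetical accept loop: wrap each accepted stream before serving it.
async fn accept_loop(mut listener: tokio_vsock::VsockListener) -> std::io::Result<()> {
    loop {
        let (stream, _addr) = listener.accept().await?;
        // Wrap the stream so an EPOLLERR-only wakeup surfaces as a
        // read/write error instead of leaving the connection stuck.
        let wrapped = MyIoWrapper::wrap(stream)?;
        tokio::spawn(serve_connection(wrapped));
    }
}

// Hypothetical stand-in: in the real program this hands the connection to
// warp/hyper.
async fn serve_connection(_io: MyIoWrapper) {
    // ...
}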

This fix has already been rolled out; new or updated machines from today onward should not experience the same vsock leakage that caused the original suspend/resume bug.


Just today I had to manually update metadata to restart 4 machines stuck in the starting phase.

Also, I’m still seeing this in the logs just above the ‘Machine started in XXXms’ message:

[info]thread 'main' panicked at libinit/src/vsock.rs:88:23:
[info]`async fn` resumed after completion


Sorry! This looks like an edge case that happens if your machine got suspended in the middle of a vsock request. I’ve just deployed a fix, though you might need to update / redeploy the machine to get the update.
