Fixed: MPG reporting as initializing when it's actually running

Last week we shipped a fix for a subtle bug where MPG clusters could show up as initializing even though their Fly Machines were already up and serving traffic.

What was going on?
We use Kubernetes pod UIDs to keep Fly Machines in sync with our internal Kubernetes state.

In our setup, “pods” are actually Fly Machines. Whenever Kubernetes creates a pod, it assigns it a new UID. Instead of always creating a brand new Fly Machine for each new UID, we often reuse an existing machine: we update it with the new configuration and store the UID in its metadata.

Our Virtual Kubelet runs this reconciliation every few seconds, and each pass can take one of two paths: a full machine update via the Update Machine endpoint, or, when we decide that changing the metadata is enough, a metadata-only update via the Update Metadata endpoint.
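To make the two paths concrete, here is a minimal sketch of how such a decision could look. It is not our actual Virtual Kubelet code: the `MachineState` type, the `choose_update_path` function, and the rule "config changed means full update" are all assumptions made for illustration, as is the extra "noop" result.

```python
from dataclasses import dataclass, field


@dataclass
class MachineState:
    """Hypothetical view of a machine: its config plus its metadata keys."""
    config: dict
    metadata: dict = field(default_factory=dict)


def choose_update_path(current: MachineState, desired: MachineState) -> str:
    """Pick an update path for one reconciliation pass (illustrative rule only)."""
    if current.config != desired.config:
        # Assumption: any config difference needs the full Update Machine path.
        return "full_update"
    if current.metadata != desired.metadata:
        # Only metadata differs, so the lighter Update Metadata path is enough.
        return "metadata_update"
    # Nothing differs, so this pass has nothing to do.
    return "noop"
```

In this sketch, a pod that gets a new UID but keeps the same configuration lands on the metadata-only path, which is the path the bug lived in.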

In most cases, if we miss a metadata update, Kubernetes simply asks for it again in the next reconciliation cycle. The Virtual Kubelet sees that some annotation is missing and tries to patch it again, and things self-correct.

The UID annotation, however, is special.

We use UIDs to look up machines. On the next reconciliation cycle, the Virtual Kubelet will try to find a machine by the new UID. If the UID annotation on the machine was never updated, this lookup fails. From the Kubernetes point of view, it looks like the machine for that pod UID doesn’t exist yet, so we assume it’s still being created and keep waiting.

We found a case where, when updating multiple metadata keys for the same machine at once, we could silently drop one of them. If the one we dropped was the UID update, the system would never fix itself in future reconciliation cycles. The machine would be running, but we’d keep reporting it as initializing.
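The difference between a dropped ordinary annotation and a dropped UID can be shown with a small simulation. This is a simplified sketch, not our real code: the `pod_uid` key name, the dictionary shapes, and the `find_machine` and `reconcile` helpers are hypothetical.

```python
UID_KEY = "pod_uid"  # hypothetical metadata key that stores the pod UID


def find_machine(machines, pod_uid):
    """Return the first machine whose stored UID matches, or None."""
    for machine in machines:
        if machine["metadata"].get(UID_KEY) == pod_uid:
            return machine
    return None


def reconcile(machines, pod):
    """Run one simplified reconciliation pass and return the reported status."""
    machine = find_machine(machines, pod["uid"])
    if machine is None:
        # No machine carries this UID, so the pod looks like it is still
        # being created. Nothing gets patched on this pass.
        return "initializing"
    # Machine found: re-apply any annotation that is missing or stale.
    for key, value in pod["annotations"].items():
        if machine["metadata"].get(key) != value:
            machine["metadata"][key] = value
    return "running"
```

If a machine holds the new UID but is missing some other annotation, `reconcile` finds it and patches the annotation on the next pass. If the machine still holds the old UID, `find_machine` returns `None` on every pass, so the status stays "initializing" no matter how many times the loop runs.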

How we fixed it
We fixed the path where a metadata update could be silently dropped.

Now, if updating the UID (or any other metadata) fails, the Virtual Kubelet retries until it succeeds, so the machine and Kubernetes state stay in sync.
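One way to picture this kind of retry is a loop that reads the stored metadata back and rewrites whatever is still missing. The sketch below is an illustration under assumptions, not the shipped fix: `sync_metadata`, `FakeClient`, and the `get_metadata`/`set_metadata` method names are hypothetical, and the loop is bounded by `max_rounds` only so the example terminates.

```python
import time


class FakeClient:
    """In-memory stand-in that silently drops the first write to one key."""

    def __init__(self, drop_once=None):
        self.store = {}
        self.drop_once = drop_once

    def get_metadata(self, machine_id):
        return dict(self.store.get(machine_id, {}))

    def set_metadata(self, machine_id, key, value):
        if key == self.drop_once:
            self.drop_once = None  # drop only the first write to this key
            return
        self.store.setdefault(machine_id, {})[key] = value


def sync_metadata(client, machine_id, desired, max_rounds=5, delay_s=0.0):
    """Write metadata, verify by reading it back, and retry missing keys.

    Returns the number of write rounds that were needed.
    """
    missing = {}
    for round_no in range(max_rounds + 1):
        stored = client.get_metadata(machine_id)
        missing = {k: v for k, v in desired.items() if stored.get(k) != v}
        if not missing:
            return round_no
        if round_no == max_rounds:
            break
        for key, value in missing.items():
            try:
                client.set_metadata(machine_id, key, value)
            except OSError:
                pass  # a failed write is picked up again on the next round
        time.sleep(delay_s)
    raise RuntimeError(f"metadata still out of sync: {sorted(missing)}")
```

The design point is that success is judged by what is actually stored on the machine rather than by whether the write call returned, so a silently dropped key is treated the same as a failed one.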

This prevents machines from getting “lost” behind an outdated UID and stops clusters from being stuck in initializing while already running.

When will this be available?
We rolled this out last week, and the fix is now live on all MPG clusters!
