Major Outage Postmortem (2021-10-13)

At about 7:30AM UTC on October 13th, the hosting provider OVH experienced a major outage. Substantial portions of our infrastructure are hosted on OVH servers, so the OVH outage resulted in a cascading series of failures for Fly.io.

A rough timeline of the outage (all times UTC):

  • 7:30AM: Our API (new deploys), metrics, secrets, and logs are down. We begin mitigating the outage.
  • 8:30AM: Some connectivity in OVH is restored.
  • 9:30AM: All Fly.io apps restart.
  • 10:45AM: Metrics, logs, and API deploys begin functioning (but with degraded responsiveness).
  • 12:45PM: Most services are functioning, but API statuses are stale.
  • 2:30PM: Services functioning nominally.

Here’s what happened.

Our core orchestration stack – the “control plane” of our system – is currently HashiCorp Nomad, Vault, and Consul:

  • Nomad allocates micro-vms for Fly.io apps to specific worker hosts; Nomad agents run on all of our worker hosts, which span several hosting providers. The agents talk to a cluster of Nomad servers.
  • Vault manages both application secrets and TLS certificates for our Anycast CDN.
  • Consul keeps state for all running services in our network, and is a dependency for both Nomad and Vault. As with Nomad, Consul agents run on every server in our fleet, talking to a cluster of Consul servers.

Fly.io apps that are already running don’t continuously depend on any of these services: they keep running, and our Anycast CDN keeps routing traffic for them. But all three services need to be healthy for new apps to be deployed.
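
That dependency is worth making concrete. Here’s a minimal sketch of a pre-flight check along those lines — not our actual deploy pipeline; hostnames and ports are placeholders, and the endpoints are the standard Consul/Nomad/Vault HTTP status APIs:

    # Sketch only: the control plane matters at deploy time, so a deploy can
    # gate on all three services answering their status endpoints. A 200 means
    # the API is reachable; a fuller check would also confirm a Raft leader is
    # elected and Vault is unsealed.
    import urllib.request

    CHECKS = {
        "consul": "http://consul.internal:8500/v1/status/leader",
        "nomad": "http://nomad.internal:4646/v1/status/leader",
        "vault": "http://vault.internal:8200/v1/sys/health",
    }

    def control_plane_healthy() -> bool:
        for name, url in CHECKS.items():
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    if resp.status != 200:
                        print(f"{name}: unhealthy (HTTP {resp.status})")
                        return False
            except OSError as exc:
                print(f"{name}: unreachable ({exc})")
                return False
        return True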

At the beginning of the outage, the server clusters for Nomad, Vault, and Consul depend heavily on OVH. These servers are set up in three-node, high-availability configurations spanning two OVH datacenters. The OVH network outage took multiple datacenters offline simultaneously and disrupted our Consul cluster; functionality in Fly.io that depends on Consul stops working.
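
To see why spanning two datacenters within one provider doesn’t protect a cluster like this: Consul and Nomad servers use Raft consensus, which needs a strict majority of servers to keep a leader. A quick illustration (the 2/1 split is simply the natural layout of three nodes across two datacenters):

    # Raft needs a majority: with three servers split 2/1 across two
    # datacenters, losing the larger datacenter (or, as here, both at once)
    # leaves fewer servers than the quorum size and the cluster stops.
    def quorum(total_servers: int) -> int:
        return total_servers // 2 + 1

    servers = 3
    surviving = servers - 2              # the datacenter with two nodes goes away
    print(quorum(servers))               # 2
    print(surviving >= quorum(servers))  # False: no quorum, no leader, no writes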

The OVH network recovers. Logs and metrics are restored. Consul, Nomad, and Vault are all still inaccessible, so we start investigating why. Exacerbating the problem, we had begun migrating Consul off of OVH days earlier. We have a Consul server cluster that is half deployed on OVH and half on NetActuate servers in Raleigh-Durham (RDU). The new RDU servers are several Consul revisions ahead of the OVH servers. As Consul is restored, the OVH Consul servers restart; the newly re-forming Consul cluster now depends on that newer version of Consul. We update our Consul DNS alias to point at the RDU cluster while we upgrade Consul on OVH to restore it there.
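
A hypothetical sketch of the kind of version check that would flag a split like this before a cluster re-forms — hostnames are placeholders, and it assumes the agent version is exposed under Config.Version in the /v1/agent/self response, as in recent Consul releases:

    # Hypothetical: compare the Consul build running on each server so a
    # half-migrated cluster can't quietly re-form around a newer version.
    import json
    import urllib.request

    SERVERS = ["consul-ovh-1", "consul-ovh-2", "consul-rdu-1"]

    def agent_version(host: str) -> str:
        url = f"http://{host}:8500/v1/agent/self"
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)["Config"]["Version"]

    versions = {host: agent_version(host) for host in SERVERS}
    if len(set(versions.values())) > 1:
        print("mixed-version cluster:", versions)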

We update and restart Consul. We restart Nomad and Vault. We realize that our API is still having issues talking to Vault. After further investigation, we discover that our API is configured to talk to Vault at the same URL as our Consul alias, which now points at RDU. Vault is not yet running on RDU. Our control plane is functioning, but the API that drives it is not. We migrate Vault to RDU.
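
A sketch of the shape of the fix, assuming nothing about our actual API code: keep the two addresses as independent settings so repointing the Consul alias can’t silently repoint Vault. VAULT_ADDR and CONSUL_HTTP_ADDR are the conventional HashiCorp client variables; the defaults below are placeholders.

    # Sketch only: two explicit, independent addresses instead of one shared
    # alias. Moving Consul to RDU then has no effect on where Vault traffic goes.
    import os

    VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.internal:8200")
    CONSUL_HTTP_ADDR = os.environ.get("CONSUL_HTTP_ADDR", "https://consul.internal:8500")

    print("vault  ->", VAULT_ADDR)
    print("consul ->", CONSUL_HTTP_ADDR)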

The RDU Vault problems are remediated. Deploys work through the API again. But the API is degraded: it reports incorrect statuses for running apps. That’s happening because a process we run to sync Nomad and Consul state to SQL (aptly called bubblegum) has been confused by the outage; as Nomad is restored, bubblegum receives state updates in an unexpected order. A code change is required to resolve the resulting deadlock.
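
A hypothetical sketch of the kind of ordering guard involved — not bubblegum’s actual fix: Nomad objects carry a monotonically increasing ModifyIndex, so a sync process can drop updates that arrive out of order rather than letting them wedge its view of an allocation.

    # Hypothetical ordering guard for a Nomad/Consul -> SQL sync process.
    # Stale or duplicate updates (lower ModifyIndex than already applied)
    # are skipped instead of being applied out of order.
    applied_index = {}  # alloc_id -> highest ModifyIndex applied so far

    def apply_update(alloc_id, modify_index, status):
        if modify_index <= applied_index.get(alloc_id, 0):
            return  # out-of-order or repeated update; ignore it
        applied_index[alloc_id] = modify_index
        write_status_to_sql(alloc_id, status)

    def write_status_to_sql(alloc_id, status):
        # placeholder for the real SQL write
        print(f"UPDATE allocs SET status = {status!r} WHERE id = {alloc_id!r}")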

One final problem: the Nomad agent/server system is resilient to transitory failures of the Nomad server cluster. The Nomad servers can fail completely for some time and apps running under Nomad continue to run. But after a configurable amount of lost connectivity to Nomad servers, the agents restart their jobs automatically to restore a clean state for Nomad. We’ve left that configurable timeout at a default that makes sense for a typical Nomad deployment, but not for a global edge network. So midway through the outage, after the Nomad cluster heals and the agents notice, every Fly.io app restarts. Notably, this restart occurs at a point where Vault is not universally functional, so some apps fail to restart cleanly.
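
A simplified sketch of the agent behavior described above — not Nomad’s actual implementation, and the timeout name and value are placeholders: after a configured window without server contact, the agent restarts its local jobs, which is reasonable for a single-datacenter cluster but punishing for a global fleet riding out an upstream outage.

    # Simplified sketch of the disconnect-restart behavior; names are made up.
    import time

    RESTART_AFTER_DISCONNECTED = 15 * 60  # seconds; hypothetical value

    def agent_loop(servers_reachable, restart_local_jobs):
        last_contact = time.monotonic()
        while True:
            if servers_reachable():
                last_contact = time.monotonic()
            elif time.monotonic() - last_contact > RESTART_AFTER_DISCONNECTED:
                restart_local_jobs()  # the step that restarted every Fly.io app
                last_contact = time.monotonic()
            time.sleep(10)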

Once these issues were resolved, most apps came back online cleanly. A small number required intervention, which we worked to identify and remediate.

This outage begins with a major upstream outage at OVH. But almost everything that goes wrong at Fly.io happens because our own systems have failed in unpredicted ways after that outage. We own this outage completely.

In a nutshell:

  • A large chunk of our infrastructure is taken offline for over an hour due to an upstream outage.
  • The outage occurs while we’re in the middle of a migration from that upstream, so our clusters don’t heal gracefully, and we lose two hours resolving that.
  • In the process of rapidly migrating one service (Vault), we introduce a configuration problem that takes it offline for several additional hours.
  • For several further hours after the outage is remediated, our API reports faulty status (services appear down that are in fact up) because of a deadlock that occurs during the reboot of our Nomad cluster.
  • A Nomad configuration setting tuned for local clusters forces all customer apps to restart during the outage.

We’ve now migrated our entire HashiCorp stack over to the new hosts, so there is no uncertainty about services running in some places but not others. Centralized services like these are inherently flawed for a global stack like ours, however, and over the next few months we are working on new software that will be much more resilient to localized outages.

We have more to say about all of this, including more detail on what we’re doing moving forward, and particularly the roles that OVH, Nomad and Consul will continue to play in our system. For the moment, though, we just want to make sure people know what happened.


Thanks for the detailed writeup. From a customer point of view, are the following takeaways valid?

  • Once an app is deployed and running in multiple regions on Fly, there is no impact even if Fly’s command & control servers have problems (assuming the app is running outside of the problem regions). The anycast system ensures that requests will continue to be served from those edges once the instances are up and running.
  • Deploys and app creation will be affected by a control plane outage (this is expected, I guess).
  • Is Fly now splitting / migrating the control plane across two datacenter providers (RDU + OVH), or just moving to RDU? (Not sure which.)
  • Scaling will not work if the control plane is down.