Hi Fly.io community,
A few weeks ago we experienced some novel hardware failures that caused volumes attached to the affected hosts to become unrecoverable.
These hosts’ disks started failing during, or shortly after, a NATS cluster incident in which a newly provisioned host reused an IP address from a decommissioned server. The existing NATS configuration did not handle the IP reuse gracefully.
We run a NATS super cluster. In this topology we run a NATS cluster in each region, and the clusters all communicate with each other via a single ‘gateway’ connection from one of the nodes in the cluster. This gateway node is chosen at random and gossiped to the other nodes. When the newly provisioned server came up, the state the existing nodes held for cluster members did not match this new server: they saw its IP as belonging to a node in one region, but the new server was part of a different region. This caused a storm of errors and connection retries.
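To make the mismatch concrete, here is a minimal sketch of the kind of check that would flag it. This is not how NATS tracks membership internally; the `Peer` type, the `find_stale_entries` helper, the region names, and the IP addresses are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    """A cluster member as remembered by other nodes (hypothetical model)."""
    ip: str
    region: str


def find_stale_entries(known_peers, announcing):
    """Return remembered peers that share the announcing peer's IP
    but are recorded under a different region."""
    return [
        p for p in known_peers
        if p.ip == announcing.ip and p.region != announcing.region
    ]


# State still held by existing nodes; 192.0.2.5 belonged to a host
# that has since been decommissioned.
known = [
    Peer("192.0.2.5", "region-a"),
    Peer("192.0.2.6", "region-a"),
    Peer("192.0.2.9", "region-b"),
]

# The newly provisioned host reuses 192.0.2.5 but lives in another region.
new_host = Peer("192.0.2.5", "region-b")

conflicts = find_stale_entries(known, new_host)
for stale in conflicts:
    print(f"{stale.ip}: recorded in {stale.region}, "
          f"now announced from {new_host.region}")
```

In this toy model, any non-empty result means the remembered entry should be evicted before the new host is trusted, rather than retried against.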
Our hypothesis is that the sustained I/O pressure surfaced latent hardware issues, which all manifested within a short span of time. We had several defunct NVMe disks, RAM failures, flapping NICs, etc. There was even a host with a couple of drive bays fried; to quote our provider’s remote hands: “We took the server apart. Not a soft issue.”
For some of these hosts, it was enough to replace the failed NVMe drives. Others were not worth repairing and are now decommissioned; when possible, those hosts’ disks were transplanted to new systems.
When we need to perform maintenance activities or there are hardware failures like the ones above, we typically declare a “Host Issue”.
Generally, host issues are resolved quickly. However, when the cause is a hardware failure, the host may be under maintenance for several days while our providers replace the faulty components.
If you have apps running on an affected host, the host issue is displayed to you in your Fly.io Dashboard.
Now, we’re also sending you email notifications!
The Infrastructure team has released a notification process that emails organization administrators about apps and volumes affected by a host issue, so you can take the appropriate action to keep your app available.
We will keep iterating on our tooling to notify you about host issues, but it’s crucial to remember that hardware failures are inevitable. For better resilience, we recommend running your apps across multiple machines, especially if you are using volumes.
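As a rough illustration of that recommendation, the sketch below finds apps whose machines all sit on a single host, which means one host issue would take them down entirely. The inventory format, the `apps_on_single_host` helper, and the app and host names are assumptions made up for this example; they do not reflect any Fly.io API or data format.

```python
from collections import defaultdict


def apps_on_single_host(machines):
    """Given (app, host) pairs, return the apps whose machines
    are all placed on one host, sorted by name."""
    hosts_by_app = defaultdict(set)
    for app, host in machines:
        hosts_by_app[app].add(host)
    return sorted(app for app, hosts in hosts_by_app.items() if len(hosts) == 1)


# Hypothetical inventory: one entry per machine.
inventory = [
    ("web", "host-1"),
    ("web", "host-2"),     # spread across two hosts
    ("db", "host-3"),      # a single machine
    ("worker", "host-1"),
    ("worker", "host-1"),  # two machines, but on the same host
]

exposed = apps_on_single_host(inventory)
print(exposed)
```

Note that `worker` is flagged even though it has two machines: a second machine only helps if it lands on different hardware.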
Let us know your feedback!