Host Issue notifications

Hi Fly.io community :wave:

A few weeks ago we experienced some novel hardware failures that caused volumes attached to the affected hosts to become unrecoverable.

These hosts’ disks started failing during, or shortly after, a NATS cluster incident where a newly provisioned host reused an IP address from a decommissioned server. The IP reuse was not handled gracefully by the existing NATS configuration.

We run a NATS super cluster. In this topology there is a NATS cluster in each region, and the regional clusters communicate with each other via a single ‘gateway’ connection from one of the nodes in each cluster. The gateway node is chosen at random and gossiped to the other nodes. When the newly provisioned server came up, there was a mismatch between the state the existing nodes held for cluster members and this new server: they saw its IP as belonging to a node in one region, while the new server was actually part of a different region. This caused a storm of errors and connection retries.
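For context, a regional gateway configuration in this kind of topology looks roughly like the sketch below. This is purely illustrative, not our production config; the region names, port, and URLs are made up:

```
# nats-server.conf (illustrative sketch only)
gateway {
  name: "iad"            # the region this cluster belongs to
  port: 7222
  gateways: [
    {name: "fra", urls: ["nats://fra-gateway.example.internal:7222"]},
    {name: "syd", urls: ["nats://syd-gateway.example.internal:7222"]}
  ]
}
```

Since gateway endpoints are also learned via gossip, a reused IP that now answers as part of a different region is exactly the kind of state mismatch that kicked off the retry storm.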

Our hypothesis is that the sustained I/O pressure surfaced latent hardware issues, which manifested within a short span of time. We had several defunct NVMe disks, RAM failures, flapping NICs, and more. There was even a host with a couple of drive bays fried; to quote our provider’s remote hands: “We took the server apart. Not a soft issue.”

For some of these hosts, it was enough to replace the failed NVMe drives. For others, repair wasn’t worthwhile and they have been decommissioned; when possible, those hosts’ disks were transplanted to new systems.

When we need to perform maintenance activities or there are hardware failures like the ones above, we typically declare a “Host Issue”.

Generally, host issues are resolved quickly. However, in the case of hardware failures, hosts may be under maintenance for several days while our providers replace the faulty components.

If you have apps running on an affected host, the host issue is displayed to you in your Fly.io Dashboard.

Now, we’re also sending you email notifications!

The Infrastructure team has released a notification process that emails organization administrators about apps and volumes affected by a host issue, so you can take the appropriate action to ensure your app’s availability.

We will keep iterating on our tooling to notify you about host issues, but it’s crucial to remember that hardware failures are inevitable. For better resilience, we recommend running your apps across multiple machines, especially if you are using volumes.
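For example, for an app that mounts a volume, adding a second Machine with its own volume usually looks something like the sketch below (app name, volume name, region, and size are placeholders; check the scaling docs for the exact flags in your flyctl version):

```
# Create a second volume in the same region so a second Machine has somewhere to mount
fly volumes create data --region ord --size 10 -a my-app

# Scale to two Machines; the new Machine should pick up the unattached volume
fly scale count 2 -a my-app
```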

Let us know your feedback!


Thanks for this! My (possibly naive) question is this:

(edit: it turned in to an essay, sorry)

From a consumer perspective, I normally don’t hear about individual host failures nearly as much as I have from Fly. It may be because of the increased level of insight you guys provide. But I’ve had single DO droplets running for years without a blip of downtime. Yet I have downtime regularly on Fly, even with machines set up in HA: https://status.foundry.ac/ (this doesn’t include my TimescaleDB cluster that failed twice a few weeks ago)

My question is, why isn’t this as common on other providers? Do they have a system set up to automatically fail over to new hardware in a new zone if there is a host issue? Even though you guys clearly state in the docs to always set up multiple machines, this is clearly a cause of confusion for many customers. As honest feedback: I know you can disable --ha on deploy…but sometimes too much flexibility hurts the user experience and sets the wrong expectations about how much downtime can (and has) occurred. Instead of advertising the already insanely cheap $2 deploy option, I almost think a starting cost of $4 that guarantees 2 machines running would help. It’s sort of like how I wish the tip % was included in the original cost of the food I’m purchasing instead of me having to decide what % I should be paying the waiter/waitress. Years ago when I started deploying on Fly, I certainly thought I could expect the same hardware reliability I got from AWS/GCloud/DO by running a single machine.

I don’t know if I’m asking this correctly, but it just seems like I’ve had to be a lot more “present” in managing my clusters since moving to Fly than I ever was before. It’s never been terrible, but it’s certainly a hassle at times.

I hope these are reasonable thoughts. Anyways, thanks for the continued improvements on notifications :pray:


Hi @uncvrd,

The relevant difference between Fly Volumes and other providers’ storage products such as Digital Ocean’s Volumes Block Storage (to take your example), or AWS EBS, Google Persistent Disk, etc., is that Fly Volumes are built on drives physically attached to servers, not on a Storage Area Network elsewhere in the datacenter. SAN-based block storage can attach to different servers, which allows VM instances to float more freely across servers and quickly recover from hardware failures: quietly reboot on another server and be back up after a few minutes of interruption. Fly Volumes are a simpler, lower-level setup, which makes them more cost-effective and higher-performance, but also stickier to their physical server, making automatic migration/recovery more challenging.

That said, while individual-host issues are something to expect and be prepared for, it’s worth noting that hardware issues are relatively rare overall and only impact a tiny percentage of servers.

An HA setup should be good protection against individual-host hardware issues. In your case, it looks like your HA cluster was impacted by an unfortunate power outage that took out an entire datacenter in sjc; that kind of region-wide incident can take down even an HA app with multiple machines in a single region. Machines in multiple regions would be a more complete (and yes, more costly) protection, though datacenter-wide incidents are even rarer than individual-host issues; sjc was just a recent unfortunate outlier.
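If you do want that extra protection, spreading an existing HA app across a second region can look roughly like this (region names are just examples, and Machines with volumes also need volumes created in the new region; see the scaling docs for specifics):

```
# Keep two Machines in the primary region (example: sjc)
fly scale count 2 --region sjc -a my-app

# Add two more Machines in a second region (example: ord)
fly scale count 2 --region ord -a my-app
```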


Thanks for your response @wjordan. I believe I understand, but what about stateless applications that don’t need to be tied to a volume? Would those be allowed to drift between servers for quick recovery?

Also, when you say “making automatic migration/recovery more challenging” for volumes, does this mean it could technically be feasible in the future? As in, something that may be tackled at some point? Just curious (and hopeful)!

I dunno… sometimes I just feel like I see constant “goal post stretching”. When reading forums, I see the following sequence quite often:

“Looks like you’re deploying only a single machine”

Then

“Although somewhat rare for hardware failure, make sure you’re deploying multiple machines”

Then

“Although somewhat rare for a region to go down, make sure you’re deploying to multiple regions”

It’s this kind of confusion that frustrates some people (i.e. me lol). And with the barrier to entry being so darn low, I see a lot of new engineers frustrated with Fly when they see verbiage like the above, because of this mismatch between how simple it is to deploy and how hard it is to maintain.

Lastly, for what it’s worth, basically all my Fly apps are either multi-region Raft or stateless multi-region…my TimescaleDB Postgres cluster was in a single region, and that is my fault.

I wrote this back when it happened, in case it helps anyone in the future.

Appreciate your time!


Yes, stateless apps automatically recover from individual-host failures by shifting workloads between two or more Machines on different servers; App Availability and Resiliency · Fly Docs has details on this. In short, for service-based apps, the default deploy is two Machines configured to auto-start and auto-stop, and for apps without services, the default deploy is an always-on Machine plus a standby Machine that starts only if its paired Machine becomes unavailable.
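For the service-based case, the knobs behind that default live in fly.toml and look roughly like this (a sketch only; the port and values here are placeholders, not necessarily your app’s settings):

```
# fly.toml (sketch)
[http_service]
  internal_port = 8080         # example port
  auto_stop_machines = true    # stop Machines when traffic drops off
  auto_start_machines = true   # start a stopped Machine when requests arrive
  min_machines_running = 0     # Machines to keep running in the primary region
```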

On what might technically be feasible in the future for Volumes, see Bottomless S3-backed volumes for some experimental work we’ve done on Fly Volumes backed by durable object storage.

Fly apps come in many shapes and sizes: development apps are fine on a single machine, many production apps work best as a low-latency cluster of machines in a single region to protect against individual-host failures, and uptime-critical production apps might want to deploy machines in multiple regions with cross-region replication/failover to protect against region-wide failures. The tradeoffs between architectural choices depend on the particular app and the amount of availability risk it can tolerate, so there’s no one-size-fits-all answer on what’s best. In any case, I hear your confusion/frustration on this, and appreciate the feedback.

Is this something to worry about only if you’re using volumes?

Do machines without volumes automatically recover from such errors?

I have a machine which I can’t remove due to an unreachable host. Would be great if it could be removed :slight_smile:

Error: could not get machine e82d929c7e71e8: failed to get VM e82d929c7e71e8: request returned non-2xx status, 408 (Request ID: 01HXH8WXZ0V1QQBWCG5H2CWZ89-fra)