Tips for avoiding outages after the Apr 27, 2024 incident

We are well into a migration from Google Kubernetes Engine to Fly, and while we are really happy with Fly so far (less than a month), I must say that in the five or so years we were with GKE, we never experienced an outage due to something Google/GKE platform-related.

I totally understand that outages are inevitable, but I thought that deploying apps across regions would provide some protection. One of our apps is deployed across DFW and ORD, and it still went down.

Besides deploying apps across more regions, are there any other tips for avoiding outages? Are there recommended regions to which we should be deploying for the best resiliency?

It’s unfortunately something you’ll just have to deal with when using Fly.io. I’ve been a customer for >3 years and reliability hasn’t really improved (from an end-user perspective). When it works, it’s great. They’re aware of the reliability issues, of course, and we can only hope we see improvements going forward :confused:

Sydney has been pretty reliable; we haven’t had an outage that took down our servers in a long time. As with anything, building resilience into your architecture is the best way to mitigate individual server outages.

The most recent incident affected the whole of Fly (Fly.io Status - Elevated errors and connectivity problems). Since that’s about the worst case there is, I don’t think there’s anything you could do on your side to avoid outages like it if you want to keep your apps on Fly.

FWIW I haven’t really had an outage in the last 12 months on Fly (except the most recent one), and I’m running an app that serves over 200M requests per month across 2-3 regions (cdg, waw and iad).

I really hope, though, that Fly is working on improving the stability of their infrastructure to squeeze out the last few percentage points of uptime.

I do have other apps in DFW and ORD, but only one was affected, so the “Fly.io Status - Elevated errors and connectivity problems” incident was unique.

Interesting, good to know! I thought Fly had a complete outage in all regions, because I couldn’t even reach Fly itself or any of their other services like Grafana.

We’re also ramping up our usage of Fly and, while yesterday was the worst instability we’ve observed, it wasn’t the first over our (relatively) short time PoCing on the platform.

Fwiw, here are some steps we’ve taken to better insulate ourselves:

  1. Fly doesn’t own our edge anymore. I flipped Cloudflare from DNS-only to proxied. At least for our static and static-ish content, we now have some cache controls to insulate us from a misbehaving Fly proxy and WireGuard network woes (big culprits lately).

  2. More standby machines w/ minimum running instances > 0. Scale-to-zero is a nice cost-saving measure, but Fly proxy has failed to scale suspended machines on us several times now. I wish we had more control over idle instance placement, but at least having a few machines always on means Fly proxy can misbehave a little and we’ll still service low traffic load.

  3. Blue-green deployment strategy. Rather than in-place, we now force provisioning of new machines when rolling a new app version. In case of platform instability during a deployment, healthy machines are left alone. Even if we can’t bring up a new version, at least we don’t simultaneously sack the existing one.
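To illustrate the idea behind item 3, here is a minimal Python sketch of the blue-green rule "never stop the old version until the new one is proven healthy". It is not Fly's implementation or API: `Machine`, `start_machine`, and `is_healthy` are hypothetical stand-ins for whatever your platform tooling provides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Machine:
    """Hypothetical stand-in for a platform machine; not a Fly.io API type."""
    name: str
    version: str
    running: bool = True


def blue_green_deploy(
    current: List[Machine],
    new_version: str,
    start_machine: Callable[[str, str], Machine],
    is_healthy: Callable[[Machine], bool],
) -> List[Machine]:
    """Return the machines that should serve traffic after the rollout.

    New ("green") machines are started alongside the existing ("blue") ones.
    Blue machines are stopped only if every green machine passes its health
    check; otherwise the green machines are discarded and blue keeps serving.
    """
    green: List[Machine] = []
    try:
        for i in range(len(current)):
            green.append(start_machine(f"green-{i}", new_version))
    except RuntimeError:
        # The platform could not start a machine (for example, while the
        # orchestrator is degraded). Abandon the rollout, leave blue untouched.
        for m in green:
            m.running = False
        return current

    if not all(is_healthy(m) for m in green):
        # Green came up but is not healthy: discard it and keep blue.
        for m in green:
            m.running = False
        return current

    # Cut over: only now is it safe to stop the old version.
    for m in current:
        m.running = False
    return green
```

The point of the design is that a failed or half-finished rollout is a no-op for the version already serving traffic, which is what you want when the platform itself is unstable mid-deploy.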

All that said, I too am concerned with yesterday’s outage and general platform instability. Data planes go down. It’s our job to architect our apps to be resilient to that fact. Control planes go down too; however, it took hours before we observed (near) full recovery, with lots of manual intervention on our part to try and stabilize our (very low risk) pilot apps. I’m hoping we see a public retro for this one—that’d go a long way in easing my mind.

Cheers and good luck :beers:

Reliability is starting to feel like a real issue.

I would like to see a statement from their team that they are prioritizing it.

The developer experience of Fly has been great, but the service reliability is becoming a deal breaker and is making it hard to recommend for large scale production apps.

Hmmm…I didn’t notice any disturbances in any user metrics over the past 48 hours.

That being said, the app is running in 9 regions, over-provisioned, has no autoscaling, and uses SQLite locally for all read queries. We’ve been running prod on Fly since 2022-08; it has gotten better, but it’s still somewhat adventurous.

What happened during the incident? I’m not sure what “elevated errors across fly” means. Was this an issue with deploying or with serving traffic?

Hey @marcandrews and others,

We hope to eventually provide a more detailed public writeup of the two incidents that occurred this past week on Apr 24 and Apr 27. For now, a brief summary: a centralized configuration change triggered an error in a service that maintains our WireGuard mesh on each host. That set off a chain reaction that eventually caused host network interfaces to start flapping, resulting in global network instability. Most hosts stopped flapping their interfaces and recovered on their own once we reverted the change, though a small number of hosts got stuck offline and required manual intervention.

The Apr 27 incident additionally triggered an issue with NATS-cluster registration that indirectly impacted Vector log collection and our Machine orchestrator across a large number of hosts, causing Machine starts to fail until we resolved this additional issue.

Some kinds of outages are indeed inevitable: disks fail, servers crash, datacenters have power outages or network disruptions beyond our control. Deploying apps across multiple regions does provide protection against these kinds of inevitable failures. Global incidents triggering correlated failures across regions are harder to protect against. Deploying more machines across more regions can reduce the impact of some kinds of global failures on your app, where some machines are randomly less impacted than others (as was the case this time), or if the impact is spread out over time.
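As a rough way to see why extra machines help with independent failures but not with fully correlated ones, here is a small Python sketch. It assumes a deliberately simplified model: each machine is down independently with probability `p_machine_down`, and a global incident (probability `p_global_outage`) is treated pessimistically as taking every machine down at once. The function name and all numbers are illustrative assumptions, not measurements of any platform.

```python
def p_app_up(n_machines: int, p_machine_down: float, p_global_outage: float = 0.0) -> float:
    """Probability that at least one machine is serving, under the toy model.

    Machines fail independently with probability p_machine_down, except
    during a global incident (probability p_global_outage), which is assumed
    to take down every machine simultaneously.
    """
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    p_all_down_independently = p_machine_down ** n_machines
    return (1.0 - p_global_outage) * (1.0 - p_all_down_independently)


if __name__ == "__main__":
    # Illustrative inputs only: 5% independent failure, 1% global incident.
    for n in (1, 2, 4, 8):
        print(n, round(p_app_up(n, 0.05, 0.01), 6))
```

In this model, adding machines drives the independent term toward zero quickly, but availability can never exceed `1 - p_global_outage`. Real global incidents are often only partially correlated (some machines are hit less than others), so more machines can still help somewhat more than this pessimistic bound suggests.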

Ultimately though, it’s on us to provide a reliable platform that minimizes the frequency and duration of any kind of correlated, global failures, and I’m sorry we let you down here. We’re working to examine these incidents down to the tiniest detail and fortify every single weak link they exposed, so that we can continue building a more reliable platform moving forward.
