fly.io status: apps unreachable, unable to deploy

Again, I’m not defending Fly.io here, just saying we should all chill, bros.
AWS has been a lot more stable this past year, but over the last decade, I recall lots of production outages.

Again, I’m not defending Fly.io here

This is what you’ve been doing, however. None of the incidents in your link are full-blown outages across all regions (at least 16, as mentioned somewhere above), which is what happened here.

1 Like

Sure. Let’s not forget AWS us-east-1: if that goes down, it affects other regions too… which has happened a few times in my experience, possibly more.

Fly has already promised, in the Infrastructure Log, postmortems of all incidents resulting in “degraded service on our platform”.

The catch is that the log only comes out once a week. Moreover, since this event happened on a Sunday, it might not be covered until the week of the 10th (i.e., 7 days from now).

If you look at older entries, however, you’ll see that they have at least a paragraph even for smaller, less-noticed disruptions, like “Edge Capacity Saturated in DFW” and “Long Response Times From Metrics Cluster”.


This doesn’t mean that a quick, explicit note to that effect, bearing the Purple Balloon of Authority, wouldn’t be a good idea.

I broadly agree with @khuezy, though, that further repetition might veer into the counterproductive. :dove:

2 Likes

Nobody is going to be waiting a week for information on what happened here. This was a major incident for us, we’re doing a full-blown internal postmortem on what happened, it’s taking some time, and we’ll post up when we’re confident about it. This was a software bug with a complicated manifestation; it wasn’t random components throwing a rod or not handling byzantine network failure well.

4 Likes

(Note: this is an incomplete, preliminary summary in the interest of sharing some details in this thread; full incident reviews take time, and we’re doing one, but we don’t want you all to wait any longer for the basics.)

On Sunday we had two things go wrong simultaneously, both impacting our global routing layer. The “core” outage lasted almost 45 minutes, with a related 3-minute interruption about an hour later.

The outage began at 19:19 UTC. We had an infra responder within minutes, and an incident response team assembled by 19:25 UTC.

The first issue was a software bug in fly-proxy, the (Rust-based) core of our Anycast routing layer. A week earlier, we’d introduced new fly-proxy code to handle an upcoming feature; on Sunday, an app update with a particular service configuration suddenly triggered a nearly fleet-wide deadlock in the proxy, stopping ~85% of traffic. The bug was a somewhat complicated combination of Tokio concurrency code and distributed state updates. Our investigation ruled out other possible network and load issues before zeroing in on connections being refused by deadlocked proxy instances. The proxy is designed to be easily restarted, and restarting it fleet-wide eventually cleared the bug around 20:00-20:05 UTC.
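To give a flavor of the failure class without getting ahead of the incident review, here is a deliberately simplified, hypothetical sketch (not the actual fly-proxy code, and not the actual bug): two Tokio tasks that take shared-state locks in opposite orders, so a state update arriving at the wrong moment wedges both sides for good.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    // Two pieces of shared state; names are illustrative only.
    let routing = Arc::new(Mutex::new(()));
    let config = Arc::new(Mutex::new(()));

    // Task A: a request path that locks `routing`, then `config`.
    let (r1, c1) = (routing.clone(), config.clone());
    let a = tokio::spawn(async move {
        let _r = r1.lock().await;
        tokio::time::sleep(Duration::from_millis(50)).await; // widen the race window
        let _c = c1.lock().await; // waits forever once task B holds `config`
    });

    // Task B: a state-update path that locks `config`, then `routing` (opposite order).
    let (r2, c2) = (routing.clone(), config.clone());
    let b = tokio::spawn(async move {
        let _c = c2.lock().await;
        tokio::time::sleep(Duration::from_millis(50)).await;
        let _r = r2.lock().await; // waits forever once task A holds `routing`
    });

    // Classic lock-order inversion, now async: neither task ever completes.
    let _ = tokio::join!(a, b);
    println!("never reached: both tasks are deadlocked");
}
```

When something like that happens inside a connection-handling path, the external symptom matches what we described above: connections refused by deadlocked proxy instances, cleared by restarting the proxy.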

We’ll have more in our full public incident review. It sucked.

The second issue was an interaction between fly-proxy and the (relatively recent) Linux tcp_migrate_req sysctl. It was triggered by the fleet-wide proxy restarts and appeared for the first time during this incident. It impacted only a small portion (~1-2%) of our fleet and was resolved in the hours after the initial incident.
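For anyone unfamiliar with the sysctl: net.ipv4.tcp_migrate_req (added in Linux 5.14) controls whether, when a listener in an SO_REUSEPORT group closes, its pending and not-yet-accepted sockets get migrated to a sibling listener instead of being dropped, which is exactly the kind of path a fleet-wide proxy restart exercises. A trivial way to check the setting on a host (a generic snippet, not part of fly-proxy):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Read the current value of net.ipv4.tcp_migrate_req from procfs.
    // 0 = disabled (default), 1 = migrate pending/child sockets to another
    // SO_REUSEPORT listener when their original listener closes.
    let raw = fs::read_to_string("/proc/sys/net/ipv4/tcp_migrate_req")?;
    println!("net.ipv4.tcp_migrate_req = {}", raw.trim());
    Ok(())
}
```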

We’ll post a more complete review, including learnings and what we’re doing moving forward, on the Infra Log in the next couple of days. The infra team reads these threads!

24 Likes

@wjordan thank you for the update, much appreciated

I’ve been watching Infra Log closely for an update, thank you for posting this :pray: