PSA: Postmortem for the Nov 25 Outage

Updates pad into the Infrastructure Log on little cat feet, :black_cat:, which is perhaps the overall timbre of that part of the Fly.io blog / information machinery. (It doesn’t have an RSS feed, for example.) Anyway, this latest one does contain the widely anticipated postmortem for the November 25 global API and deployments outage:

https://fly.io/infra-log/#november-25-outage-postmortem

(Typically, these quietly appear weekly, covering “a superset both of our status page events and of customer-impacting events on the platform”, although occasionally two consecutive installments get coalesced.)


Appendix: Original forum threads and preliminary status notes of the time…

*Not strictly the same incident, apparently.

4 Likes

Thank you for posting the PSA! :pray:

I’m guessing I’m not the only one on here with a desire to talk about the postmortem, so I’ll throw a post below. If you’d prefer I start another topic tho, rather than hijack this one, please say so and I’ll migrate!

1 Like

:sparkles: Thoughts :sparkles:

IMO the postmortem is well-written and details how specific pieces of software that power Fly’s platform affected other specific pieces of software and hardware that also power Fly’s platform.

I also appreciate the transparency, detail, and clarity in the Forward-Looking Statements section.


That said, I can’t shake this feeling that I’m being… gaslit?

This was a global outage. Switches were saturated. The API was down. Customer applications all over the world went down and folks couldn’t do anything about it for an entire business day.

And here’s what the status page looked like (link):

[screenshot: the status page during the incident]

?

Is it just me, or does it seem like the postmortem and status page are describing two entirely different realities?


My team has the entry-level paid support plan now (we activated it on the 27th after things cooled down), and our first (and so far only) engagement was excellent. Opening the ticket was easy, and Daniel responded exactly 30 minutes after submission with a clear, friendly, and helpful resolution. No notes.

So next time there’s a global outage I’ll open a ticket instead of smashing refresh on the status page. And maybe (hopefully) that’ll result in more clarity while the world is burning.

And if it does, I guess I’ll pay it forward by hopping over here into the forum to keep y’all updated.

3 Likes

The fly status page

[screenshots]

1 Like

If you’re concerned about the ‘uptime’ graphs in the latter two screenshots, I can give some context: at some point in recent months, we added a couple of additional regions to our StatusPage ‘components’ set, which inadvertently re-activated the ‘uptime showcase’ StatusPage feature (I guess it was enabled by default) on the ‘Regional Availability’ component group for those newly-added regions. As I’ve written about before, we don’t use this ‘uptime’ feature because our global status updates have nuanced impact and don’t accurately correlate to uptime as a neat and tidy metric. (We want to publish actual detailed availability metrics straight from our platform and hope to eventually! We’re still working on it.) We disabled the feature on those components when we noticed the issue during this incident.

Hope this clarifies what you saw on the page, sorry this caused extra confusion for you during the incident.

2 Likes

If you’re concerned about the ‘uptime’ graphs in the latter two screenshots, I can give some context

Thanks, ~~Jordan~~ Will (sorry!)! That bit of context is helpful and appreciated.

We want to publish actual detailed availability metrics straight from our platform and hope to eventually! We’re still working on it.

Very much looking forward to that :crossed_fingers:

Hope this clarifies what you saw on the page, sorry this caused extra confusion for you during the incident.

Let’s talk about this one some more?

Briefly, let’s establish shared context by recapping some truths:

  1. The status page confused many new customers and several tenured ones (including myself, if >= 10 months on the platform is the bar :wink:). By my count, at least 7 new accounts whose first post was in that topic specifically mention confusion from unclear communication.
  2. Regional Availability (forget about the uptime showcase) was marked Operational and not Degraded while multiple (many?) regions were failing to serve customer traffic. Indeed, https://fly.io itself served 504s during the incident. I might misunderstand what that status is supposed to communicate.
  3. Deployments was marked Operational and not Degraded while many customers (including myself) were unable to deploy. Given Machines API was Degraded, weren’t bluegreen deployments, by definition, also at least Degraded?
  4. Remote Builds was marked Operational and not Degraded while many customers (including myself) were unable to build. Given Machines API was Degraded, weren’t apps that needed to create new builder machines, by definition, also at least Degraded?
  5. Fly Machine Image Registry (1 and 2) were marked Operational and not Degraded while customers (including myself) were unable to push new images. I observed 504 timeouts both while authenticating and while pushing.
  6. While not listed under the Extensions component, Sentry account creation is initiated via flyctl ext sentry *. I was unable to access Sentry during the incident because the authentication flow consistently failed due to timeouts. Extensions was marked Operational and not Degraded.

My question is this, @wjordan: given that so much of Fly’s platform has Corrosion and the Machines API (and the GraphQL API) as single points of failure, is it fair to say that many other components, including those listed above, were misrepresented as Operational during the incident, when they were also exhibiting degraded performance and instability?

Thank you for taking the time to hop into this thread. I know many customers, myself included, appreciate your time and consideration in helping us understand how Fly’s incidents have impacted our businesses and how we can better adapt and build resilience to these challenges. After all, every one of Fly’s customers (including myself) is rooting for the platform to succeed.

2 Likes

@wjordan is there any response to this?

Also a little saddening that the Postmortems and Infra Logs in general are no longer posted to the forum.

I’m glad Fly support has gotten a lot better; when we’ve needed it in the past (maybe two years ago?) it was appalling.

I don’t like the use of the word “degraded” on the status page - to me, it implies that things are still kind of working, just not too well. Things weren’t working. The API was completely down for hours. This isn’t yellow-text “degraded”; this should be red-text “offline”.

2 Likes

In hindsight, I would agree that in the latter phase of the incident, once the GraphQL API backend became unavailable (‘Incident 2’ as described in the Infra Log), we could have additionally marked ‘Deployments’ as affected. That would have been consistent with some past GraphQL API incidents, where we marked a combination of ‘Dashboard, Machines API, and Deployments’ (example).

Even beyond the confusion caused in this case by overlapping incidents, we haven’t been consistent about marking some of the more ambiguous components (like ‘Deployments’). We simplified our status page components a few months back, but I’ll take this as feedback that further work to simplify and clarify incident components would be helpful.

In general, we won’t typically mark all downstream components when an upstream service has an issue, just the most directly relevant ones. For example, we wouldn’t mark all 35 “Regional Availability” components during an outage with global impact (we would mark the ‘Customer Applications’ component). If we released a flyctl version that breaks authentication, we would probably mark the ‘Deployments’ component and not every other service a customer might be using flyctl to interact with.

I’m sorry you (and any others) feel this way, and perhaps more than your specific concerns I want to address this general one. I think part of this feeling might be mismatched expectations around our status page: we use it as a tool to quickly communicate timely, relevant, but fully-human (and thus flawed) updates with details on incidents impacting a broad set of customers and apps. It’s never going to be a perfect, accurate accounting of uptime and availability across our system; we hope to eventually fill this gap by providing system metrics directly. I think part of this comes from Statuspage’s design not matching our use, and we have been looking for a tool better aligned with our purpose here.

I imagine the other part of this feeling is just frustration over the incident itself. Yes, this incident sucked, I completely agree. It was also an extremely confusing sequence of overlapping events with extremely unclear customer impact, all of which made it incredibly difficult for us to communicate clearly about in the moment, and this frustrated us as much as it did you. We gave it our best at the time, and then gave it sustained, honest reflection for weeks afterwards, both internally and in our public review. I hope you can appreciate these efforts as more than just status page marketing spin.

3 Likes

I share a lot of @kyleatcausadix’s feedback on this (thank you for taking the time to write it), and thank you @wjordan for responding.

There’s one thing I think Fly can do better that long predates this issue, and I was sad to see it was not part of the postmortem “next steps”. It’s with respect to this:

Use probers. Specifically, please run deployments continuously and post the status to the status page.

Empirically, over several years here, deploys failing for one reason or another have been the number-one failure mode we’ve experienced. An automated matrix of (date, deploy success/fail) x (region) would do far more to explain the state of the world than much else on the status page - and it wouldn’t put additional statuspage/stakeholder-management burden on the on-call person during an outage.

(You can extend the concept to other kinds of probers too. I just wanna see more public transparency & accountability around deploys. If we can’t deploy when we need to, it’s as bad as servers being down.)
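
To make that concrete, here’s a rough sketch of the kind of prober I mean. To be clear, this is hypothetical, not anything Fly runs: the probe-app naming, region list, and cadence are all invented, and it just shells out to the real flyctl deploy from a directory holding a throwaway app’s fly.toml.

```go
// deploy_probe.go: hypothetical deploy-prober sketch. Assumes one
// disposable probe app per region and that this runs from a directory
// containing that app's fly.toml. `flyctl deploy` and its `-a` app
// flag are real; everything else here is invented for illustration.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

var regions = []string{"iad", "lhr", "syd"} // sample regions; pick your own

func probe(region string) {
	app := "deploy-probe-" + region // hypothetical per-region app naming
	start := time.Now()
	out, err := exec.Command("flyctl", "deploy", "-a", app).CombinedOutput()
	// Record (timestamp, region, ok, duration); printing stands in for
	// publishing to whatever backs the status page.
	fmt.Printf("%s region=%s ok=%v took=%s\n",
		start.Format(time.RFC3339), region, err == nil,
		time.Since(start).Round(time.Second))
	if err != nil {
		fmt.Printf("  flyctl output: %s\n", out)
	}
}

func main() {
	for {
		for _, r := range regions {
			probe(r)
		}
		time.Sleep(15 * time.Minute) // cadence is arbitrary
	}
}
```

Each (region, success/fail, duration) datapoint is exactly one cell of the matrix above, and none of it needs a human in the loop during an incident.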

Thanks!

2 Likes

Hm… As a small correction, one does actually exist, although it’s not linked at all from the page that everything points to…

https://fly.io/infra-log/feed.xml

(I discovered this by accident, while experimenting with manually tweaking the calendar images’ URLs.)
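
For anyone who’d rather not poll the page by hand, here’s a minimal sketch of reading that feed. I’m assuming standard RSS item fields (title/link/pubDate); if it turns out to be Atom, the tags would need adjusting.

```go
// feedwatch.go: minimal sketch of reading the infra-log feed.
// The struct tags assume a standard RSS 2.0 layout, which I haven't
// verified against the actual feed.
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

type rss struct {
	Channel struct {
		Items []struct {
			Title   string `xml:"title"`
			Link    string `xml:"link"`
			PubDate string `xml:"pubDate"`
		} `xml:"item"`
	} `xml:"channel"`
}

func main() {
	resp, err := http.Get("https://fly.io/infra-log/feed.xml")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var feed rss
	if err := xml.NewDecoder(resp.Body).Decode(&feed); err != nil {
		panic(err)
	}
	for _, item := range feed.Channel.Items {
		fmt.Printf("%s  %s\n    %s\n", item.PubDate, item.Title, item.Link)
	}
}
```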

And, to answer @ktosiek’s question from last month, there are also (very) non-obvious individual pages for each week’s updates, which is convenient!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.