PSA: Postmortem for the Nov 25 Outage

kyleatcausadix · December 6, 2024, 5:06pm

If you’re concerned about the ‘uptime’ graphs in the latter two screenshots, I can give some context

Thanks, ~~Jordan~~ Will (sorry!)! That bit of context is helpful and appreciated.

We want to publish actual detailed availability metrics straight from our platform and hope to eventually! We’re still working on it.

Very much looking forward to that

Hope this clarifies what you saw on the page, sorry this caused extra confusion for you during the incident.

Let’s talk about this one some more?

Briefly, let’s establish shared context by recapping some truths:

- The status page confused many new customers and several tenured ones (including myself, if >= 10 months on the platform is the bar). By my count, at least 7 new accounts with a first post in that topic specifically mention confusion from unclear communication.
- Regional availability (forget about the uptime showcase) was marked Operational and not Degraded while multiple (many?) regions were failing to service customer traffic. Indeed, https://fly.io itself served 504s during the incident. I might misunderstand what that status is supposed to communicate.
- Deployments was marked Operational and not Degraded while many customers (including myself) were unable to deploy. Given Machines API was Degraded, weren’t bluegreen deployments, by definition, also at least Degraded?
- Remote Builds was marked Operational and not Degraded while many customers (including myself) were unable to ~~deploy~~ build. Given Machines API was Degraded, weren’t apps that needed to create new builder machines, by definition, also at least Degraded?
- Fly Machine Image Registry (1 and 2) were marked Operational and not Degraded while customers (including myself) were unable to push new images. I observed 504 timeouts both while authenticating and while pushing.
- While not listed under the Extensions component, Sentry account creation is initiated via `flyctl ext sentry *`. I was unable to access Sentry during the incident because the authentication flow consistently failed due to timeouts. Extensions was marked Operational and not Degraded.

My question is this, @wjordan: given that so much of Fly’s platform has a single point of failure (SPOF) in Corrosion, the Machines API, and the GraphQL API, is it fair to say that many other components, including those listed above, were misrepresented as Operational during the incident when they were also exhibiting degraded performance and instability?
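
To make the question concrete, here is a minimal sketch of the rule I have in mind: a component is reported as at least Degraded whenever anything it depends on is not Operational. The component names come from the status page, but the dependency edges are my own guess based on the symptoms above, not Fly's actual architecture, and `effective_status` is a hypothetical helper, not anything from Fly's tooling.

```python
from enum import IntEnum


class Status(IntEnum):
    OPERATIONAL = 0
    DEGRADED = 1
    OUTAGE = 2


# Hypothetical dependency edges, inferred from the symptoms described above.
DEPENDS_ON = {
    "Deployments": ["Machines API"],
    "Remote Builds": ["Machines API"],
    "Machines API": ["Corrosion"],
}


def effective_status(component, reported, depends_on=DEPENDS_ON, _seen=frozenset()):
    """Return the reported status, raised to at least DEGRADED if any
    direct or transitive dependency is not OPERATIONAL."""
    own = reported.get(component, Status.OPERATIONAL)
    if component in _seen:  # guard against cycles in the dependency map
        return own
    seen = _seen | {component}
    for dep in depends_on.get(component, []):
        if effective_status(dep, reported, depends_on, seen) != Status.OPERATIONAL:
            own = max(own, Status.DEGRADED)
    return own


# What the status page showed, as I remember it.
reported = {
    "Machines API": Status.DEGRADED,
    "Deployments": Status.OPERATIONAL,
    "Remote Builds": Status.OPERATIONAL,
}

for name in ("Deployments", "Remote Builds"):
    print(name, effective_status(name, reported).name)
```

Under that assumed map, both Deployments and Remote Builds come out as Degraded as soon as Machines API does, which is the status I would have expected to see during the incident.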

Thank you for taking the time to hop into this thread. I know many customers, myself included, appreciate your time and consideration as you help us understand how Fly’s incidents have impacted our businesses and how we can better adapt and build resilience to these challenges. After all, every one of Fly’s customers (including myself) is rooting for the platform to succeed.