With hindsight, I agree that in the latter phase of the incident, once the GraphQL API backend became unavailable (‘Incident 2’ as described in the Infra Log), we could have additionally marked ‘Deployments’ as affected. That would have been consistent with some past GraphQL API incidents where we marked a combination of ‘Dashboard, Machines API, and Deployments’ (example).
Even beyond the confusion in this case caused by overlapping incidents, we haven’t consistently marked some of the more ambiguous components (like ‘Deployments’). We simplified our status page components a few months back, but I’ll take this feedback to suggest that further work in this direction to simplify and clarify incident components would be helpful.
In general, we typically won’t mark all downstream components when an upstream service has an issue, just the most directly relevant ones. For example, we wouldn’t mark all 35 “Regional Availability” components during an outage with global impact (we would mark the ‘Customer Applications’ component). If we released a flyctl version that broke authentication, we would probably mark the ‘Deployments’ component and not every other service a customer might be using flyctl to interact with.
I’m sorry you (and any others) feel this way, and perhaps more than your specific concerns I want to address this general one. I think part of this feeling might be mismatched expectations around our status page: we use it as a tool to quickly communicate timely, relevant, but fully human (and thus flawed) updates with details on incidents impacting a broad set of customers and apps. It’s never going to be a perfect, accurate accounting of uptime and availability across our system; we hope to eventually fill this gap by providing system metrics directly. I think part of this comes from Statuspage’s design not matching our use, and we have been looking for a tool better aligned with our purpose here.
I imagine the other part of this feeling is just frustration over the incident itself. Yes, this incident sucked, I completely agree. It was also an extremely confusing sequence of overlapping events with extremely unclear customer impact, all of which made it incredibly difficult for us to communicate clearly in the moment, and this frustrated us as much as it did you. We gave it our best at the time, and then gave it sustained, honest reflection for weeks afterwards, both internally and in our public review. I hope you can appreciate these efforts as more than just status page marketing spin.