PSA: Postmortem for the Nov 25 Outage

mikey · December 10, 2024, 4:20am

I share a lot of @kyleatcausadix’s feedback on this (thank you for taking the time to write it), and thank you @wjordan for responding.

There’s one thing I think Fly can do better that long predates this issue, and I was sad to see it was not part of the postmortem “next steps”. It’s with respect to this:

Use probers. Specifically, please run deployments continuously and post the status to the status page.

Empirically, over several years here, failed deploys (for one reason or another) have been the number 1 failure mode we’ve experienced. An automated matrix of (date, deploy success/fail) x (region) would do far more to explain the state of the world than almost anything else on the status page, and it wouldn’t put additional statuspage/stakeholder management burden on the on-call person during an outage.
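
To make the matrix idea concrete, here is a rough sketch of the kind of prober I have in mind. It is only an illustration under my own assumptions: `deploy` is a hypothetical callable that attempts a small canary deploy in one region and returns True on success (how it actually deploys is up to whoever runs it, and it is not a Fly API), and the names `run_probe` and `render` are made up for this example.

```python
from datetime import date
from typing import Callable, Dict, Iterable, List

# Hypothetical: attempts a canary deploy in the given region and returns
# True on success. The real mechanism is deliberately left unspecified.
DeployFn = Callable[[str], bool]
Matrix = Dict[date, Dict[str, bool]]


def run_probe(regions: Iterable[str], deploy: DeployFn,
              day: date, matrix: Matrix) -> None:
    """Attempt one deploy per region and record pass/fail for `day`."""
    row = matrix.setdefault(day, {})
    for region in regions:
        try:
            ok = bool(deploy(region))
        except Exception:
            # A deploy attempt that raises is recorded as a failure.
            ok = False
        # If several probes run on the same day, one failure marks the
        # whole day as failed for that region.
        row[region] = row.get(region, True) and ok


def render(matrix: Matrix, regions: List[str]) -> str:
    """Render the (date) x (region) matrix as a plain-text table."""
    def cell(day: date, region: str) -> str:
        result = matrix[day].get(region)
        if result is None:
            return "?"  # no probe recorded for this region on this day
        return "ok" if result else "FAIL"

    lines = [f"{'date':<10} " + " ".join(f"{r:>5}" for r in regions)]
    for day in sorted(matrix):
        cells = " ".join(f"{cell(day, r):>5}" for r in regions)
        lines.append(f"{day.isoformat()} {cells}")
    return "\n".join(lines)
```

Run `run_probe` on a schedule with a list of region codes, then publish the output of `render` (or the raw matrix) to the status page. Since it is fully automated, nobody on call has to touch it during an incident.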

(You can extend the concept to other kinds of probers too. I just wanna see more public transparency & accountability around deploys. If we can’t deploy when we need to, it’s as bad as servers being down.)

Thanks!