I’m glad to read that you seem to have figured out the technical side of this recurring major issue.
But this part is what concerns me the most to be honest:
If I understand correctly, here is the sequence of events that occurred:
- An alarm did trigger internally.
- The monitoring team decided to ignore it because it was the weekend — doesn’t this team have people dedicated to handling weekend incidents? I would assume weekends are the riskiest time of the week, so a fully managed hosting service would have a team dedicated to that timeframe.
- The engineering team decided not to look into it any further until coming back to the office on Monday.
- The support team decided to ignore a support topic explicitly titled “Two major outages in two days, and yet no status updates?”
That’s a lot of teams deciding to ignore various sources of reporting, even though the internal alarm system did actually trigger.
I have been relying on Fly.io for years and things have been going great most of the time, but this makes me wonder if that was mostly luck. It took three major outages in just eight days for someone to actually start looking into it.
I understand that the geographic aspect of the issue made it harder to detect and investigate, but geo-distribution is precisely part of Fly.io’s offering, so I would have thought you had great tooling and experience internally to address this kind of issue.