Proxy Fairness, Take 2 (Or: Why Did I Have Trouble Connecting to Fly in Europe)

I’m glad to read that you have probably figured out the technical side of this recurring major issue.

But this part is what concerns me the most to be honest:

If I understand correctly, here is the sequence of events that occurred:

  1. An alarm did trigger internally.
  2. The monitoring team decided to ignore it because it was the weekend — doesn’t this team have people dedicated to handling weekend incidents? I would assume weekends are the most risky time of the week, so a fully managed hosting service would have a team dedicated to that timeframe.
  3. The engineering team decided not to look into it any further after coming back to the office on Monday.
  4. The support team decided to ignore a support topic explicitly titled “Two major outages in two days, and yet no status updates?”

That’s a lot of teams deciding to ignore various sources of reporting, even though the internal alarm system did actually trigger.

I have been relying on Fly.io for years and things have been going great most of the time, but this makes me wonder if it was mostly luck. It took 3 major outages in just 8 days for someone to actually start looking into it.

I understand that the geo aspect of the issue made it harder to detect and investigate, but this is precisely part of Fly.io’s offering, so I would have thought you had great tooling and experience internally to address this kind of issue.
