Fly down?

All valid points. Things shouldn’t break as often as they do. There isn’t much clarity at all sometimes (ex). And from observation, Fly seems to have a culture of releasing quickly and not really over-engineering stuff. I’ve called on them before to be more deliberate and to respect the scale of their operations, but it isn’t all that bad either.

…when the book hits the real world sometimes you find new failure modes, software has bugs, or humans find creative mistakes. It’s also very hard to build global scale systems with zero possibility of global failure. But every time a crack is found, you learn something and do what you can to eliminate the whole class of related failure modes.

That’s a comment from a Googler on GCP’s global outage, from a 2019 Hacker News thread: “Obviously not authorized to release more details than have already been made pub...”

Speaking from personal experience, I was on the team when DynamoDB (2015) and Elasticsearch Service (2018) went down nearly globally (it was just IAD for both, but IAD is also the “primary” region, which meant a lot of other unexpected things also happened)… CloudFront also faced its own share of terrible outages over the years, and the learnings from those were distilled into an internal-only talk at the time, which was so popular within AWS that it was eventually presented at re:Invent 2016: https://youtube.com/watch?v=n8qQGLJeUYA

This stuff isn’t simple to accomplish for a team as small as Fly’s. I mean, the CEO is still replying to customer support emails and forum posts.

I’m confident that once they staff up, things will improve considerably.
