Reliability: It's Not Great

Thanks; 'twas a bit hard to read so I can only imagine.

As for tomatoes: I’ve not known a simpler compute platform (cue Rich Hickey’s Simple made Easy), and given the team’s background, pretty sure Fly will get better at stateful workloads and incident management, too. Best.

2 Likes

I appreciate this post! My SaaS company has a use case for which Fly would be a great fit.

However, this workload is very important to our business, so I’ve approached Fly with caution. I started by putting some personal projects on Fly’s (very generous) free tier. It was a mixed bag. I now suspect that I perhaps ran into the volume issue on one of my projects – I was doing something that didn’t seem dangerous/controversial, but ended up with an undeployable app and downtime.

My impression based on those experiences is that I wouldn’t be comfortable moving our business to it just yet. At least, not without having a plan to quickly cut Fly out of the loop.

This article goes a long way towards helping me believe that Fly can be, and hopefully will be, the best place for this workload someday. Thanks for the transparency.

4 Likes

As a fly.io customer I’m very happy to read this. Reliability is definitely your core value proposition in my mind, and I have experienced a few reliability issues recently (enough that I’m now very glad I didn’t migrate our production systems at $OLD_DAY_JOB to fly.io). It’s great to hear that you’re taking this seriously and have a plan in place to fix things. I’ll probably reevaluate fly’s production-readiness in ~6 months once you’ve had a chance to work things out.

I’m also super happy to hear that you’re planning to ship a fully managed Postgres. A managed data store is really THE thing I want from a cloud provider. Running applications on a platform like fly.io is convenient, but running applications on a plain linux VM isn’t all that hard either. The one thing I really don’t want to manage if I can help it is the data store where durability is critical and hard to get right without experienced ops personnel. If you can ship a managed postgres that gives me access to logical replication slots then I’ll be singing your praises to whoever will listen.

Finally, I have a request for something you haven’t mentioned: better error handling / debuggability / observability into the fly.io system. When I’ve had errors deploying to fly.io the error messages have been pretty unhelpful. I have had the generic and cryptic “Failed due to unhealthy allocations” in two separate scenarios:

  1. My app was compiled with too new a version of glibc and (presumably) crashed on startup. I would expect to get an “app crashed on startup” error message here with at least the process exit code and ideally some kind of debugging information (although I understand this is a tricky case where there might not be much available).

  2. During a brief period of downtime. I’m ok with some limited amounts of downtime, but I expect your system to tell me that it is at fault and not leave me chasing around trying to work out how I’ve managed to break
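To make item 1 concrete, here is a minimal sketch of the kind of report being asked for. It is an illustration only, not how Fly.io's deploy tooling works: the function name `describe_startup`, the `startup_window` parameter, and the message wording are all hypothetical.

```python
import subprocess
import sys


def describe_startup(cmd, startup_window=2.0):
    """Run cmd and report a startup crash with its exit code (hypothetical helper)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        # Wait only for the startup window; a healthy server keeps running past it.
        _, stderr = proc.communicate(timeout=startup_window)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed process
        return "app still running after startup window"

    if proc.returncode == 0:
        return "app exited cleanly (exit code 0)"

    # Surface the exit code plus the last line of stderr as a debugging hint.
    lines = stderr.strip().splitlines()
    last_line = lines[-1] if lines else "(no stderr output)"
    return (
        f"app crashed on startup: exit code {proc.returncode}; "
        f"last stderr line: {last_line}"
    )


if __name__ == "__main__":
    # Stand-in for an app that dies immediately, e.g. on a glibc mismatch.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(describe_startup(failing, startup_window=10.0))
```

Even this much (exit code plus the tail of stderr) would distinguish "my binary cannot start" from "the platform is having a bad moment".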

4 Likes

I’m not a huge customer of Fly.io (yet!) but I’ve loved the experience so far. I think if you guys focus on reliability over adding new capabilities or new customers for a while, you have the potential to make this service rock-solid.

Also, some of the things you mention, like volumes being tied to hosts instead of floating like EBS, is a plus from my perspective. EBS is slow and expensive. When I want to run a database, I want the volume on the machine if at all possible. I don’t need to move the disk image between machines - I need rock-solid backups of the volume or the DB (maybe even a copy as a warm standby).

I’m sure it’s a slog, but doing the grunt work step-by-step is where you create the long-term value for yourselves and your customers.

4 Likes

Thanks for this post – really appreciate the transparency :heart:. I loved the explanation of how volumes are pinned to host hardware, which I hadn’t understood (I actually want to use them more now that I know how they work!).

Just wanted to say that I’ve loved having fly as a way to deploy my projects despite the occasional hiccups.

5 Likes

really appreciate this. Especially recognizing the disparity between the “haha let’s write a fun and sorta snarky blog post about distributed systems and how we know how to do them!!!” while things are totally down or partially down. I don’t have a huge workload on Fly.io yet, but the reliability has been a pretty sore spot for me so far. Hard to think of any other service (compute or not) I’ve used that has had similar spottiness.

That being said, I think if y’all really focus everything on reliability and communication, the upside is still incredibly high. I would personally take an actually-reliable service over basically any other feature offering you’re thinking of delivering or working on right now. For example, the Postgres service / offering - hard to imagine considering it any more than I have already while the core service reliability seems to be pretty low. Just an example, but I think it would apply to any other feature — I’d ask “how can I trust XYZ thing if the basic ability to deliver / deploy a service is shaky/flaky?” Everything gets built on core trust + reliable systems. :hugs:

last thing I’ll say: would rather see actual reliability / a solid service over a blog post writing about it every single time. Actions > words and all that. Ty all and best of luck! :smile:

2 Likes

Also good example: maybe it’s just me or something local, but as I’m writing this post https://fly.io/ is sending a 502 to me. Not great :man_shrugging:

@markthethomas you’re spending enough that the plans with support emails are “free”. We’re happy to help you troubleshoot 502s. This probably isn’t us, but we’ll check anyway.

1 Like

no worries, could be “just me” but idk why the main site would be down just for me. not urgent, but I appreciate y’all reaching out :slight_smile:

Oh I misread that! A 502 from Fly.io is definitely not just you, it’s probably a bug.

1 Like

<3

Thank you so much for your openness and transparency. You’ll get through this.

2 Likes

Thank you for sharing this. I can imagine that a lot of us rushing from Heroku took you by surprise. And I can certainly feel you stabbing the problems multiple times, when the docs keep changing and a post on Community from a couple of months ago is completely out of date :sweat_smile:

But I’m very happy with Fly overall and I do wish it/you all the best and hope you’ll pull through and keep improving the service and keeping it alive and well :crossed_fingers:

2 Likes

all good!

Serious reply:

I don’t have a need for Fly currently, but I love your blog posts and the humility in this post. I hope someday I will have an opportunity to put an app on Fly.io

Less serious reply:

The problem with Corrosion is that it’s new and gossip-based consistency is a difficult problem.

I’m glad I’m not the only one having trouble with Challenge #3b: Multi-Node Broadcast · Fly Docs (implementing gossip protocol)
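For anyone unfamiliar with the term, here is a toy simulation of push-style gossip: each node that has the message forwards it to a few random peers every round. This is a sketch under simplifying assumptions (no message loss, no partitions, synchronous rounds) and says nothing about how Corrosion or the challenge's reference solution actually works; `gossip_broadcast`, `fanout`, and the other names are hypothetical.

```python
import random


def gossip_broadcast(num_nodes, fanout=2, seed=0, max_rounds=100):
    """Return the number of rounds until all nodes have the message, or None.

    Node 0 starts with the message. In each round, every informed node
    pushes it to `fanout` peers chosen uniformly at random.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    informed = {0}
    if len(informed) == num_nodes:
        return 0  # a single node needs no gossip rounds

    for round_no in range(1, max_rounds + 1):
        newly_informed = set()
        for node in informed:
            peers = [p for p in range(num_nodes) if p != node]
            for peer in rng.sample(peers, min(fanout, len(peers))):
                newly_informed.add(peer)
        informed |= newly_informed
        if len(informed) == num_nodes:
            return round_no
    return None  # did not converge within max_rounds


if __name__ == "__main__":
    print(gossip_broadcast(50))
```

The hard part in a real system is everything this leaves out: lost messages, partitions, and nodes that disagree about what they have already seen.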

2 Likes

Wow you all are incredibly kind. I’m a little overwhelmed by the response so far, thank you.

12 Likes

I really appreciate this post. I’m glad you wrote it.

I’m sorry to say, wow do I completely agree it has been rough. We were considering Fly for a significant site migration and just ended up going with AWS because I did not have much confidence in the service — even though I absolutely love the promise of it.

But good you’re tackling it and can say the truth out loud. That’s the only way it gets better.

2 Likes

Still a huge fan of Fly. You are shooting the moon and for the most part still on target. It’s still one of the most well designed and tasteful cloud offerings to date.

1 Like

Such outstanding transparency has got this #1 on Hacker News (probably bringing a new load of users :slightly_smiling_face:). That’s a good problem though.

3 Likes

Really appreciate you owning this @kurt. Much respect.

1 Like

This is one of my favorite blog posts of all time.

We’ve been on fly for a while now and have definitely had some issues here and there, but for the most part I’ve been ecstatic about our choice.

The only thing that’s been nagging me is transparency: sometimes I’ve felt like things have gone awry, and the problem hasn’t been acknowledged until after quite some time has passed. This is a massive step in the right direction. Thank you!

Rooting for you big time and proud to be a customer.

3 Likes