Thanks; 'twas a bit hard to read so I can only imagine.
As for tomatoes: I’ve not known a simpler compute platform (cue Rich Hickey’s Simple made Easy), and given the team’s background, pretty sure Fly will get better at stateful workloads and incident management, too. Best.
I appreciate this post! My SaaS company has a use case for which Fly would be a great fit.
However, this workload is very important to our business, so I’ve approached Fly with caution. I started by putting some personal projects on Fly’s (very generous) free tier. It was a mixed bag. I now suspect that I perhaps ran into the volume issue on one of my projects – I was doing something that didn’t seem dangerous/controversial, but ended up with an undeployable app and downtime.
My impression based on those experiences is that I wouldn’t be comfortable moving our business to it just yet. At least, not without having a plan to quickly cut Fly out of the loop.
This article goes a long way towards helping me believe that Fly can be, and hopefully will be, the best place for this workload someday. Thanks for the transparency.
As a fly.io customer I’m very happy to read this. Reliability is definitely your core value proposition in my mind, and I have experienced a few reliability issues recently (enough that I’m now very glad I didn’t migrate our production systems at $OLD_DAY_JOB to fly.io). It’s great to hear that you’re taking this seriously and have a plan in place to fix things. I’ll probably reevaluate fly’s production-readiness in ~6 months once you’ve had a chance to work things out.
I’m also super happy to hear that you’re planning to ship a fully managed Postgres. A managed data store is really THE thing I want from a cloud provider. Running applications on a platform like fly.io is convenient, but running applications on a plain linux VM isn’t all that hard either. The one thing I really don’t want to manage if I can help it is the data store where durability is critical and hard to get right without experienced ops personnel. If you can ship a managed postgres that gives me access to logical replication slots then I’ll be singing your praises to whoever will listen.
Finally, I have a request for something you haven’t mentioned: better error handling / debuggability / observability into the fly.io system. When I’ve had errors deploying to fly.io the error messages have been pretty unhelpful. I have had the generic and cryptic “Failed due to unhealthy allocations” in two separate scenarios:
My app was compiled with too new a version of glibc and (presumably) crashed on startup. I would expect to get an “app crashed on startup” error message here with at least the process exit code and ideally some kind of debugging information (although I understand this is a tricky case where there might not be much available).
During a brief period of downtime. I’m ok with some limited amounts of downtime, but I expect your system to tell me that it is at fault and not leave me chasing around trying to work out how I’ve managed to break it.
I’m not a huge customer of Fly.io (yet!) but I’ve loved the experience so far. I think if you guys focus on reliability over adding new capabilities or new customers for a while, you have the potential to make this service rock-solid.
Also, some of the things you mention, like volumes being tied to hosts instead of floating like EBS, are a plus from my perspective. EBS is slow and expensive. When I want to run a database, I want the volume on the machine if at all possible. I don’t need to move the disk image between machines - I need rock-solid backups of the volume or the DB (maybe even a copy as a warm standby).
I’m sure it’s a slog, but doing the grunt work step-by-step is where you create the long-term value for yourselves and your customers.
Thanks for this post – really appreciate the transparency. I loved the explanation of how volumes are pinned to host hardware, which I hadn’t understood (I actually want to use them more now that I know how they work!).
Just wanted to say that I’ve loved having fly as a way to deploy my projects despite the occasional hiccups.
really appreciate this. Especially recognizing the disparity between the “haha let’s write a fun and sorta snarky blog post about distributed systems and how we know how to do them!!!” while things are totally down or partially down. I don’t have a huge workload on Fly.io yet, but the reliability has been a pretty sore spot for me so far. Hard to think of any other service (compute or not) I’ve used that has had similar spottiness.
That being said, I think if y’all really focus everything on reliability and communication, the upside is still incredibly high. I would personally take an actually-reliable service over basically any other feature offering you’re thinking of delivering or working on right now. For example, the Postgres service / offering: hard to imagine considering it any more than I have already while the core service reliability seems to be pretty low. Just an example, but I think it would apply to any other feature: I’d ask “how can I trust XYZ thing if the basic ability to deliver / deploy a service is shaky/flaky?” Everything gets built on core trust + reliable systems.
last thing I’ll say: would rather see actual reliability / a solid service over a blog post writing about it every single time. Actions > words and all that. Ty all and best of luck!
Thank you for sharing this. I can imagine that a lot of us rushing from Heroku took you by surprise. And I can certainly feel you stabbing at the problems multiple times, when the docs keep changing and a couple-of-months-old post on Community is completely out of date.
But I’m very happy with Fly overall and I do wish it/you all the best and hope you’ll pull through and keep improving the service and keeping it alive and well.
I really appreciate this post. I’m glad you wrote it.
I’m sorry to say, wow do I completely agree it has been rough. We were considering Fly for a significant site migration and just ended up going with AWS because I did not have much confidence in the service, even though I absolutely love the promise of it.
But good you’re tackling it and can say the truth out loud. That’s the only way it gets better.
This is one of my favorite blog posts of all time.
We’ve been on fly for a while now and have definitely had some issues here and there, but for the most part I’ve been ecstatic about our choice.
The only thing that’s been nagging me is transparency: sometimes I’ve felt like things have gone awry, and the problem hasn’t been acknowledged until after quite some time has passed. This is a massive step in the right direction. Thank you!
Rooting for you big time and proud to be a customer.