Reliability: It's Not Great

Such outstanding transparency has got this #1 on Hacker News (probably bringing a new load of users :slightly_smiling_face:). That’s a good problem though.

3 Likes

Really appreciate you owning this @kurt. Much respect.

2 Likes

This is one of my favorite blog posts of all time.

We’ve been on fly for a while now and have definitely had some issues here and there, but for the most part I’ve been ecstatic about our choice.

The only thing that’s been nagging me is transparency. Sometimes I’ve felt like things have gone awry and the problem wasn’t acknowledged until quite some time had passed. This is a massive step in the right direction. Thank you!

Rooting for you big time and proud to be a customer.

3 Likes

Just chiming in to say my experience with Fly has been great so far. I’m not a huge production user (my app is a side project), so I know things are different if you’re running Very Important Things. But I find the DX to be great, the support docs are awesome/comprehensive, and support has been quick/helpful.

Keep going!

1 Like

I’m rooting for y’all! This was a great read. Thanks for sharing. Best of luck Team Fly!!

4 Likes

AYFKM? If you can’t handle new customers - don’t let users sign up!! All this garbage I’m reading above and I’m just trying to deploy a simple app and it’s just hanging - this sucks, I’ve wasted my time learning this new system just to realize I’ll need to go to another provider - and all you do is post on the community board wtf is this? Why don’t you email your customers and tell them your system is unreliable so we can abandon it sooner. Why would Remix recommend this garbage product!!!

Don’t take it too close to your heart. The current Fly is nearly excellent as it is, at least from a customer point of view.

The problems derive from small imperfections which start to multiply to something bigger when you grow. I know only one ultimate solution that can truly fix that - build on solid foundations. Every piece and every layer should be productized and tested by corresponding unit tests. If all building blocks are up to spec and meet the quality bar then the combined result speaks for itself. Divide and conquer. Instead of being constantly pressured by M * N problem space you aim to be in a semi-relaxed M + N realm.
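To put rough numbers on the M * N versus M + N point, here is a minimal sketch. It assumes M building blocks in one layer can each be combined with N blocks in another; the function names are made up for illustration.

```python
def combined_cases(m: int, n: int) -> int:
    """Cases to cover when every pairing must be checked as a whole."""
    return m * n


def isolated_cases(m: int, n: int) -> int:
    """Cases to cover when each block is verified on its own against a spec."""
    return m + n


# Illustrative sizes only, not measurements of any real system.
for m, n in [(5, 5), (20, 30)]:
    print(f"M={m}, N={n}: combined={combined_cases(m, n)}, isolated={isolated_cases(m, n)}")
```

The gap widens quickly as either layer grows, which is the argument for testing each piece to spec.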

But there is a catch. Building on solid foundations means somewhat higher R&D costs. A lot of young businesses are not content with that, thinking that the problems will somehow go away by themselves. The truth is, they probably won’t. Instead, the problems tend to accumulate and multiply. As problems pile up, operating costs tend to increase way beyond the costs of a proper R&D.

Also, we love the simplicity of the current Fly. So please do not prioritize new features too aggressively. Instead, consider using the resources to master the existing functionality.

2 Likes

I raised some issues with support, like the dashboard sometimes not working or our apps not being able to send requests to outside domains. The biggest problem with Fly is that they chose to close support for those who are not paying at least $29. As a startup founder I understand this move, but it never works well when you have rough edges in your product. If you have AWS-like stability, then please go ahead and close your support, but until then keep your ears open.

Your post is just about what the problems are. Great, we know that, but it would be even better if you could lay out a plan to tackle them, and tackle them fast. We are seriously considering making Fly our primary infra (we love the simplicity and global deployments), but these issues that keep popping up randomly make it very hard to trust Fly. Hope you do something and do it quickly.

If I were you, I would scrap plans for Postgres (RDS is just too good), LiteFS, using Phoenix etc. and focus on just giving users rock-solid infra. I know your LiteFS stuff is getting a ton of attention, but that’s really not the core business you are in right now.

Btw, thanks for acknowledging that things are not okay (many cloud providers never do).

3 Likes

Since you offered…

throws squishy tomatoes

But in all seriousness, it’s been an amazing journey over the past two years growing while also watching fly grow as well.

We know that we’re nowhere near your biggest customer, but the personalized support we received in the early days was superb and helped us get up and running on fly with ease. While support has been slower lately, I do understand the growing pains.

Communication has been slow recently, but it’s great to see you acknowledge the shortcomings and work towards improving them.

There’s plenty to love about fly. With 85% of our production infrastructure on fly, we’re just waiting for you to offer the features for the other 15%.

1 Like

Thanks for the details. It sounds scary, but I hope you guys live long enough to see the other side of the tunnel!

Some thoughts from someone who is preparing to deploy on the fly.io platform soon (I will be hosting a poker app) -

I’ve seen reliability issues frequently during development. And every time I come across something, I wonder if I should post in the forums, or just assume that this is bound to happen and build around it. I’ve mostly chosen the latter, because I can imagine that a lot of these issues are probably already in your todo list, and will eventually get addressed.

The downside of that is, I have been spending a non-trivial amount of time building a mini nomad within the backend (to orchestrate fly machines, and deal with unexpected failures). This makes me uncomfortable because I am not very experienced with this domain. However, one piece of feedback: if you can provide better APIs to listen to machine events for an app, or at least query them with a 10s level of granularity, it will be a great help. I understand that this may be a very niche use case, but it would drastically improve an application’s ability to react to fly’s infra failures, should the application choose to do that.
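For anyone wondering what "query them with a 10s level of granularity" could look like on the application side, here is a minimal sketch, not anything Fly provides. It assumes a hypothetical `fetch_states` callable that returns a `{machine_id: state}` snapshot however you obtain it; the poller only diffs consecutive snapshots and turns the differences into events.

```python
import time
from typing import Callable, Dict, Iterator, List, NamedTuple, Optional


class MachineEvent(NamedTuple):
    machine_id: str
    old_state: Optional[str]  # None means the machine was not seen before
    new_state: Optional[str]  # None means the machine disappeared


def diff_states(previous: Dict[str, str], current: Dict[str, str]) -> List[MachineEvent]:
    """Compare two snapshots and report every machine whose state changed."""
    events = []
    for machine_id in sorted(set(previous) | set(current)):
        old = previous.get(machine_id)
        new = current.get(machine_id)
        if old != new:
            events.append(MachineEvent(machine_id, old, new))
    return events


def watch(
    fetch_states: Callable[[], Dict[str, str]],  # hypothetical: supply your own
    interval_s: float = 10.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[MachineEvent]:
    """Poll fetch_states every interval_s seconds and yield state changes."""
    previous = fetch_states()
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_s)
        current = fetch_states()
        yield from diff_states(previous, current)
        previous = current
        polls += 1
```

Polling like this will miss any transition that starts and ends between two snapshots, which is why a real event stream would still be the better option.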

Good luck to you all. Hopefully I won’t regret launching on top of fly :vulcan_salute:

PS. there’s no other offering that is offering what you are, and I have really enjoyed working with fly. So I echo the sentiments in this thread - if only your sandwiches held firm like grilled sandwiches.

2 Likes

I appreciate the honesty. I’ve been hesitant to move existing products to Fly because of the reliability issues I’ve seen, but I love spinning up prototypes and small apps on it.

I think this is just a necessary part of maturing as a company. I hope you pull through, get the reliability rock-solid, and I can see myself being a customer for a long time. Your core values sound great.

1 Like

What an awesomely honest post. Hang in there Kurt & team. Stay on target :wink:

Posts like this are a great reminder that we’re all human and all the tomato throwing in the world (no matter how great your aim or how fun), won’t mend anything.

I love what you’re doing and your learnings will no doubt go into some study in the near future.

If you ever wanted to do a podcast on built.fm, let me know. Would love to dig in and chat!

I have been using fly for prototypes so nothing critical (yet). The reliability issues were annoying at times and I am very glad to read that post. The fly DX is superb and I am sure reliability will catch up with time :slight_smile:

Looking forward to eventually moving all my stuff to fly!

I believe in you guys and wish you the strongest wind behind your wings.
Thank you for being open about issues.

We do need global deployments and I hope there will be enough resources for you to advance in that field.

Is Corrosion based on Serf? Is Serf the part of Consul that doesn’t work well? If not Serf, what gossip protocol?

SWIM, iirc: https://archive.is/oZQFp

Thank you for posting this Kurt! I will counter that your developer UX is actually quite good; it’s part of why I am here. That and your transparency are why I will stick around! Keep at it, and thanks for giving so many of us a place to work on spinning up our own ideas and inventions.

1 Like

So is Serf: Gossip Protocol - Serf by HashiCorp

1 Like

Thanks a lot for being so transparent sharing all those internal details.

I’m just curious about the decision to use repmgr for HA. Unfortunately this tool is known for not doing a very good job at HA, leading to unnecessary split brains. Nowadays the best-known and most battle-tested tool for Postgres HA is Patroni, so why did you choose repmgr over Patroni?

1 Like

Respect @kurt :facepunch:

This stuff is hard

3 Likes