Reliability: It's Not Great

nicoburns · March 6, 2023, 7:27pm

As a fly.io customer I’m very happy to read this. Reliability is definitely your core value proposition in my mind, and I have experienced a few reliability issues recently (enough that I’m now very glad I didn’t migrate our production systems at $OLD_DAY_JOB to fly.io). It’s great to hear that you’re taking this seriously and have a plan in place to fix things. I’ll probably reevaluate fly’s production-readiness in ~6 months once you’ve had a chance to work things out.

I’m also super happy to hear that you’re planning to ship a fully managed Postgres. A managed data store is really THE thing I want from a cloud provider. Running applications on a platform like fly.io is convenient, but running applications on a plain linux VM isn’t all that hard either. The one thing I really don’t want to manage if I can help it is the data store where durability is critical and hard to get right without experienced ops personnel. If you can ship a managed postgres that gives me access to logical replication slots then I’ll be singing your praises to whoever will listen.

Finally, I have a request for something you haven’t mentioned: better error handling / debuggability / observability into the fly.io system. When I’ve had errors deploying to fly.io the error messages have been pretty unhelpful. I have had the generic and cryptic “Failed due to unhealthy allocations” in two separate scenarios:

My app was compiled with too new a version of glibc and (presumably) crashed on startup. I would expect to get an “app crashed on startup” error message here with at least the process exit code and ideally some kind of debugging information (although I understand this is a tricky case where there might not be much available).
During a brief period of downtime. I’m ok with some limited amounts of downtime, but I expect your system to report that it is at fault and not leave me chasing around trying to work out how I’ve managed to break
