Reliability: It's Not Great

As a fly.io customer I’m very happy to read this. Reliability is definitely your core value proposition in my mind, and I have experienced a few reliability issues recently (enough that I’m now very glad I didn’t migrate our production systems at $OLD_DAY_JOB to fly.io). It’s great to hear that you’re taking this seriously and have a plan in place to fix things. I’ll probably reevaluate fly’s production-readiness in ~6 months once you’ve had a chance to work things out.

I’m also super happy to hear that you’re planning to ship a fully managed Postgres. A managed data store is really THE thing I want from a cloud provider. Running applications on a platform like fly.io is convenient, but running applications on a plain Linux VM isn’t all that hard either. The one thing I really don’t want to manage if I can help it is the data store, where durability is critical and hard to get right without experienced ops personnel. If you can ship a managed Postgres that gives me access to logical replication slots then I’ll be singing your praises to whoever will listen.

Finally, I have a request for something you haven’t mentioned: better error handling / debuggability / observability into the fly.io system. When I’ve had errors deploying to fly.io the error messages have been pretty unhelpful. I have had the generic and cryptic “Failed due to unhealthy allocations” in two separate scenarios:

  1. My app was compiled with too new a version of glibc and (presumably) crashed on startup. I would expect to get an “app crashed on startup” error message here with at least the process exit code and ideally some kind of debugging information (although I understand this is a tricky case where there might not be much available).

  2. During a brief period of downtime. I’m ok with some limited amounts of downtime, but I expect your system to report that it is at fault and not leave me chasing around trying to work out how I’ve managed to break it.
