Surprise management

binajmen · April 12, 2023, 9:14am

Dear Fly.io team,

I would like to continue a discussion that came up in the “Can't reach database server in FRA” thread but was never addressed.

It is about what @theo-m called “Surprise management”.

Although everything works smoothly most of the time, I’ve been taken by surprise several times, mostly by Postgres failing me for unclear reasons.

Sh*t happens, and I’m okay with the fact that I have to dig into the forum/documentation to debug and restore something that broke. That said, I strongly believe the KM generated here on how to restore service A or B when error C occurs should be extracted from the forum and put in a proper place.

But I’m less okay with the fact that I was not informed when something suddenly broke.

I would really like to never ever have the “Oh sh*t, I’m sorry, I was not aware it was broken dear Mr. Client” discussion again.

Is there a section I’m unaware of in the Dashboard to manage that?
Is it on your roadmap to improve the DX in this area?

I’m not a highly skilled DevOps engineer, yet Fly has managed to give me great tools to deploy stuff easily. However, when something breaks, it is stressful, and I feel that with proper documentation and notifications I would be more in control…

DAlperin · April 12, 2023, 8:44pm

Short answer: yes, we are working on that. We agree, it does not feel good when broken things come as a surprise.

Everything we have been doing (and continue to do) in the last month or so is in the service of reliability. Part of that is making it easier to proactively know when things break, both for you and for us. For example, we shipped an individualized status page to show you when specific host or disk failures are affecting your apps. And for Postgres specifically, we are actively working on tooling to give you better insight into the health of your PG cluster.

Internally, we continue to improve our monitoring and processes to help us catch things before they break, so we are caught by surprise way less often. We have also built out a really talented infra-ops team that is hard at work on this. That takes the pressure off the three wizards who have been single-handedly keeping our servers alive and lets them devote energy to improving the platform long term.

So the TL;DR is yes: we are working to get better at managing surprises when they come up, but we are also working hard to make sure there are fewer surprises in the first place.

binajmen · April 13, 2023, 5:06am

I’m glad to hear that! I hadn’t noticed the “fly migrate-to-v2 - postgres edition” post. It confirms you’re actively moving things forward in the Postgres area.