Short answer: yes, we are working on that. We agree, it does not feel good when broken things come as a surprise.
Everything we have been doing (and continue to do) over the last month or so is in the service of reliability. Part of that is making it easier to proactively know when things break, both for you and for us. For example, we shipped an individualized status page to show you when specific host or disk failures are affecting your apps. And for Postgres specifically, we are actively working on tooling to give you better insight into the health of your PG cluster.
Internally, we continue to improve our monitoring and processes to help us catch things before they break, so we are caught by surprise way less often. We have also built out a really talented infra-ops team that is hard at work on this. That is taking the pressure off the three wizards who have been single-handedly keeping our servers alive, and letting them devote energy to improving the platform long term.
So the TL;DR is yes: we are working to get better at managing surprises when they come up, but we are also working hard to make sure there are fewer surprises in the first place.