Postgres reliability updates and etcd shenanigans

We just finished up a week-long Postgres party. It was unexpected: we ended up doing a tremendous amount of work because some people’s databases started rejecting connections, getting wedged in an unhealthy state, and otherwise causing chaos.

First things first, we really appreciate y’all testing Postgres while it’s in beta. These kinds of problems only manifest when people put databases through the wringer, and despite some low grade whining on my part, it’s been helpful to see these things in action so we can build a more reliable service.

Postgres clusters on Fly are a little different than most DBaaS offerings. Rather than building a black box database service, we shipped a Fly app, a shared-tenant Consul service, and CLI magic to help you link apps to a database. We’ve been mostly hands off with individual databases. We have, however, watched databases as a whole to see which parts of our stack we could improve.

Consul created some problems. Our shared Consul cluster is in North America. We had a few issues with database clusters on other continents losing Consul connections and becoming read only. This is partially because high latency connections to Consul are suboptimal, but the real problem is that Stolon (the OSS project we use for coordination) is terrible at handling Consul network interruptions. And because of this terribleness, Consul would throttle Stolon’s connections. The shared Consul cluster is pretty dang busy – it serves 120 million requests per day.
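For anyone wondering what "handling network interruptions" usually means in practice, here is a minimal sketch of retrying a coordination-store call with capped exponential backoff and jitter. This is not Stolon's code and not a Consul client API; `StoreUnavailable` and `call_with_backoff` are hypothetical names made up for illustration.

```python
import random
import time


class StoreUnavailable(Exception):
    """Raised by a hypothetical store client when the connection drops."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying on StoreUnavailable with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except StoreUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            # Full jitter: sleep a random fraction of the capped exponential
            # delay so many clients do not all reconnect at the same instant.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * rng())
```

The jitter is the part that matters for a busy shared cluster: without it, every client that lost its connection retries in lockstep, which is the kind of traffic that tends to get throttled.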

Stolon has better support for etcd. We spent a couple of months testing Stolon + etcd and results were great. Stolon recovered after random connection failures and handled etcd server chaos well. We went so far as to destroy etcd server members repeatedly to see if we could make Stolon break. It did not.

So we deployed a shared etcd cluster, added etcd auth support to Stolon, and pointed a couple of new Postgres clusters at it. These worked great! Zero failures. So we rolled it out more broadly. As of late July, all new Postgres clusters were using etcd for lock coordination.

A few weeks later, we started seeing strange, intermittent errors with {"error": "etcdserver: user name is empty"} logs. The tl;dr here is that we were using etcd bearer token auth (simple tokens), and servers would occasionally lose track of tokens. Weird. The etcd docs recommend against using simple tokens in production, though; they suggest JWT tokens instead. JWT is dumb but worth a try.

We set up some JWT tests which were, once again, very good. Then we tried a few customer Postgres clusters with JWT auth. The “user name is empty” errors didn’t manifest and things seemed really smooth. So we rolled it out more broadly (you might notice a theme here). As it turns out, adding users or roles to etcd expires all previously generated tokens, which breaks Stolon. So as soon as people started creating new Postgres clusters, existing Postgres clusters encountered errors.
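To make that failure mode concrete, here is a toy model of it. It assumes only what is described above (any user or role change invalidates every token issued before it) and is not etcd's real API: `ToyAuthServer`, `ReauthClient`, and `AuthError` are hypothetical names. A client that holds one token forever breaks as soon as someone else adds a user, while a client that re-authenticates on an auth error and retries once keeps working.

```python
class AuthError(Exception):
    """Raised by the toy server when a token is no longer valid."""


class ToyAuthServer:
    """Toy stand-in for an auth-enabled store; not etcd's real API."""

    def __init__(self):
        self.revision = 1  # bumped on every user or role change
        self.data = {}

    def authenticate(self, user):
        # A token is only valid for the revision it was issued at.
        return (user, self.revision)

    def add_user(self, name):
        self.revision += 1  # invalidates every previously issued token

    def put(self, token, key, value):
        if token[1] != self.revision:
            raise AuthError("token expired")
        self.data[key] = value


class ReauthClient:
    """Client that fetches a fresh token and retries once on AuthError."""

    def __init__(self, server, user):
        self.server = server
        self.user = user
        self.token = server.authenticate(user)

    def put(self, key, value):
        try:
            self.server.put(self.token, key, value)
        except AuthError:
            # The old token was invalidated; get a new one and retry once.
            self.token = self.server.authenticate(self.user)
            self.server.put(self.token, key, value)
```

In this model, creating a new cluster's user is the `add_user` call, which is why existing clusters only started failing once new ones were created.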

You won’t be shocked to know: we’re back to Consul for new clusters. We’ve spent the last 5 days or so implementing a bunch of Consul / Stolon improvements. It seems to be working very well, so we’re betting this is the long-term path forward, potentially with per-region Consul clusters and additional Stolon tweaks.

We also discovered a bug running script checks that would exacerbate Postgres problems. When Stolon + Consul degraded, script checks would occasionally back up and end up making our init process unresponsive. We switched all the Postgres health checks to HTTP checks. HTTP checks hit app processes directly and bypass the init code path that is responsible for running script checks. If you run fly checks list you’ll see some HTTP status codes mixed in; this is why.
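If you want the same shape of check for your own app, here is a minimal sketch of an HTTP health endpoint using only the Python standard library. It is not the code our Postgres app runs; `make_health_handler`, `serve_health`, and the `is_healthy` callback are hypothetical names, and the 200 / 503 mapping is just a common convention.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_health_handler(is_healthy):
    """Build a handler class that reports is_healthy() as an HTTP status."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ok = is_healthy()
            body = b"ok\n" if ok else b"unhealthy\n"
            # 200 when healthy, 503 (service unavailable) otherwise.
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return HealthHandler


def serve_health(is_healthy, host="127.0.0.1", port=0):
    """Create (but do not start) a server; port=0 picks a free port."""
    return HTTPServer((host, port), make_health_handler(is_healthy))
```

Call `serve_forever()` on the returned server, typically in a background thread, and point an HTTP check at its port. The check then talks straight to the process, with nothing in between that can back up.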

We’re working on management options, as well. We’ve learned that most of you don’t configure alerts for when your Postgres goes unhealthy. It’s clear that “unmanaged Postgres” is useful for development, but most people who use Fly for apps they want to show people would benefit from a managed DB offering they can simply forget about.


@kurt Just wanted to say thanks, it sounds like it was a nightmare. My DB has now been up and stable for 5 days. Looks like it is fixed. Great work to you and the team.


So far, same here. We used to get one-off database connection errors if our app had been idle for several days (fixed by a refresh, but still disconcerting), and that seems to have been resolved, as well. Thanks!


Appreciate all the work you’re putting into this! Sadly I had yet another random downtime in my DB yesterday, so I wanted to clarify the best option for moving my existing DB to a cluster with Consul. Would it take a manual DB export and upload of the data? Or would setting up new volumes, as you outlined for me here, work?