Postgres database apps are crashing again

This looks similar to the last few times it happened. Database services are cycling between up and down with errors like: HTTP GET http://172.19.2.114:5500/flycheck/pg: 500 Internal Server Error Output: "[✗] proxy: context deadline exceeded". It is causing outages in our staging and prod environments.

Yep, can confirm I’m seeing it within my apps as well.

Same here, happened to me twice before and now again.

We are investigating this issue. We will keep you posted.

We are experiencing more issues with our legacy Consul cluster. We are working to bring it back up and, in the meantime, will be working to migrate more folks to our new Consul clusters.

The Consul cluster is recovering and we are seeing Postgres clusters begin to recover as well. If anyone is still experiencing issues, please let me know.

One of the shared Consul disks failed. That should have been fine, but it seems the remaining two nodes couldn’t handle the level of traffic Stolon was throwing at them. We’re expanding this Consul cluster to 5 nodes, which should let us absorb traffic from a failed node.
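As a rough illustration of why going from 3 to 5 nodes helps, here is a back-of-the-envelope sketch. It assumes load is spread evenly across the healthy nodes, which is a simplification (in Consul, writes go through the Raft leader), and the function names are made up for this example.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a Raft-style cluster with `nodes` members."""
    return nodes // 2 + 1


def failures_tolerated(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum(nodes)


def load_increase_after_failure(nodes: int, failed: int = 1) -> float:
    """Fractional increase in per-node load once `failed` nodes drop out.

    Assumption: traffic is shared evenly by whichever nodes are healthy.
    """
    healthy = nodes - failed
    if healthy <= 0:
        raise ValueError("no healthy nodes left")
    return nodes / healthy - 1.0


for n in (3, 5):
    print(
        f"{n} nodes: quorum={quorum(n)}, "
        f"tolerates {failures_tolerated(n)} failure(s), "
        f"+{load_increase_after_failure(n):.0%} load per survivor after one failure"
    )
# 3 nodes: quorum=2, tolerates 1 failure(s), +50% load per survivor after one failure
# 5 nodes: quorum=3, tolerates 2 failure(s), +25% load per survivor after one failure
```

Under that assumption, a 3-node cluster keeps quorum after one failure but each survivor takes on about half again as much traffic, while a 5-node cluster only adds about a quarter.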

We’re back. I’ve mentioned this before, but could your team work on the status site updating policy? We’ve just had a >30 minute outage impacting multiple clients and the status site didn’t get updated once. I’ve been on your side many times and recognise the stress involved in trying to recover from an outage and the desire to jump in and fix stuff, but even just a one-liner at the start, like “Some customers are experiencing database connection issues, we are investigating”, would be enough. Knowing quickly that my PagerDuty alert is due to an infrastructure outage and not an app failure would improve my life a lot.

Confirmed things are working again here, thanks!

Yes. Not that it should matter to you all, but this is on our checklist before we take Postgres out of beta. We just hired SREs and are currently getting Postgres to a spot where it can be SRE-ed (vs me and @shaun winging it).

Thanks for the updates.

I’ll be honest, over the last couple of weeks my team has been discussing where we need to draw the line and switch hosts. The number of outages we’ve experienced in the short time since going into production is concerning. We really like a lot of the features and capabilities of Fly, but we’re concerned that an outage at the wrong time during our early stages will ruin the first impressions of our new company.

Are these issues related exclusively to the semi-managed Postgres apps? If I were to bring up our own instances using fly-apps/postgres-ha, would we still have these issues?

We’re meeting tomorrow to discuss this. Unfortunately, if I don’t have a plan for how to make our setup more reliable on Fly, my recommendation is going to be for us to migrate hosts and re-evaluate Fly after there’s been some more time for stability to improve.

It’s worth keeping in mind that the Postgres service is currently in beta. I would be concerned if the core service was having outages like this but I think if you’re going to adopt a beta service, some level of downtime is to be expected.

The Postgres outages have all been specific to our Postgres setup (specifically Stolon + shared-tenant Consul for leader elections). Stolon and shared-tenant Consul are incredibly brittle. We’ve improved them a bunch, but it’s still surprising.

To get more reliable Postgres on Fly you could:

  1. Run a single Postgres database (not using our HA) with no automated leader election.
  2. Create your own Consul cluster, set the internal URL to that cluster in your ENV, and then deploy.
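For option 2, it may be worth confirming that your own Consul cluster has elected a leader before pointing Postgres at it. The following is a minimal sketch that uses Consul's `/v1/status/leader` HTTP endpoint. The `CONSUL_URL` variable name and the helper names are placeholders for this example, not the setting the Fly Postgres app actually reads.

```python
import json
import os
import urllib.request
from typing import Optional


def leader_endpoint(base_url: str) -> str:
    """Build the Consul status URL for the current Raft leader."""
    return base_url.rstrip("/") + "/v1/status/leader"


def parse_leader(body: str) -> Optional[str]:
    """Consul returns a JSON string such as "10.0.0.1:8300", or "" if there is no leader."""
    leader = json.loads(body)
    return leader if isinstance(leader, str) and leader else None


def consul_has_leader(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the cluster at `base_url` reports a leader."""
    try:
        with urllib.request.urlopen(leader_endpoint(base_url), timeout=timeout) as resp:
            return parse_leader(resp.read().decode("utf-8")) is not None
    except (OSError, ValueError):
        # Connection failures, timeouts, and malformed responses all count as "no leader".
        return False


# Example usage (CONSUL_URL is a hypothetical variable name for this sketch):
#   base = os.environ.get("CONSUL_URL", "http://127.0.0.1:8500")
#   print("leader elected" if consul_has_leader(base) else "no leader yet")
```

Running a check like this before a deploy only tells you the cluster is reachable and has a leader at that moment. It says nothing about how the cluster behaves under Stolon's traffic.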

The irony of our clustered, high availability Postgres setup causing downtime is not lost on me. :confused:

Everyone’s been very patient with us in the forums about the outages. We’re not happy with the reliability, beta or no. It should be working better than this. We also feel warm fuzzies when people defend us but we don’t want to make excuses here!

One other thing: I just noticed your Postgres DB is old. If you run fly image update, it’ll get all our stability fixes. The most recent PG builds didn’t experience the same issues during this Consul outage (we relaxed a lot of Stolon strictness checks and changed to HAProxy).

Interesting, didn’t even know that was an option. Running the update now.

I appreciate that. I get that this is new as well, and you’re doing some unique things. I totally expect things like instance restarts and occasional blips, but I also need to make sure I’m doing what’s right for our infrastructure. If nothing else, I’ll likely continue to use Fly for my hobby projects, but I need to ensure stability for something that we’re going to have paying end users on.

Regarding the “beta” tag on Postgres: where is this listed? I hadn’t actually realized that the Postgres deployments were beta until I started frequenting the forums. I just went back and checked, and it doesn’t look like it’s ever mentioned that Postgres isn’t production ready in the documentation here: Postgres on Fly or here: Multi-region PostgreSQL (fly.io).

I thought updates were done automatically on the managed instances based on the response here: Unrequested postgres upgrade - #2 by kurt. Is that not the case?

I think I noticed it on a blog post or a forum post, but I just went back and checked as well and you’re right, you can’t see it anywhere :thinking:

We stopped doing as many auto updates after your forum post and shipped the manual updater instead: Early look: PostgreSQL on Fly. We want your opinions. - #108 by shaun