Postgres database apps are crashing again

bekit · October 18, 2021, 8:37pm

Seems to be similar to the last few times this has happened. Database services are cycling between up and down with errors HTTP GET http://172.19.2.114:5500/flycheck/pg: 500 Internal Server Error Output: "[✗] proxy: context deadline exceeded". It is causing outages in our staging and prod environment.

sam · October 18, 2021, 8:45pm

Yep, can confirm I’m seeing it within my apps as well.

iiro.krankka · October 18, 2021, 8:46pm

Same here, happened to me twice before and now again.

shaun · October 18, 2021, 8:47pm

We are investigating this issue. We will keep you posted.

shaun · October 18, 2021, 8:59pm

We are experiencing more issues with our legacy consul cluster… We are working to bring it back up and will be working to migrate more folks to our new consul clusters in the meantime.

shaun · October 18, 2021, 9:21pm

The Consul cluster is recovering and we are seeing Postgres clusters begin to recover as well. If anyone is still experiencing issues, please let me know.

kurt · October 18, 2021, 9:25pm

One of the shared consul disks failed. This should be fine, but it seems like the remaining two nodes couldn’t handle the level of traffic Stolon was throwing at them. We’re expanding this consul cluster to 5 nodes, which should let us absorb traffic from a failed node.
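
As a rough illustration of why going from 3 to 5 nodes helps, here is a minimal Python sketch. It assumes a Raft-style majority quorum, which is what Consul uses. It also assumes that traffic is spread evenly across the healthy nodes, a simplification not stated in this thread. The function names are made up for illustration and are not part of any Fly or Consul tooling.

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a Raft-style cluster with `nodes` servers."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many servers can fail while a majority is still available."""
    return nodes - quorum_size(nodes)


def load_multiplier(nodes: int, failed: int = 1) -> float:
    """Per-node load after `failed` nodes drop out, relative to normal.

    Assumption (hypothetical): traffic is shared evenly by healthy nodes.
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return nodes / survivors


if __name__ == "__main__":
    for n in (3, 5):
        print(
            f"{n} nodes: quorum={quorum_size(n)}, "
            f"tolerates {tolerated_failures(n)} failure(s), "
            f"load x{load_multiplier(n):.2f} per survivor after one failure"
        )
```

Under those assumptions, losing one node out of 3 leaves each survivor carrying 1.5x its normal load, while losing one out of 5 leaves each survivor at 1.25x, and the 5-node cluster still keeps quorum after a second failure.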

sanswork · October 18, 2021, 9:31pm

We’re back. I’ve mentioned this before, but could your team work on the status site updating policy? We’ve just had a >30 minute outage impacting multiple clients and it didn’t get updated once. I’ve been on your side many times and recognise the stress involved in trying to recover from an outage and the desire to jump in and fix stuff, but even just a one-liner, “Some customers are experiencing database connection issues, we are investigating”, at the start would be enough. Knowing quickly that my PagerDuty alert is due to an infrastructure outage and not an app failure would improve my life a lot.

sam · October 18, 2021, 9:33pm

Confirmed things are working again here, thanks!

kurt · October 18, 2021, 9:47pm

Yes. Not that it should matter to you all, but this is on our checklist before we take Postgres out of beta. We just hired SREs and are currently getting Postgres to a spot where it can be SRE-ed (vs me and @shaun winging it).

bekit · October 18, 2021, 10:07pm

Thanks for the updates.

I’ll be honest, over the last couple of weeks my team has been discussing where we need to draw the line and switch hosts. The number of outages we’ve experienced in the short time since going into production is concerning. We really like a lot of the features and capabilities of Fly, but we’re concerned that an outage at the wrong time during our early stages will ruin the first impressions of our new company.

Are these issues related exclusively to the semi-managed postgres apps? If I were to bring up our own instances using fly-apps/postgres-ha, would we still have these issues?

We’re meeting tomorrow to discuss this. Unfortunately, if I don’t have a plan for how to make our setup more reliable on Fly, my recommendation is going to be for us to migrate hosts and re-evaluate Fly after there’s been some more time for stability to improve.

sam · October 18, 2021, 10:11pm

It’s worth keeping in mind that the Postgres service is currently in beta. I would be concerned if the core service was having outages like this but I think if you’re going to adopt a beta service, some level of downtime is to be expected.

kurt · October 18, 2021, 10:11pm

The Postgres-specific outages have all come down to our Postgres setup (specifically Stolon + shared tenant Consul for leader elections). Stolon and shared tenant Consul are incredibly brittle; we’ve improved it a bunch but it’s still surprising.

To get more reliable Postgres on Fly you could:

Run a single Postgres database (not using our HA) with no automated leader election.
Create your own Consul cluster, set the internal URL to that cluster in your ENV, and then deploy.

The irony of our clustered, high availability Postgres setup causing downtime is not lost on me.

kurt · October 18, 2021, 10:12pm

Everyone’s been very patient with us in the forums about the outages. We’re not happy with the reliability, beta or no. It should be working better than this. We also feel warm fuzzies when people defend us but we don’t want to make excuses here!

kurt · October 18, 2021, 10:17pm

One other thing, I just noticed your Postgres DB is old. If you run fly image update it’ll get all our stability fixes. The most recent PG builds didn’t experience the same issues during this Consul outage (we relaxed a lot of stolon strictness checks and changed to haproxy).

sam · October 18, 2021, 10:20pm

Interesting, didn’t even know that was an option. Running the update now.

bekit · October 18, 2021, 10:21pm

I appreciate that. I get that this is new as well, and you’re doing some unique things. I totally expect things like instance restarts and occasional blips, but I also need to make sure I’m doing what’s right for our infrastructure. If nothing else, I’ll likely continue to use Fly for my hobby projects, but I need to ensure stability for something that we’re going to have paying end users on.

Regarding the “beta” tag on Postgres: where is this listed? I hadn’t actually realized that the Postgres deployments were beta until I started frequenting the forums. I just went back and checked, and it doesn’t look like it’s ever mentioned that Postgres isn’t production ready in the documentation here: Postgres on Fly or here: Multi-region PostgreSQL (fly.io).

bekit · October 18, 2021, 10:25pm

I thought updates were done automatically on the managed instances based on the response here: Unrequested postgres upgrade - #2 by kurt. Is that not the case?

sam · October 18, 2021, 10:25pm

I think I noticed it on a blog post or a forum post, but I just went back and checked as well and you’re right, you can’t see it anywhere.

kurt · October 18, 2021, 10:26pm

We stopped doing as many auto updates after your forum post and shipped the manual updater instead: Early look: PostgreSQL on Fly. We want your opinions. - #108 by shaun
