Postgres app randomly not accessible by any of my apps!

Another scary, finicky experience that started in the last hour: everything seems perfectly fine with our Postgres app, but none of our other apps will connect.

The only guess is credentials, as I had to hack the DATABASE_URL to include a ?schema=public since no one has helped with this issue: Expanding on ENV variables

I had a ton of issues attaching and detaching, so I had to create a new database, which was already a bad experience to begin with. I then took the DATABASE_URL from one app and applied that same URL to all apps. I am not sure what the attach magic does, but if it creates new pg auth usernames/passwords and “rotates” them, then maybe that is what happened during a deploy to an app server?

Either way, this whole experience with internal network connections and getting apps to be able to talk to each other has been very finicky.

Issues like this one could potentially take our entire system down, which cannot happen, as we are in the ecommerce space where our clients’ clients are trying to check out 24/7 on our platform.

fly pg attach and fly pg detach almost never work. I get “Error An unknown error occured” about 95% of the time I try to use them.

So now I am in the worst possible circumstances: every app (10-15 apps) is down because PG is not accessible to any of them, yet there are no issues with the pg instance. We even restarted the instance, and nothing changed. I have no way of knowing what the issue is, nor can I fix it. Luckily we are just getting all the production stuff deployed this week, so we don’t currently have any traffic on these servers, but if this happens come Monday, we will have some VERY BAD problems. What happens when this happens at 2am? That simply cannot happen.

Something doesn’t add up here! The pg attach command just creates a db/user in the cluster and then sets the DATABASE_URL secret; it doesn’t do anything else. There’s no credential rotation like on Heroku. The attach command was designed for a one-to-one app-to-database configuration, though, and it’s buggy when running against an existing database.

The detach command does remove credentials, though. If you used a connection string from a previous attach and then ran detach, it could break other apps’ connections.

What is breaking on your apps? Are you getting particular DB connection errors?

Gotcha, makes sense.

I have never run detach on this db since creation.

I only thought to maybe attach the db again to get something working, but that failed too.

Basically I attached all the apps, then used the connection URL from the first attach, appended the public schema, and set the DATABASE_URL secret on all the apps to that same URL. Everything was fine until they all randomly dropped with “failed to connect” errors.

There is a small chance this happened around the time one of the app servers was deployed, but I wouldn’t think that would affect the db connection on all the apps?

Do you have the actual error from the apps that are saying failed to connect? They’ll usually include “authorization error” or similar if it’s a credentials issue.


edit: I found one of the logs, it’s saying it can’t reach the postgres hostname. I’m digging a little, it’s probably not related to attach or anything.

This could be an Alpine DNS error. How hard would it be to rebuild your images with Debian Slim? I would almost bet money that’ll fix your issues.

Alpine has DNS bugs that just randomly crop up, it’s infuriating.

Interesting, so which image is this that needs to change?

The app images. They need to be able to look up <database>.internal to connect.

There don’t seem to be any good fixes for Node apps that hit this problem. It comes up so much that we might need to put up all kinds of scary warnings when people deploy Alpine-based apps. The bugs just don’t get any attention: EAI_AGAIN using alpine · Issue #1030 · nodejs/docker-node · GitHub

Damn, this is very frustrating indeed.

I am wondering if this happened when we switched to the local builders, but the Dockerfiles we supply to fly deploy have not changed in the last few days.

I’m no Docker expert, that is for sure, so at the moment I am confused about whether you mean the Docker build image on Circle (I doubt it) or the app’s Docker image (i.e. the FROM line) in the Dockerfile supplied to fly.

If it is the latter, FROM node:current-alpine is what we have used since we first deployed.

What would you suggest we use instead?

I doubt this happened because you changed anything. The DNS bugs in Alpine just happen spontaneously on previously working processes. It’s brutal.

Try FROM node:current-slim. If you’re installing any additional packages, you’ll need to change anything that runs apk add to the apt-get equivalent. Feel free to post your Dockerfile here if you want us to look.
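As a rough illustration of the apk-to-apt-get change, here is a sketch. The package list (python3, make, g++) is hypothetical and not taken from any Dockerfile in this thread; package names often differ between Alpine and Debian, so each one needs to be checked individually.

```dockerfile
# Alpine original (hypothetical, shown for comparison only):
#   FROM node:current-alpine
#   RUN apk add --no-cache python3 make g++

# Debian Slim equivalent
FROM node:current-slim

# apt-get needs an explicit index update before installing; removing the
# package lists afterwards keeps the image layer small.
RUN apt-get update \
 && apt-get install -y --no-install-recommends python3 make g++ \
 && rm -rf /var/lib/apt/lists/*
```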

Wow that is brutal! Haha

```dockerfile
FROM node:current-alpine

WORKDIR /app

COPY package.json .

RUN yarn install

COPY . .

CMD ["node", "dist/server.js"]
```

This is the Dockerfile for one of the API apps that tries to connect to the pg app.

Oh yeah that one’s easy. Just FROM node:current-slim and you’re good to go.
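Applied to the Dockerfile posted above, only the base image line changes. This is a sketch that assumes the app has no native dependencies needing extra system packages:

```dockerfile
# Debian Slim base instead of Alpine, to avoid the Alpine DNS lookup failures
FROM node:current-slim

WORKDIR /app

# Install dependencies first so this layer is cached between code changes
COPY package.json .
RUN yarn install

# Copy the rest of the app, including the prebuilt dist/ output
COPY . .

CMD ["node", "dist/server.js"]
```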

I will give that a try and report back, what a strange bug!

While on this topic, I would love to clean up the attach credentials and make the change suggested in the other thread of adding the schema to the URL.

What is the best way to clean all that up and get the native DATABASE_URL back in each app, using that app’s own credentials?

A few hours ago one of my Fly apps also stopped being able to connect to PostgreSQL. This is being deployed from a preexisting Docker image that uses python:alpine3.14, and trying to build from python:slim breaks the apk commands. Is there any easy fix here?

If you run fly vm stop <id>, it might start working again (temporarily). Post your Dockerfile here and we can tell you how to replace the apk lines.

Here is the link to the Dockerfile I’m using: shynet/Dockerfile at master · milesmcc/shynet · GitHub. Thank you!

Just gave that a try and it looks like the build / release failed. I will be back online shortly, just made that last change from mobile to trigger a deploy haha.

Just tried again, with node:latest:

UNCAUGHT ERR [GraphQLError [Object]: Can't reach database server at 'better-cart-postgres.internal':'5432'

It seems there is possibly something else at play here? I am at a complete loss.

Which app is this? The last few you deployed seem to be working?

You know what, one just came back up, it might have been hitting a region with an old version still running.

I just saw it working now with FROM node:latest :tada:

How can we confirm this is not temporary?