:nxdomain errors when connecting to Postgres after upgrade

I upgraded the postgres app attached to my running Phoenix application, and from then on I have received :nxdomain errors from any new instance (including containers running my release steps).

IPv6 is correctly configured in the app; it was working fine before the upgrade.

I’ve tried restarting the Postgres app, re-attaching it to my app, and lots of fiddling in the last running app instance itself. The old instance continued to work until it eventually restarted; now my app is in a broken state, with no instances able to connect to the DB and deploys failing because the release step can’t connect.


Update: I just commented out the release command in my fly configuration, and a deploy ran successfully. My new instance can connect to the DB.

When I bring the release command back, the next deployment fails.

So the current status is: the app instance is running and able to connect to the DB; new deployments that attempt to run DB migrations fail, with :nxdomain errors in the release command.


+1 I am getting the same error on my Phoenix project. In my case, I didn’t even upgrade the Postgres app.

This looks like the release command is somehow not picking up the IPv6 settings.

In the logs, you should see something like top2.nearest.of.<dbname>.internal. Check whether fly dig top2.nearest.of.<dbname>.internal returns results.
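For reference, that check can be run both from your workstation and from inside a running VM. A sketch (where <dbname> is a placeholder for your Postgres app's name, and fdaa::3 is the internal DNS resolver address Fly documents for 6PN private networks):

```shell
# From your workstation, via the flyctl agent:
fly dig top2.nearest.of.<dbname>.internal

# From inside a VM (fly ssh console), query the internal resolver directly:
dig +short aaaa top2.nearest.of.<dbname>.internal @fdaa::3
```

If the first works but your release command still gets :nxdomain, the problem is likely on the host where the release VM is scheduled rather than in your app's config.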

Are these apps you set up with fly launch, or did you configure them manually for IPv6 with this guide? Legacy pre v1.6.3 Phoenix App · Fly Docs

When I run fly dig with my dbname, it gives back the expected IPv6 address.

One quirk is that I can’t get into any of the instances that exhibit the problem.

fly ssh console
Connecting to <address>... complete
Error error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

When I use fly ssh console -v and pick my older instance, I can get into it. When I run dig top2.nearest.of.<dbname>.internal from that instance, it gives back the expected address.

The app was set up with fly launch, then tweaked a bit following the guide.

My rel/env.sh.eex includes the following:

if grep -q fly-local-6pn /etc/hosts; then
  ip=$(grep fly-local-6pn /etc/hosts | cut -f 1)
  export ECTO_IPV6=true
  export ERL_AFLAGS="-proto_dist inet6_tcp"
  export RELEASE_DISTRIBUTION=name
  export RELEASE_NODE=$FLY_APP_NAME@$ip
fi
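For what it’s worth, the IP-extraction step in that snippet can be exercised in isolation against a sample hosts file; the fdaa: address below is an invented example value, not a real 6PN address:

```shell
# Simulate the tab-separated /etc/hosts entry Fly writes on 6PN-enabled VMs.
# The fdaa:... address is a made-up example for testing.
hosts=$(mktemp)
printf 'fdaa:0:18:a7b:7d:0:1:2\tfly-local-6pn\n' > "$hosts"

if grep -q fly-local-6pn "$hosts"; then
  # cut's default delimiter is a tab, so -f 1 yields the address column.
  ip=$(grep fly-local-6pn "$hosts" | cut -f 1)
  echo "6PN address: $ip"
fi

rm -f "$hosts"
```

If this prints the expected address but the release VM still gets :nxdomain, the env.sh.eex logic is not the culprit.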

Which app is this? I don’t see an app on your account with two instances, but if you can’t SSH to one of them that’s a problem (maybe a different problem, but I’d like to look).

This is lakehouse-staging in the Cut Time org.

There are also others reporting similar SSH errors in the last day or so:

Edit: The logs from some of the other people with SSH errors indicate it’s due to errors resolving _orgcert.internal ([1], [2]):

unexpected error: transient SSH server error: can't resolve _orgcert.internal
unexpected error: [ssh: no auth passed yet, transient SSH server error: can't resolve _orgcert.internal]

This should be fixed now. Our internal DNS database diverged on one physical host in Miami. For weird, complicated reasons the deploy_command VMs were all getting scheduled there. So those were failing, while the rest of the app might be running normally.

Apps with 2+ instances that were already running continued to function. New instances would fail, then get rescheduled on other hardware. Apps with 1 instance that rebooted may have stopped working; we saw at least one get rescheduled repeatedly onto the same bad host.

We’re rolling out health checks to detect this specific issue in the future. This was a first for internal DNS.


Also having this issue when we tried to deploy today. It was fine 5 days ago, and we have made no changes to our production infra. We have 2 instances in the mia region, and we are suddenly getting the Can't reach database server error. Maybe an issue with the mia region?

We’re getting the same error this morning when trying to deploy to our instance in London:

Error: 	 09:20:39.893 [error] Postgrex.Protocol (#PID<0.137.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.atlas-review-db.internal:5432): non-existing domain - :nxdomain

We deployed fine yesterday and nothing else changed; I simply updated an existing PR.

[edit] perhaps related to Issues in LHR region?

Indeed, it has been fixed for us. We had applications running in Miami. Thanks!

Fixed for our Miami-region instances, as per the note above from @kurt.