:nxdomain errors when connecting to Postgres after upgrade

sax · January 20, 2023, 8:46pm

I upgraded the postgres app attached to my running Phoenix application, and from then on I have received :nxdomain errors from any new instance (including containers running my release steps).

IPv6 is correctly configured in the app—it was working fine before the upgrade.

I’ve tried restarting the postgres app, re-attaching it to my app, and lots of fiddling in the last running app itself. The old app instance continued to work until it eventually restarted… now my app is in a broken state with no instances able to connect to the DB, and deploys failing because the release step can’t connect.

sax · January 20, 2023, 8:52pm

Update: I just commented out the release command from my fly configuration, and a deploy successfully ran. My new instance can connect to the db.

When I bring back the release command, that fails on the next deployment.

So current status is: app instance is running, able to connect to the db; new deployments that attempt to run db migrations fail, with :nxdomain errors in the release command.

cbortz · January 21, 2023, 6:33pm

+1 I am getting the same error on my Phoenix project. In my case, I didn’t even upgrade the Postgres app.

kurt · January 21, 2023, 7:19pm

It looks like the release command is somehow not picking up the IPv6 settings.

In the logs, you should see something like top2.nearest.of.<dbname>.internal. Check and see if fly dig top2.nearest.of.<dbname>.internal gives results.

Are these apps you set up with fly launch, or did you configure them manually for IPv6 following the "Legacy pre v1.6.3 Phoenix App" guide in the Fly Docs?
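For anyone who wants to run the same kind of check from inside an instance instead of through fly dig, here is a minimal sketch. It assumes getent is available in the image, which may not hold for every base image. The check_dns function and the RESOLVE_CMD override are hypothetical names made up for this example; they are not part of flyctl or the Phoenix release tooling.

```shell
#!/bin/sh
# Hypothetical helper: report whether a hostname resolves.
# RESOLVE_CMD is an assumption of this sketch; it defaults to `getent hosts`
# and is left unquoted on purpose so that "getent hosts" splits into two words.
check_dns() {
  name="$1"
  if ${RESOLVE_CMD:-getent hosts} "$name" >/dev/null 2>&1; then
    echo "ok: $name resolves"
  else
    echo "fail: $name does not resolve"
    return 1
  fi
}
```

Calling check_dns "top2.nearest.of.<dbname>.internal" with your own database name in place of the placeholder would then show whether the lookup fails inside that particular VM, even when fly dig succeeds from elsewhere.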

sax · January 21, 2023, 7:38pm

When I run fly dig with my dbname, it gives back the expected IPv6 address.

One quirk is I can’t get into any of the images that exhibit the problem.

fly ssh console
Connecting to <address>... complete
Error error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

When I use fly ssh console -v and pick my older instance, I can get into it. When I run dig top2.nearest.of.<dbname>.internal from that instance, it gives back the expected address.

sax · January 21, 2023, 7:40pm

The app was set up with fly launch, then tweaked a bit following the guide.

My rel/env.sh.eex includes the following:

if $(grep -q fly-local-6pn /etc/hosts); then
  ip=$(grep fly-local-6pn /etc/hosts | cut -f 1)
  export ECTO_IPV6=true
  export ERL_AFLAGS="-proto_dist inet6_tcp"
  export RELEASE_DISTRIBUTION=name
  export RELEASE_NAME=app
  export RELEASE_NODE=$FLY_APP_NAME@$ip
fi
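One way to test the theory that the release command is not picking up these settings is to check whether the grep in the snippet above would match at all in that environment. The sketch below factors the lookup into a function so it can be pointed at any hosts file. The function name fly_6pn_ip and its file-path argument are assumptions made for illustration, and it assumes the hosts entries are tab-separated, as the cut -f 1 in the snippet implies.

```shell
#!/bin/sh
# Hypothetical helper: print the 6PN address found in a hosts file.
# Returns non-zero when there is no fly-local-6pn entry, which is the case
# where the env.sh.eex block above would skip setting ECTO_IPV6 and friends.
fly_6pn_ip() {
  hosts_file="${1:-/etc/hosts}"
  # `cut -f 1` splits on tabs, matching the original snippet.
  ip=$(grep fly-local-6pn "$hosts_file" 2>/dev/null | head -n 1 | cut -f 1)
  if [ -n "$ip" ]; then
    echo "$ip"
  else
    return 1
  fi
}
```

Running fly_6pn_ip with no argument reads /etc/hosts. If it prints nothing and fails, the IPv6 variables would not be exported in that VM.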

kurt · January 21, 2023, 8:13pm

Which app is this? I don’t see an app on your account with two instances, but if you can’t SSH to one of them that’s a problem (maybe a different problem, but I’d like to look).

sax · January 21, 2023, 8:24pm

This is lakehouse-staging in the Cut Time org.

tom93 · January 23, 2023, 3:14am

There are also others with similar SSH errors from the last day or so:

ssh into an app fails: ssh: unable to authenticate, attempted methods [none publickey]
Could not translate host name "top1.nearest.of.***.internal" to address: Name or service not known

Edit: The logs from some of the other people with SSH errors indicate it’s due to errors resolving _orgcert.internal:

unexpected error: transient SSH server error: can't resolve _orgcert.internal
unexpected error: [ssh: no auth passed yet, transient SSH server error: can't resolve _orgcert.internal]

kurt · January 23, 2023, 4:00pm

This should be fixed now. Our internal DNS database diverged on one physical host in Miami. For weird, complicated reasons the deploy_command VMs were all getting scheduled there. So those were failing, while the rest of the app might be running normally.

Apps with 2+ instances that were already running continued to function. New instances would fail, then get rescheduled on other hardware. Apps with 1 instance that rebooted may have stopped working; we saw at least one get rescheduled repeatedly on the same bad host.

We’re rolling out health checks to detect this specific issue in the future. This was a first for internal DNS.

och1 · January 23, 2023, 2:44pm

Also having this issue when we tried to deploy today. It was fine 5 days ago and we have made no changes to our production infra. We have 2 instances in the mia region and are suddenly getting the "Can't reach database server" error. Maybe an issue with the mia region?

moomerman · January 24, 2023, 9:24am

We’re getting this same error this morning when trying to deploy to our instance in london:

Error: 	 09:20:39.893 [error] Postgrex.Protocol (#PID<0.137.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.atlas-review-db.internal:5432): non-existing domain - :nxdomain

We deployed fine yesterday and nothing else has changed; I simply updated an existing PR.

[edit] perhaps related to Issues in LHR region?
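To feed the failing hostname from a Postgrex error like the one above into fly dig, it can be pulled out of the log line with sed. This is only a sketch: extract_db_host is a hypothetical name, and the pattern assumes the "tcp connect (host:port)" wording shown in that log line.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print the hostname from
# any "tcp connect (host:port)" fragment. Lines without one print nothing.
extract_db_host() {
  sed -n 's/.*tcp connect (\([^:)]*\):[0-9]*).*/\1/p'
}
```

Piping the deploy log through extract_db_host gives a name that can be passed straight to fly dig.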

GinQuin · January 25, 2023, 11:16am

Indeed, it has been fixed for us. We had applications running at Miami. Thanks!

och1 · January 25, 2023, 3:56pm

Fixed for our Miami region instances as per above note from @kurt