I upgraded the postgres app attached to my running Phoenix application, and from then on I have received :nxdomain errors from any new instance (including containers running my release steps).
IPv6 is correctly configured in the app—it was working fine before the upgrade.
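For reference, "configured for IPv6" here means the usual Ecto repo setting in config/runtime.exs. This is only a sketch with placeholder app/module names; the socket_options line is the part that matters:

```elixir
# config/runtime.exs (:my_app and MyApp.Repo are placeholders for illustration)
import Config

if config_env() == :prod do
  config :my_app, MyApp.Repo,
    url: System.get_env("DATABASE_URL"),
    # connect to Postgres over IPv6; Fly's private network addresses
    # (fdaa:...) are IPv6-only
    socket_options: [:inet6],
    pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10")
end
```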
I’ve tried restarting the postgres app, re-attaching it to my app, and a lot of fiddling inside the last running app instance itself. That old instance kept working until it eventually restarted; now my app is in a broken state, with no instances able to connect to the DB and deploys failing because the release step can’t connect.
Update: I just commented out the release command from my fly configuration, and a deploy successfully ran. My new instance can connect to the db.
When I bring the release command back, the next deployment fails.
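For context, the piece I’ve been toggling is the [deploy] section of fly.toml. The migrate script path below is the one the stock Phoenix release setup generates, so treat it as an example rather than my exact config:

```toml
# fly.toml (excerpt): commenting this section out lets the deploy go through;
# putting it back makes the release step fail with :nxdomain
[deploy]
  release_command = "/app/bin/migrate"
```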
So the current status is: the app instance is running and able to connect to the db; new deployments that attempt to run db migrations fail with :nxdomain errors in the release command.
It looks like the release command is somehow not picking up the IPv6 settings.
In the logs, you should see something like top2.nearest.of.<dbname>.internal. Check and see if fly dig top2.nearest.of.<dbname>.internal gives results.
When I run fly dig with my dbname, it gives back the expected IPv6 address.
One quirk is that I can’t SSH into any of the instances that exhibit the problem.
fly ssh console
Connecting to <address>... complete
Error error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
When I use fly ssh console -v and pick my older instance, I can get into it. When I run dig top2.nearest.of.<dbname>.internal from that instance, it gives back the expected address.
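A sketch of that in-VM check, with the query type spelled out since .internal names only have AAAA (IPv6) records; the result shown is a redacted placeholder:

```sh
# inside the older, still-working instance (reached via `fly ssh console -v`)
dig +short aaaa top2.nearest.of.<dbname>.internal
# => fdaa:0:xxxx:xxxx:xxxx:xxxx:xxxx:2   (the db's private IPv6 address)
```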
Which app is this? I don’t see an app on your account with two instances, but if you can’t SSH to one of them that’s a problem (maybe a different problem, but I’d like to look).
This should be fixed now. Our internal DNS database diverged on one physical host in Miami. For weird, complicated reasons the deploy_command VMs were all getting scheduled there. So those were failing, while the rest of the app might be running normally.
Apps with 2+ instances that were already running continued to function. New instances would fail, then get rescheduled on other hardware. Apps with a single instance that rebooted may have stopped working; we saw at least one get rescheduled repeatedly onto the same bad host.
We’re rolling out health checks to detect this specific issue in the future. This was a first for internal DNS.
We’re also having this issue; it showed up when we tried to deploy today. Everything was fine 5 days ago and we’ve made no changes to our production infra. We have 2 instances in the mia region and are suddenly getting the "Can't reach database server" error. Maybe an issue with the mia region?