:nxdomain errors when connecting to Postgres after upgrade

kurt · January 23, 2023, 4:00pm

This should be fixed now. Our internal DNS database diverged on one physical host in Miami. For weird, complicated reasons the deploy_command VMs were all getting scheduled there. So those were failing, while the rest of the app might be running normally.

Apps with 2+ instances that were already running continued to function. New instances would fail, then get rescheduled on other hardware. Apps with 1 instance that rebooted may have stopped working, we saw at least one get rescheduled repeatedly on the same bad host.

We’re rolling out health checks to detect this specific issue in the future. This was a first for internal DNS.