.internal DNS occasionally stops working for some apps?

jamesbirtles · June 23, 2022, 8:54am

Every now and then I seem to have an app stop advertising itself via the <appname>.internal address.

Attempts to ping it result in

ping6: getaddrinfo -- nodename nor servname provided, or not known

And running

dig +short txt _apps.internal

Results in a list that doesn’t contain the app that has stopped working.

The app is online and accessible externally, but internally the DNS is not working.

Only thing that seems to fix it is restarting the problem app.

thomas · June 23, 2022, 2:29pm

What’s the name of the app? We log when lookups fail, so I can probably track down what’s happening here.

jamesbirtles · June 23, 2022, 2:39pm

the last one it happened on was called aircast-api

thomas · June 23, 2022, 2:58pm

Yep, I see the errors, including a couple from today. I’ll dig in and see what orchestration events happened at the same time.

I’m monitoring these failures in general; we have a sort of low-rumbling concern that we’re seeing more internal DNS errors recently, but the level is pretty consistent (we get a lot of DNS failures from recurring lookups for the wrong name, which dominate the metric). But yours is a smoking gun. Thanks!

I’ll let you know what I find out.

jamesbirtles · June 24, 2022, 9:13am

thanks, just an fyi that it has just happened again on the same app

jamesbirtles · June 24, 2022, 9:47am

it looks like the app actually crashed last night, I wonder if thats related?

alexb · June 28, 2022, 6:41pm

Just want to chim in and say that we are observing the exact same issue and behavior. I can reproduce the exact same steps as @jamesbirtles when this happens.

I’d say that we have way more issues with the FRA region than any other.

Happy to help troubleshoot this.

jamesbirtles · July 13, 2022, 1:36pm

I’ve just had a regular deploy trigger the same issue. Are we any closer to a fix?

thomas · July 15, 2022, 8:52pm

Just a quick update:

We believe we’ve isolated this problem to a particular pair host worker hosts in our network that somehow briefly had colliding IP addresses in our WireGuard mesh.