.internal DNS occasionally stops working for some apps?

Every now and then I seem to have an app stop advertising itself via the <appname>.internal address.

Attempts to ping it result in

ping6: getaddrinfo -- nodename nor servname provided, or not known

And running

dig +short txt _apps.internal

Results in a list that doesn’t contain the app that has stopped working.
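
For anyone who wants to catch this automatically, here is a rough sketch of the same check in Python. It assumes the `_apps.internal` TXT record is a comma-separated list of app names, and that `dig +short` prints it as one or more quoted strings per line. The function names `parse_apps` and `app_is_listed` are made up for this example, so treat it as a starting point rather than anything official.

```python
import subprocess
import sys


def parse_apps(dig_output: str) -> set:
    """Turn `dig +short txt _apps.internal` output into a set of app names.

    Assumption: the record is a comma-separated list of app names.
    """
    apps = set()
    for line in dig_output.splitlines():
        # dig prints TXT data as quoted strings; a long record may be split
        # into several quoted chunks on one line, so join them back together.
        chunks = line.split('"')[1::2]
        record = "".join(chunks) if chunks else line.strip()
        apps.update(name.strip() for name in record.split(",") if name.strip())
    return apps


def app_is_listed(app: str, dig_output: str) -> bool:
    """Return True if `app` appears in the parsed TXT record."""
    return app in parse_apps(dig_output)


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_internal_dns.py aircast-api
    app_name = sys.argv[1]
    result = subprocess.run(
        ["dig", "+short", "txt", "_apps.internal"],
        capture_output=True,
        text=True,
        check=True,
    )
    if app_is_listed(app_name, result.stdout):
        print(f"{app_name} is listed in _apps.internal")
    else:
        print(f"{app_name} is MISSING from _apps.internal")
        sys.exit(1)
```

This has to run from a machine inside the private network, since that is the only place `.internal` names resolve. The non-zero exit code on a missing app makes it easy to hook into a cron job or health check.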

The app is online and accessible externally, but internally the DNS is not working.

Only thing that seems to fix it is restarting the problem app.

What’s the name of the app? We log when lookups fail, so I can probably track down what’s happening here.

the last one it happened on was called aircast-api

Yep, I see the errors, including a couple from today. I’ll dig in and see what orchestration events happened at the same time.

I’m monitoring these failures in general; we have a sort of low-rumbling concern that we’re seeing more internal DNS errors recently, but the level is pretty consistent (we get a lot of DNS failures from recurring lookups for the wrong name, which dominate the metric). But yours is a smoking gun. Thanks!

I’ll let you know what I find out.

thanks, just an fyi that it has just happened again on the same app

It looks like the app actually crashed last night. I wonder if that's related?

Just want to chime in and say that we are observing the exact same issue and behavior. I can reproduce the exact same steps as @jamesbirtles when this happens.

I’d say that we have way more issues with the FRA region than any other.

Happy to help troubleshoot this.

I’ve just had a regular deploy trigger the same issue. Are we any closer to a fix?

Just a quick update:

We believe we’ve isolated this problem to a particular pair of worker hosts in our network that somehow briefly had colliding IP addresses in our WireGuard mesh.