DNS UDP responses getting lost in Europe

Previously I was seeing DNS UDP responses being lost between my LHR DNS server and London-based monitoring agents. That issue was resolved in the middle of last week.

Unfortunately, since 53/tcp was enabled I’ve been seeing DNS UDP responses from my DNS server running in Fly’s European regions (currently LHR and FRA) being lost when tested by Europe-based Pingdom monitoring agents. Note that this is not just a London issue: responses are frequently not getting through to Pingdom’s agents in, at least, Zurich and in Falkenberg and Stockholm in Sweden.

Strangely, though, responses from the same servers get through to Pingdom’s North American monitoring agents 100% of the time.

I’m almost certain this is not a Pingdom issue, as I have an identical Pingdom monitor checking the same DNS server software running in Linode’s London DC, and it’s reporting 100% uptime.

To summarize:

  • DNS UDP responses from my DNS server running in Fly’s LHR and FRA regions are frequently being lost when queried by Pingdom’s Europe-based agents
  • Responses from the same servers always get through to Pingdom’s North America-based agents
  • Responses from the same DNS server software running in Linode’s London DC always get through to Pingdom’s Europe-based agents
  • The start of these failures seems to coincide with the enabling of the 53/tcp port

My app name is slickdns. You can check my Fly DNS server by running dig @ wombatsoftware.com.
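For anyone who wants to reproduce this outside Pingdom, here’s a rough sketch of the kind of UDP check involved: send one raw DNS query over UDP and see whether an answer comes back before a timeout. This is my own minimal version of such a probe, not Pingdom’s actual code; the server address you pass in is whatever resolver you want to test.

```python
import random
import socket
import struct

def build_query(name, qtype=1):
    """Encode a minimal DNS query (header + one question) for an A record."""
    txid = random.randint(0, 0xFFFF)
    # Header: id, flags (RD=1), QDCOUNT=1, AN/NS/AR counts = 0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", qtype, 1)  # QTYPE=A, QCLASS=IN
    return txid, header + question

def probe(server, name, timeout=2.0):
    """Send one UDP query; return True if a matching response arrives in time."""
    txid, packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(packet, (server, 53))
        try:
            data, _ = s.recvfrom(512)
            # A response must be at least a 12-byte header and echo our txid.
            return len(data) >= 12 and struct.unpack(">H", data[:2])[0] == txid
        except socket.timeout:
            return False
```

Running something like `probe("203.0.113.1", "wombatsoftware.com")` in a loop from various vantage points (the IP here is a placeholder) reproduces the pass/fail signal the monitoring agents see.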

I’m digging into this right now, but a quick note: 53/tcp probably couldn’t have impacted 53/udp, because they take entirely different data paths. 53/tcp has an ordinary socket listener (though we do a port remapping for it) and runs from haproxy into fly-proxy, while 53/udp is intercepted by XDP in the kernel, so fly-proxy (and anything else in userland) never sees it.

If there’s not some really dumb misconfiguration we can quickly spot here, my next step is just going to be to replicate Pingdom off-network and continuously monitor 53/udp globally.

My useless update: I didn’t see anything obviously misconfigured, I can do EU->EU DNS successfully, and we need the monitoring anyway, so I’m standing up something to continuously monitor multi-region DNS into Fly (github.com/256dpi/newdns is a strange library, but the server-side piece we needed here was essentially its “hello world”).
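The server side of a monitor like that really is a DNS “hello world”: answer any A question with a fixed address so the probes have something deterministic to hit. Here’s a rough Python-stdlib sketch of that shape (this is an illustration, not the actual newdns-based code; the 5353 port and 192.0.2.1 answer are placeholders):

```python
import socket
import struct

def make_response(query, ip="192.0.2.1"):
    """Build a minimal DNS response answering any A question with `ip`."""
    txid = query[:2]
    # Flags 0x8180: QR=1 (response), RD=1, RA=1
    header = txid + struct.pack(">HHHHH", 0x8180, 1, 1, 0, 0)
    # The question starts at byte 12; it ends 4 bytes (QTYPE/QCLASS)
    # past the name's zero terminator. Echo it back verbatim.
    end = query.index(b"\x00", 12) + 5
    question = query[12:end]
    # Answer: compression pointer to the name at offset 12, type A,
    # class IN, TTL 60, RDLENGTH 4, then the address bytes.
    answer = b"\xc0\x0c" + struct.pack(">HHIH", 1, 1, 60, 4)
    answer += bytes(int(p) for p in ip.split("."))
    return header + question + answer

def serve(addr=("0.0.0.0", 5353)):
    """Tiny UDP loop: one query in, one static answer out."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(addr)
        while True:
            query, peer = s.recvfrom(512)
            s.sendto(make_response(query), peer)
```

It skips everything a real authoritative server needs (EDNS, error handling, non-A types), which is fine when the only goal is measuring whether UDP packets make the round trip.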

I’ll keep you updated.

Thanks for the update.

So it seems that enabling 53/tcp likely isn’t the culprit. I just mentioned it because I started seeing UDP failures again after I enabled 53/tcp for my server. I’ve disabled 53/tcp for my DNS (i.e., it’s back to UDP only). I’ll report if that makes a difference.

Hey, sorry about the delayed response on this, but here’s where we’re at:

  • We identified an issue that impacts a small subset of our hosts (a dumb iptables misconfiguration tied to how the interfaces are set up). We’d thought we’d fixed it fleetwide, but nope.

  • The problem was impacting a fra edge node, which EU requests frequently get routed to.

  • Our provisioning tool now checks for and fixes the problem, fleetwide.

  • I’ve stood up an off-network monitoring system that hits a small cluster of Fly DNS apps from a bunch of locations around the world (basically: most of Vultr’s regions); we’ve got alerts set on DNS (and thus UDP) error rates worldwide now.
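The alerting side of a setup like that reduces to: batches of probes per location, a loss rate per batch, a threshold. A minimal sketch of that logic (the location names and the 5% threshold are illustrative, not the actual alert configuration):

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 0.05  # illustrative: alert above 5% UDP loss

@dataclass
class ProbeBatch:
    location: str  # e.g. a Vultr region
    sent: int
    answered: int

    @property
    def loss_rate(self):
        """Fraction of probes in this batch that got no response."""
        return 1.0 - self.answered / self.sent if self.sent else 0.0

def measure(location, probe, n=50):
    """Run `probe()` n times for one location; `probe` returns True on answer."""
    answered = sum(1 for _ in range(n) if probe())
    return ProbeBatch(location, n, answered)

def alerts(batches, threshold=ALERT_THRESHOLD):
    """Return the locations whose DNS (and thus UDP) loss rate is too high."""
    return [b.location for b in batches if b.loss_rate > threshold]
```

One batch per location per interval, fed into `alerts`, is enough to catch the “fine from North America, lossy from Europe” asymmetry in this thread.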

Sorry this took so long to resolve! In retrospect I could have wrapped this up faster if I’d started by auditing all our iptables rules.

Thanks again for the update.

Unfortunately I’m not seeing any improvement at my end (after a restart, FWIW): UDP DNS responses from my DNS server are frequently not getting through to Pingdom’s EU-based probes, whereas the same DNS server running on a VM in Linode’s London DC is at 100% uptime.

Don’t know if it makes any difference, but I noticed that even though I’ve set LHR and FRA as my app’s regions, the instances are running in LHR and CDG.

I’ve been getting 100% uptime in Europe and North America with my Fly DNS servers over the past 24 hours. Looks like the fix was to enable tcp/53 and to enable a tcp health check for the same port.

So it seems like Fly was deeming my app unhealthy and dropping UDP responses only when I was running on udp/53 with no health check configured.
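For reference, here’s roughly what the working services section of my fly.toml looks like, assuming Fly’s standard [[services]] syntax; the check intervals and timeouts are illustrative rather than my exact settings:

```toml
app = "slickdns"

# udp/53 exposed as before
[[services]]
  internal_port = 53
  protocol = "udp"

  [[services.ports]]
    port = 53

# tcp/53 added back, with a TCP health check on the same port
[[services]]
  internal_port = 53
  protocol = "tcp"

  [[services.ports]]
    port = 53

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"
```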

What! That shouldn’t have had any effect. It is super interesting to know about that, and gives us something else to investigate.

Hmm, now I’m stumped as to what I did that fixed the dropped UDP responses. Today I tested removing the tcp check from fly.toml and got no errors, then removed the tcp service entirely (i.e., back to udp/53 only) and still got no errors. So now it’s back to udp/53 and tcp/53 with the tcp health check enabled, and it’s all green.

FWIW I ran the tests above by changing only the app’s fly.toml and leaving the Dockerfile untouched. I don’t know if that makes any difference in terms of re-deployment.

That shouldn’t make any difference! Let us know if you see UDP responses get dropped again. There’s a very good chance that just replacing all your instances is what made it start working, sometimes our fixes require app “reboots” to take effect.