UDP network problems?

jbarham · September 10, 2021, 11:53pm

I started getting errors connecting to my DNS UDP servers about two hours ago (3 PM EST). Not all regions, though. Have there been any recent Fly network changes that might cause that?

jbarham · September 12, 2021, 7:18am

I’m still seeing failures trying to query my Fly DNS UDP servers. What’s odd is that the failures depend on where my DNS client is located. The query dig wombatsoftware.com @213.188.197.200 succeeds when I run it on servers in Fremont, CA and Dallas, TX but fails when run on servers in London, UK or Newark, NJ. Any ideas?

thomas · September 12, 2021, 6:08pm

I’m looking into this now. We run off-net UDP DNS probes in LHR, CDG, FRA, SJC, and a couple of other places, and they’re hooked up to metric alerts; they all seem to be fine. But I’m digging into your particular address right now. What provider are your LHR queries coming from?

thomas · September 12, 2021, 6:26pm

From a Vultr host in LHR, I see UDP packets for your server hitting our LHR edge, and then bouncing to a worker in IAD. On the IAD worker, I can see the UDP DNS queries I’m generating land on your VM intact, but I don’t see responses. As you’ve noticed, TCP DNS works fine.

I’m still poking around, but the next thing I have to debug is why that particular worker is dropping UDP requests.
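For anyone who wants to reproduce the UDP-versus-TCP comparison described above from their own vantage points, here is a minimal sketch using only the Python standard library. It is not Fly's probe tooling; the function names (build_query, query_udp, query_tcp, compare) are hypothetical, and it assumes the server listens on port 53 for both transports. It sends the same A-record query over each transport and reports whether a reply came back.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def query_udp(server: str, name: str, timeout: float = 3.0):
    """Send the query over UDP; return raw reply bytes, or None on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(4096)
            return data
        except OSError:  # includes timeouts and ICMP-triggered errors
            return None


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query_tcp(server: str, name: str, timeout: float = 3.0):
    """Send the query over TCP (2-byte length prefix); None on failure."""
    query = build_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.sendall(struct.pack("!H", len(query)) + query)
            (length,) = struct.unpack("!H", _recv_exact(sock, 2))
            return _recv_exact(sock, length)
    except OSError:
        return None


def compare(server: str, name: str) -> dict:
    """Return {'UDP': bool, 'TCP': bool} indicating which transports replied."""
    return {
        "UDP": query_udp(server, name) is not None,
        "TCP": query_tcp(server, name) is not None,
    }
```

Calling compare("213.188.197.200", "wombatsoftware.com") from hosts in different locations should show whether UDP fails where TCP succeeds. The sketch only checks that some reply arrived; it does not validate the response contents.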

jbarham · September 12, 2021, 10:19pm

I’m querying from some Linode hosted servers. But I was also seeing the same failures from Pingdom’s probes.

thomas · September 14, 2021, 12:45am

This is definitely not you. We’re still actively investigating it, but the problem is region-specific; pulling IAD out of the region pool for this app may resolve the problem in the short term.

We’ll of course comp you for the outage and then some, and I’ll do a better job moving forward keeping you posted. It’s been a whole day of trying to track this down.

jbarham · September 14, 2021, 1:06am

Appreciate the update, and commiserations on debugging what is evidently a tricky bug!

Note that from my end there’s no particular urgency for a fix. I’ve temporarily switched my live name servers back to my previous non-Fly servers. But I certainly want to switch back to Fly when this issue has been resolved.

thomas · September 14, 2021, 1:20am

It’s very urgent for us! In the immediate term, we’re rolling out a workaround that should steer UDP workloads away from machines with the problematic kernel, and then I think I have several more hours of kernel bisection ahead of me.
