I started getting errors connecting to my UDP DNS servers about two hours ago (3 PM EST). Not in all regions, though. Have there been any recent Fly network changes that might cause that?
I’m still seeing failures trying to query my Fly DNS UDP servers. What’s odd is that the failures depend on where my DNS client is located. The query
dig wombatsoftware.com @220.127.116.11 succeeds when I run it on servers in Fremont, CA and Dallas, TX but fails when run on servers in London, UK or Newark, NJ. Any ideas?
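For anyone trying to reproduce this from other vantage points without dig, the location-dependent UDP behavior can be sketched at the socket level. This is a minimal, illustrative probe (the function names are mine, and a real check should use a proper DNS library; the query is a bare RFC 1035 A-record lookup):

```python
import socket
import struct

def build_dns_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    # QTYPE=A (1), QCLASS=IN (1)
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)

def probe_udp(server: str, name: str, timeout: float = 2.0) -> bool:
    """Send one UDP DNS query; True if any response arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_dns_query(name), (server, 53))
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False
```

Running something like probe_udp("220.127.116.11", "wombatsoftware.com") from hosts in different regions should show the same split as dig: some regions answer, others silently time out.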
I’m looking into this now. We run off-net UDP DNS probes in LHR, CDG, FRA, SJC, and a couple of other places, and they’re hooked up to metric alerts; they all seem fine. But I’m digging into your particular address right now. What provider are your LHR queries coming from?
From a Vultr host in LHR, I see UDP packets for your server hitting our LHR edge, and then bouncing to a worker in IAD. On the IAD worker, I can see the UDP DNS queries I’m generating land on your VM intact, but I don’t see responses. As you’ve noticed, TCP DNS works fine.
I’m still poking around, but the next thing I have to debug is why that particular worker is dropping UDP requests.
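The TCP-works/UDP-fails split is straightforward to confirm because TCP DNS carries the exact same message, just framed with a 2-byte big-endian length prefix (RFC 1035 §4.2.2). A hedged sketch of the TCP side, reusing a wire-format query like the one above (frame_tcp and probe_tcp are illustrative names, not anything from our tooling):

```python
import socket
import struct

def frame_tcp(query: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte big-endian length for TCP transport."""
    return struct.pack(">H", len(query)) + query

def probe_tcp(server: str, query: bytes, timeout: float = 2.0) -> bool:
    """Send one DNS query over TCP port 53; True if a framed reply comes back."""
    try:
        with socket.create_connection((server, 53), timeout=timeout) as s:
            s.sendall(frame_tcp(query))
            # A well-formed reply starts with its own 2-byte length prefix.
            prefix = s.recv(2)
            return len(prefix) == 2
    except OSError:
        return False
```

If probe_tcp succeeds everywhere while the UDP probe fails only via certain edges, that points at per-path UDP handling rather than the DNS server itself, which matches what we’re seeing on that IAD worker.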
I’m querying from some Linode hosted servers. But I was also seeing the same failures from Pingdom’s probes.
This is definitely not you. We’re still actively investigating it, but the problem is region-specific; pulling IAD out of the region pool for this app may resolve the problem in the short term.
We’ll of course comp you for the outage and then some, and I’ll do a better job of keeping you posted going forward. It’s been a whole day of trying to track this down.
Appreciate the update, and commiserations on debugging what is evidently a tricky bug!
Note that from my end there’s no particular urgency for a fix. I’ve temporarily switched my live name servers back to my previous non-Fly servers. But I certainly want to switch back to Fly when this issue has been resolved.
It’s very urgent for us! In the immediate term, we’re rolling out a workaround that should just steer UDP workloads away from machines with the problematic kernel, and then I think I have several more hours of kernel bisection ahead of me.