UDP timeouts in Europe?

julia · November 27, 2021, 1:50am

hello! I’m trying to run a UDP DNS server on fly, and I ran into an issue this morning where my friend in Germany was getting a bunch of DNS timeouts when querying my server.

To reproduce the issue, I tested queries from a range of clients using https://dig.ping.pe/fly-test:A:213.188.214.254 and I’m seeing intermittent timeouts from the clients in Europe. North America and China seem to be mostly ok. The requests from Europe seem to succeed about 50% of the time.

Is there something going on with UDP requests on fly.io in Europe? I think the timeout issue isn’t with my server because my server logs say that these responses are being served in less than 1ms.

I’m only running my server in one region right now (yyz) but it doesn’t seem like that should cause these timeouts, especially since the requests from China are succeeding.

ignoramous · November 28, 2021, 2:12pm

I hope this was intermittent. Are you still seeing this?

thomas · November 28, 2021, 5:01pm

I have the same question. I’m poking around now, and while I am seeing network weirdness in EU (my off-network telemetry hosts in Europe have spotty connectivity right now), I can’t reliably reproduce UDP packet loss anywhere.

I’m still digging into this, but if you’ve got more (or current) information, let me know!

Thanks for calling this out.

julia · November 29, 2021, 2:15am

I’m not seeing the issue anymore, but are there any tests I can run in the future to get more data if I see it again?
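
One option for gathering more data next time is a small probe script that sends repeated UDP DNS queries and records which ones time out. What follows is only a rough sketch using the Python standard library, not a tool from fly.io. The function names (`build_query`, `probe`, `loss_rate`) and the defaults (20 probes, 2 second timeout) are made up for illustration. It only measures loss on the path from the machine it runs on, so it is most useful when run from a host in the affected region.

```python
import socket
import struct
import time


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A, 16 = TXT)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [p for p in name.rstrip(".").split(".") if p]
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def probe(server, name, count=20, timeout=2.0, port=53):
    """Send `count` UDP queries; return round-trip times (None = lost)."""
    results = []
    for i in range(count):
        query_id = (0x1000 + i) & 0xFFFF
        packet = build_query(name, query_id=query_id)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            start = time.monotonic()
            try:
                sock.sendto(packet, (server, port))
                data, _ = sock.recvfrom(4096)
                # Only count replies whose ID matches the query we sent.
                ok = len(data) >= 2 and struct.unpack("!H", data[:2])[0] == query_id
                results.append(time.monotonic() - start if ok else None)
            except OSError:
                # Timeouts and other socket errors are both treated as lost.
                results.append(None)
    return results


def loss_rate(results):
    """Fraction of probes that were lost."""
    return sum(r is None for r in results) / len(results) if results else 0.0
```

For the server in this thread, a run would look like `loss_rate(probe("213.188.214.254", "fly-test"))`. Logging the timestamp of each run alongside the result makes it easier to line failures up with anything the fly.io team sees on their side.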

julia · November 30, 2021, 12:19am

I’m seeing some timeouts again at https://dig.ping.pe/fly-test:A:213.188.214.254. Here’s a screenshot. Like before, it’s very intermittent. I noticed because some of my pingdom checks failed.

kurt · November 30, 2021, 12:22am

Are you able to try https://debug.fly.dev from those hosts somehow?

That DNS test just worked for me. Can you tell if it’s a different set of hosts than last time?

julia · November 30, 2021, 12:27am

Weird, it works for me now too! I definitely saw failures several times in a row a couple of minutes ago. I think it’s a different set of hosts than last time (before I think it was the hosts in the Netherlands and Norway), but I’m not sure.

I don’t have access to the hosts that are failing, but it’s good to know about debug.fly.dev! I’ll try that if I can find a host where I can reproduce the issue.

julia · December 1, 2021, 2:56pm

In case it’s helpful, I saw the timeouts again just now, ran the test 10 times, and took 10 screenshots of the failures.

thomas · December 7, 2021, 7:26pm

Hey! Sorry for the late followup on this, but just as a heads up:

I rigged up debug.ipv4.fly.dev, which runs on every worker node in our fleet, to respond to DNS queries (right now, just uptime.fly. and debug.fly.). So you can get a read on which regions of ours are being routed to:

http://dig.ping.pe/debug.fly.:TXT:debug.ipv4.fly.dev
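
To read the same TXT answer from a script rather than through ping.pe, the reply to a `debug.fly.` TXT query has to be decoded. Here is a minimal sketch of that decoding step in plain Python. The helper names `skip_name` and `parse_txt` are hypothetical, and the code makes no assumption about what text the endpoint actually returns. It handles only the common case of a well-formed reply.

```python
import struct


def skip_name(msg, offset):
    """Return the offset just past a (possibly compressed) DNS name."""
    while True:
        length = msg[offset]
        if length & 0xC0 == 0xC0:  # compression pointer: 2 bytes, ends the name
            return offset + 2
        if length == 0:            # root label terminates the name
            return offset + 1
        offset += 1 + length


def parse_txt(msg):
    """Extract TXT strings from the answer section of a DNS response."""
    _id, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", msg[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(msg, offset) + 4  # skip QTYPE and QCLASS
    texts = []
    for _ in range(ancount):
        offset = skip_name(msg, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack("!HHIH", msg[offset:offset + 10])
        offset += 10
        rdata = msg[offset:offset + rdlength]
        offset += rdlength
        if rtype == 16:  # TXT: one or more length-prefixed strings
            pos = 0
            while pos < len(rdata):
                n = rdata[pos]
                texts.append(rdata[pos + 1:pos + 1 + n].decode("utf-8", "replace"))
                pos += 1 + n
    return texts
```

Pass it the raw bytes returned by `recvfrom` after sending a TXT (type 16) query for `debug.fly.` over UDP to the address that `debug.ipv4.fly.dev` resolves to.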

Something is going on with ping.pe's AMS and our AMS. I’m poking at it, but as you’ll probably see, AMS is reachable from a lot of other places.

julia · December 8, 2021, 1:41am

wow I love this debugging endpoint, thank you! I’m excited to find out what you learn – I really don’t understand how to debug network reachability problems yet (or what can cause them)

thomas · December 8, 2021, 2:02am

We’ve temporarily changed the way we’re routing incoming AMS traffic, so that it’s going directly to an edge/worker rather than to an edge-only VM, and that seems to have drastically reduced packet loss in AMS. What prompted the change: traffic from neighboring European areas that landed on AMS workers was working, but AMS->AMS was looking weird. We’re still investigating what’s up there, and I’ll try to post here when I figure it out.

thomas · December 10, 2021, 10:59pm

Just a quick note that we added some error telemetry to our workers and caught an issue that was impacting a machine in Santiago; we had to reset our XDP code to clear it up. It hasn’t impacted Amsterdam so far, but if it does, we’ll get an alert and I’ll let you know. Europe has looked pretty steady for the past several days.
