UDP timeouts in Europe?

Hello! I’m trying to run a UDP DNS server on Fly, and this morning I ran into an issue where my friend in Germany was getting a bunch of DNS timeouts when querying my server.

To reproduce the issue, I tested queries from a range of clients using https://dig.ping.pe/fly-test:A:213.188.214.254 and I’m seeing intermittent timeouts from the clients in Europe. North America and China seem to be mostly ok. The requests from Europe seem to succeed about 50% of the time.

Is there something going on with UDP requests on fly.io in Europe? I think the timeout issue isn’t with my server because my server logs say that these responses are being served in less than 1ms.

I’m only running my server in one region right now (yyz) but it doesn’t seem like that should cause these timeouts, especially since the requests from China are succeeding.

I hope this was intermittent. Are you still seeing this?

I have the same question. I’m poking around now, and while I am seeing network weirdness in the EU (my off-network telemetry hosts in Europe have spotty connectivity right now), I can’t reliably reproduce UDP packet loss anywhere.

I’m still digging into this, but if you’ve got more (or current) information, let me know!

Thanks for calling this out.

I’m not seeing the issue anymore, but are there any tests I can run in the future to get more data if I see it again?
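One way to gather more data the next time this happens is a small probe that sends repeated UDP DNS queries and counts how many time out. The following is only a minimal sketch using the Python standard library; the function names (`build_query`, `probe`, `loss_rate`) and the defaults (20 queries, 2 second timeout) are assumptions made for illustration, not anything Fly provides.

```python
import socket
import struct
import time


def build_query(name, qtype=1, txid=0x1234):
    """Build a minimal DNS query: one question, class IN, recursion desired."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server, name, count=20, timeout=2.0, port=53):
    """Send `count` queries over UDP.

    Returns a list with one entry per query: the round-trip time in
    seconds, or None if the query timed out or the reply did not match.
    """
    results = []
    for i in range(count):
        txid = (0x1000 + i) & 0xFFFF
        packet = build_query(name, txid=txid)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            start = time.monotonic()
            try:
                sock.sendto(packet, (server, port))
                data, _ = sock.recvfrom(4096)
                # Only count replies that echo our transaction ID.
                matched = len(data) >= 2 and struct.unpack("!H", data[:2])[0] == txid
                results.append(time.monotonic() - start if matched else None)
            except OSError:  # socket.timeout is a subclass of OSError
                results.append(None)
    return results


def loss_rate(results):
    """Fraction of probes that got no valid reply (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r is None for r in results) / len(results)
```

For example, `loss_rate(probe("213.188.214.254", "fly-test"))` would report the fraction of failed queries from wherever the script runs. Running it from several vantage points at the same time, and noting timestamps, should make it easier to tell a regional routing problem from a server problem.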

I’m seeing some timeouts again at https://dig.ping.pe/fly-test:A:213.188.214.254; here’s a screenshot. Like before, it’s very intermittent. I noticed because some of my Pingdom checks failed.

Are you able to try https://debug.fly.dev from those hosts somehow?

That DNS test just worked for me. Can you tell if it’s a different set of hosts than last time?

Weird, it works for me now too! I definitely saw failures several times in a row a couple of minutes ago. I think it’s a different set of hosts than last time (before, I think it was the hosts in the Netherlands and Norway), but I’m not sure.

I don’t have access to the hosts that are failing, but it’s good to know about debug.fly.dev! I’ll try that if I can find a host where I can reproduce the issue.

In case it’s helpful, I saw the timeouts again just now and ran the test 10 times and took 10 screenshots of the failures:

Hey! Sorry for the late followup on this, but just as a heads up:

I rigged up debug.ipv4.fly.dev, which runs on every worker node in our fleet, to respond to DNS queries (right now, just uptime.fly. and debug.fly.). So you can get a read on which regions of ours are being routed to:

http://dig.ping.pe/debug.fly.:TXT:debug.ipv4.fly.dev
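The same query can also be sent from any machine you control. The sketch below asks `debug.ipv4.fly.dev` for the TXT record of `debug.fly.` over UDP and decodes the answer. It is a rough illustration using only the Python standard library: the helper names are made up here, and the thread does not say what the TXT record contains, so the code simply returns whatever strings come back.

```python
import socket
import struct

TXT = 16  # DNS record type number for TXT


def encode_name(name):
    """Encode a dotted name as length-prefixed DNS labels."""
    labels = [label for label in name.split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def skip_name(msg, pos):
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if length & 0xC0 == 0xC0:  # a compression pointer is 2 bytes and ends the name
            return pos + 2
        pos += 1 + length


def parse_txt(msg):
    """Extract the TXT strings from the answer section of a DNS response."""
    qdcount, ancount = struct.unpack("!HH", msg[4:8])
    pos = 12
    for _ in range(qdcount):
        pos = skip_name(msg, pos) + 4  # skip QTYPE and QCLASS
    texts = []
    for _ in range(ancount):
        pos = skip_name(msg, pos)
        rtype, _rclass, _ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
        pos += 10
        rdata = msg[pos:pos + rdlen]
        pos += rdlen
        if rtype == TXT:
            parts, i = [], 0
            while i < len(rdata):  # TXT rdata is a series of length-prefixed strings
                n = rdata[i]
                parts.append(rdata[i + 1:i + 1 + n].decode("utf-8", "replace"))
                i += 1 + n
            texts.append("".join(parts))
    return texts


def query_txt(server, name, timeout=2.0):
    """Send one TXT query over UDP; raises socket.timeout if no reply arrives."""
    header = struct.pack("!HHHHHH", 0x4242, 0x0100, 1, 0, 0, 0)
    packet = header + encode_name(name) + struct.pack("!HH", TXT, 1)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(4096)
    return parse_txt(data)
```

Calling `query_txt("debug.ipv4.fly.dev", "debug.fly.")` from a host that is seeing timeouts should either return the TXT strings or raise a timeout, which is itself a useful data point.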

Something is going on with ping.pe's AMS and our AMS. I’m poking at it, but as you’ll probably see, AMS is reachable from a lot of other places.

Wow, I love this debugging endpoint, thank you! I’m excited to find out what you learn – I really don’t understand how to debug network reachability problems yet (or what can cause them).

We’ve temporarily changed the way we route incoming AMS traffic so that it goes directly to an edge/worker rather than to an edge-only VM, and that seems to have drastically reduced packet loss in AMS. What prompted the change: traffic from neighboring European areas that landed on AMS workers was working, but AMS->AMS was looking weird. We’re still investigating what’s up there, and I’ll try to post here when I figure it out.

Just a quick note that we added some error telemetry to our workers and caught an issue that was impacting a machine in Santiago; we had to reset our XDP code to clear it up. It hasn’t impacted Amsterdam (at least not yet), but if it does, we’ll get an alert and I’ll let you know. Europe has looked pretty steady for the past several days.