Yes, we’re also having issues with proxying DNS requests through Fly IAD. It has been broken for about 5 hours.
This is still broken.
I have no log lines I can really provide from a Fly point of view, as I am not running containers in IAD. Any user located in Ashburn, however, is unable to connect via UDP.
See below:
[root@us100 ~]# dig A speedypage.dev @ns1.srvcp.com +tcp
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> A speedypage.dev @ns1.srvcp.com +tcp
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28028
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;speedypage.dev. IN A
;; ANSWER SECTION:
speedypage.dev. 14400 IN A 103.163.187.2
;; Query time: 26 msec
;; SERVER: 37.16.25.85#53(37.16.25.85)
;; WHEN: Tue Mar 21 03:58:27 GMT 2023
;; MSG SIZE rcvd: 59
The above is dig over TCP, which returns as expected. If I then remove +tcp, so dig uses UDP (which has no built-in failover mechanism), it times out:
[root@us100 ~]# dig A speedypage.dev @ns1.srvcp.com
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> A speedypage.dev @ns1.srvcp.com
;; global options: +cmd
;; connection timed out; no servers could be reached
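If it helps anyone reproduce this pair of checks without dig, here is a minimal sketch using the miekg/dns client mentioned later in this thread. The only difference between the two probes is the transport in the Net field, mirroring dig with and without +tcp; the names and server are the ones from the dig output above:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/miekg/dns"
)

// probe sends a single A query with no retries, roughly what
// `dig +tries=1` does, over the given transport ("udp" or "tcp").
func probe(network string) {
	c := &dns.Client{Net: network, Timeout: 3 * time.Second}
	m := new(dns.Msg)
	m.SetQuestion("speedypage.dev.", dns.TypeA)
	in, rtt, err := c.Exchange(m, "ns1.srvcp.com:53")
	if err != nil {
		log.Printf("%s: no answer: %v", network, err)
		return
	}
	fmt.Printf("%s: answered in %v: %v\n", network, rtt, in.Answer)
}

func main() {
	probe("tcp") // returns as expected
	probe("udp") // times out when the edge's UDP path is broken
}
```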
It seems to work fine from here: DNS Checker - DNS Check Propagation Tool
It’s possible it’s failing from one specific region. You can see which region you’re hitting at debug.fly.dev, or possibly with MTR.
Note that UDP will only work for special IPs from within our network. It doesn’t look like you’re trying from a fly hosted VM, but that’s worth knowing.
And last, we only have spotty staff coverage on the forum. If you need one of us to help troubleshoot specific app problems, it would be best to upgrade to a paid plan and use email support. This is probably worthwhile for any UDP service you care about; there’s a lot of weird quirkiness in our UDP stack.
Hi, I’m having a similar issue with UDP for a DNS service I’m trying to run on Fly.io. Incoming packets are being read, but responses don’t seem to be received. I’m binding on fly-global-services:53 and connecting via a dedicated IPv4 address. https://debug.fly.dev/ yields Fly-Region: ams.
Any help on this would be greatly appreciated! I’m about 8+ hours into debugging this. FWIW: I’m serving DNS with miekg/dns (GitHub - miekg/dns: DNS library in Go). I’m not 100% sure whether this library replies from the fly-global-services address (as stressed in Running Fly.io Apps On UDP and TCP · Fly Docs), but I would be surprised if it didn’t, given it uses the same net.ListenPacket as in the example repo.
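For reference, here is a minimal sketch of that binding pattern with miekg/dns. The fly-global-services:53 bind is the part the docs stress; the catch-all handler below is purely illustrative:

```go
package main

import (
	"log"
	"net"

	"github.com/miekg/dns"
)

func main() {
	// Bind explicitly to fly-global-services, not a wildcard address.
	// A socket bound this way also sources its replies from that
	// address, which is what Fly's UDP forwarding expects.
	pc, err := net.ListenPacket("udp", "fly-global-services:53")
	if err != nil {
		log.Fatal(err)
	}

	// Illustrative catch-all handler: echo an empty NOERROR reply.
	// A real server would answer from its zone data here.
	dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
		m := new(dns.Msg)
		m.SetReply(r)
		if err := w.WriteMsg(m); err != nil {
			log.Printf("write reply: %v", err)
		}
	})

	srv := &dns.Server{PacketConn: pc}
	log.Fatal(srv.ActivateAndServe())
}
```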
This isn’t a specific app problem. UDP connectivity is not working in your IAD/Virginia location. The diagnosis I posted above shows a UDP query and a TCP one: UDP fails, TCP works. dig does not retry when UDP fails, while the tester you linked probably does.
Please note this has been working for a year or more now, and seeing as others in IAD are having the same problem, I’m inclined to think there’s nothing fundamentally wrong with the UDP implementation; more likely it’s something specific to this location.
debug.fly.dev brings me to Fly/620fe63b, which is fly-region: iad.
Try this and see if you get a response: dig txt debug.fly @debug.ipv4.fly.dev +short
I tested your app and I think the VM in ewr may not be responding to UDP reliably. Our TCP proxy is getting a lot of errors trying to talk to it (75% of these are from your ewr instance, 25% from sin):
could not proxy TCP data to/from instance: failed to copy (direction=server->client, error=Broken pipe (os error 32))
TCP continues to work because we detect issues like this and route TCP to other instances. UDP is dumber, and is easier to lose.
I killed the EWR container previously and unfortunately it made no difference.
# dig txt debug.fly @debug.ipv4.fly.dev +short
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> txt debug.fly @debug.ipv4.fly.dev +short
;; global options: +cmd
;; connection timed out; no servers could be reached
Quick update: I’m doing a bunch more digging on this today and we found something sus on one of our IAD edges, which we’re making ineligible right now. I’m in the middle of debugging this, but I’ll post more when I’ve got more; just so you know we’re on this.
OK, this was tricky, but we think we’ve tracked down what’s happened: the process that updates our BPF routes for UDP services was getting CPU-throttled by systemd, which had the effect of wedging some stale routes for a small number of services. We tracked this down to ~5 edges (out of several hundred hosts).
In response:
- We’ve got an alert set up across the fleet for this condition ever recurring.
- We’ve CPU-unthrottled the updating process, which should prevent it from recurring in the first place (see the sketch after this list).
- We’re planning on rewriting this process to use corrosion, our internal service discovery system, rather than Consul; it’s Consul tooling friction that’s burning all this CPU to begin with.
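For the curious, here is a rough sketch of what un-throttling a unit looks like in systemd terms. The unit name is hypothetical; CPUQuota= with an empty value is the standard way to clear a CPU cap via a drop-in:

```ini
# /etc/systemd/system/udp-route-updater.service.d/override.conf
# (hypothetical unit name)
[Service]
# An empty assignment resets any earlier CPUQuota= cap, so the
# process is no longer throttled by the cgroup CPU controller.
CPUQuota=
```

Applying it is the usual systemctl daemon-reload followed by a restart of the unit.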
It’s fixed.
Thank you Thomas! Second time now you’ve saved the day re: UDP