Yes, we’re also having issues with proxying DNS requests through Fly IAD. It has been broken for about 5 hours.
This is still broken.
I have no log lines I can really provide from a Fly point of view, as I am not running containers in IAD. Any user located in Ashburn, however, is unable to connect via UDP.
See below:
[root@us100 ~]# dig A speedypage.dev @ns1.srvcp.com +tcp
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> A speedypage.dev @ns1.srvcp.com +tcp
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28028
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;speedypage.dev. IN A
;; ANSWER SECTION:
speedypage.dev. 14400 IN A 103.163.187.2
;; Query time: 26 msec
;; SERVER: 37.16.25.85#53(37.16.25.85)
;; WHEN: Tue Mar 21 03:58:27 GMT 2023
;; MSG SIZE rcvd: 59
The above is dig over TCP, which returns as expected. If I then remove +tcp, so dig uses UDP (which has no built-in failover mechanism), it times out:
[root@us100 ~]# dig A speedypage.dev @ns1.srvcp.com
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> A speedypage.dev @ns1.srvcp.com
;; global options: +cmd
;; connection timed out; no servers could be reached
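If it helps anyone reproduce this pair of checks without dig, here is a minimal sketch using the miekg/dns client mentioned later in this thread. The only difference between the two probes is the transport in the Net field, mirroring dig with and without +tcp; the names and server are the ones from the dig output above:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/miekg/dns"
)

// probe sends a single A query with no retries, roughly what
// `dig +tries=1` does, over the given transport ("udp" or "tcp").
func probe(network string) {
	c := &dns.Client{Net: network, Timeout: 3 * time.Second}
	m := new(dns.Msg)
	m.SetQuestion("speedypage.dev.", dns.TypeA)
	in, rtt, err := c.Exchange(m, "ns1.srvcp.com:53")
	if err != nil {
		log.Printf("%s: no answer: %v", network, err)
		return
	}
	fmt.Printf("%s: answered in %v: %v\n", network, rtt, in.Answer)
}

func main() {
	probe("tcp") // returns as expected
	probe("udp") // times out when the edge's UDP path is broken
}
```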
It seems to work fine from here: DNS Checker - DNS Check Propagation Tool
It’s possible it’s failing from one specific region. You can see which region you’re hitting at debug.fly.dev, or possibly with MTR.
Note that UDP will only work for special IPs from within our network. It doesn’t look like you’re trying from a fly hosted VM, but that’s worth knowing.
And last, we only have spotty staff coverage on the forum. If you need one of us to help troubleshoot specific app problems, it would be best to upgrade to a paid plan and use email support. This is probably worthwhile for any UDP service you care about; there’s a lot of weird quirkiness in our UDP stack.
Hi, I’m having a similar issue with UDP for a DNS service I’m trying to run on Fly.io. Incoming packets are being read, but responses don’t seem to be received. I’m binding on fly-global-services:53 and connecting via a dedicated IPv4 address. https://debug.fly.dev/ yields Fly-Region: ams.
Any help on this would be greatly appreciated! I’m about 8+ hours into debugging this. FWIW: I’m serving DNS with miekg/dns (GitHub - miekg/dns: DNS library in Go). I’m not 100% sure whether this library replies from the fly-global-services address (as stressed in Running Fly.io Apps On UDP and TCP · Fly Docs), but I would be surprised if it didn’t, given it uses the same net.ListenPacket as in the example repo.
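For reference, here is a minimal sketch of that binding pattern with miekg/dns. The fly-global-services:53 bind is the part the docs stress; the catch-all handler below is purely illustrative:

```go
package main

import (
	"log"
	"net"

	"github.com/miekg/dns"
)

func main() {
	// Bind explicitly to fly-global-services, not a wildcard address.
	// A socket bound this way also sources its replies from that
	// address, which is what Fly's UDP forwarding expects.
	pc, err := net.ListenPacket("udp", "fly-global-services:53")
	if err != nil {
		log.Fatal(err)
	}

	// Illustrative catch-all handler: echo an empty NOERROR reply.
	// A real server would answer from its zone data here.
	dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
		m := new(dns.Msg)
		m.SetReply(r)
		if err := w.WriteMsg(m); err != nil {
			log.Printf("write reply: %v", err)
		}
	})

	srv := &dns.Server{PacketConn: pc}
	log.Fatal(srv.ActivateAndServe())
}
```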
This isn’t a specific app problem. UDP connectivity is not working in your IAD/Virginia location. The diagnosis I posted above shows a UDP query and a TCP one: UDP fails, TCP works. dig does not retry when UDP fails, while the tester you linked probably does.
Please note this has been working for a year or more now, and seeing as others in IAD are having the same problem, I’m inclined to think there’s nothing fundamentally wrong with the UDP implementation; more likely it’s something specific to this location.
debug.fly.dev brings me to Fly/620fe63b, which is fly-region: iad.
Try this and see if you get a response: dig txt debug.fly @debug.ipv4.fly.dev +short
I tested your app and I think the VM in ewr may not be responding to UDP reliably. Our TCP proxy is getting a lot of errors trying to talk to it (75% of these are from your ewr instance, 25% from sin):
could not proxy TCP data to/from instance: failed to copy (direction=server->client, error=Broken pipe (os error 32))
TCP continues to work because we detect issues like this and route TCP to other instances. UDP is dumber, and is easier to lose.
I killed the EWR container previously and unfortunately it made no difference.
# dig txt debug.fly @debug.ipv4.fly.dev +short
; <<>> DiG 9.11.36-RedHat-9.11.36-5.el8_7.2 <<>> txt debug.fly @debug.ipv4.fly.dev +short
;; global options: +cmd
;; connection timed out; no servers could be reached
Quick update: I’m doing a bunch more digging on this today and we found something sus on one of our IAD edges, which we’re making ineligible right now. I’m in the middle of debugging this, but I’ll post more when I’ve got more; just so you know we’re on this.
OK, this was tricky, but we think we’ve tracked down what’s happened: the process that updates our BPF routes for UDP services was getting CPU-throttled by systemd, which had the effect of wedging some stale routes for a small number of services. We tracked this down to ~5 edges (out of several hundred hosts).
In response:
- We’ve got an alert set up across the fleet for this condition ever recurring.
- We’ve CPU-unthrottled the updating process, which should prevent it from recurring in the first place (see the sketch after this list).
- We’re planning on rewriting this process to use corrosion, our internal service discovery system, rather than Consul; it’s Consul tooling friction that’s burning all this CPU to begin with.
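For the curious, here is a rough sketch of what un-throttling a unit looks like in systemd terms. The unit name is hypothetical; CPUQuota= with an empty value is the standard way to clear a CPU cap via a drop-in:

```ini
# /etc/systemd/system/udp-route-updater.service.d/override.conf
# (hypothetical unit name)
[Service]
# An empty assignment resets any earlier CPUQuota= cap, so the
# process is no longer throttled by the cgroup CPU controller.
CPUQuota=
```

Applying it is the usual systemctl daemon-reload followed by a restart of the unit.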
It’s fixed.
Thank you Thomas! Second time now you’ve saved the day re: UDP