DNS over TCP works, but UDP doesn't

I want to run CoreDNS on both UDP and TCP port 53. How can I do that with my fly.toml? This is what I have currently, but it seems to be listening on TCP only:

[[services]]
  internal_port = 53
  protocol = "udp"

  [[services.ports]]
    port = 53

[[services]]
  internal_port = 53
  protocol = "tcp"

  [[services.ports]]
    port = 53

This should work, but the configuration is a little persnickety.

Can you show us your CoreDNS config? In particular, which IP are you binding to? UDP responses have to come from the fly-global-services IP. That hostname is defined in /etc/hosts. Some libraries that bind to 0.0.0.0 don’t return packets from the right IP; instead they use the first IP configured on the interface. It’s possible UDP DNS isn’t working for this reason. More details are in this post: UDP reply from unexpected source - #4 by conblem
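To illustrate the binding point above, here is a minimal Python sketch of picking the UDP bind address explicitly instead of relying on 0.0.0.0. It is not CoreDNS code. The helper names, the injectable resolver, and the fallback to 0.0.0.0 when the hostname does not resolve (for example on a laptop) are my own assumptions; only the fly-global-services hostname comes from the post.

```python
import socket

# Hostname mentioned above; on Fly it is defined in /etc/hosts.
FLY_UDP_HOSTNAME = "fly-global-services"


def pick_udp_bind_address(hostname=FLY_UDP_HOSTNAME,
                          fallback="0.0.0.0",
                          resolver=socket.gethostbyname):
    """Return the IPv4 address a UDP listener should bind to.

    Hypothetical helper: it resolves the hostname and falls back to
    `fallback` when resolution fails (an assumption for local runs).
    """
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError.
        return fallback


def open_udp_socket(port, address=None):
    """Open a UDP socket bound to one specific address.

    Binding to a single address means replies sent from this socket
    leave with that address as their source.
    """
    address = address or pick_udp_bind_address()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((address, port))
    return sock
```

On a Fly VM, open_udp_socket(53) would bind to whatever fly-global-services resolves to; elsewhere it falls back to 0.0.0.0.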

I’ve gotten CoreDNS running on UDP with this config:

. {
    health
    bind 0.0.0.0
    bind ::
    whoami
    log
    errors

    redis {
        address localhost:6379
        prefix fly-dns:
    }
}

Here’s my Corefile: fly-coredns/Corefile at fancybits · fancybits/fly-coredns · GitHub

@michael Does that work over both TCP and UDP? I have it working over UDP with the default fly-apps/coredns example (Authoritative CoreDNS on Fly.io), but I want to bind to both TCP and UDP.

@tmm1 I just tried with the CoreDNS sample and got both TCP and UDP working. It seems bind :: is no longer needed.

Here’s the fly.toml file:

app = "damp-bird-1643"

kill_signal  = "SIGINT"
kill_timeout = 5

[[services]]
internal_port = 53
protocol      = "udp"

  [[services.ports]]
  port = 53

[[services]]
internal_port = 53
protocol      = "tcp"

  [[services.ports]]
  port = 53

Run dig with a UDP and TCP query:

dig +notcp @damp-bird-1643.fly.dev example.com
dig +tcp @damp-bird-1643.fly.dev example.com

And in the app logs:

2021-10-26T16:26:02.050 app[a5cc1204] sea [info] [INFO] 219.64.132.129:59459 - 3852 "A IN www.example.com. udp 44 false 4096" NOERROR qr,aa,rd 144 0.000142654s
2021-10-26T16:26:03.278 app[a5cc1204] sea [info] [INFO] 185.201.121.211:51766 - 50882 "A IN www.example.com. tcp 44 false 65535" NOERROR qr,aa,rd 144 0.00013316s
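If dig isn't available, the same UDP-versus-TCP comparison can be sketched in plain Python. This is only a rough stand-in for the two dig commands, using the standard library; the function names are made up for illustration, and it reads nothing but the response code. The one protocol difference it shows is that DNS over TCP prefixes each message with a two-byte length, while UDP sends the bare message.

```python
import socket
import struct


def build_query(name, query_id=0x1234, qtype=1):
    """Encode a minimal DNS query (class IN, type A by default)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def rcode(response):
    """Return the response code from a DNS message (0 means NOERROR)."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def query_udp(server, name, port=53, timeout=3.0):
    """Send the query as a single UDP datagram and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        data, _addr = sock.recvfrom(4096)
        return data


def _recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def query_tcp(server, name, port=53, timeout=3.0):
    """Send the query over TCP, framed with a two-byte length prefix."""
    message = build_query(name)
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(message)) + message)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def probe(server, name):
    """Try both transports and report a result string for each."""
    results = {}
    for label, query in (("udp", query_udp), ("tcp", query_tcp)):
        try:
            results[label] = "rcode=%d" % rcode(query(server, name))
        except OSError as exc:  # timeouts and refused connections land here
            results[label] = "failed: %s" % exc
    return results
```

Calling probe("damp-bird-1643.fly.dev", "example.com") returns one entry for udp and one for tcp, so a UDP timeout next to a working TCP answer shows up side by side.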

I tried it again and it’s working this time. Not sure what happened before, but I appreciate the help!

If you get errors from UDP lookups again, will you post back here? I’m glad it’s working now! I’d like to make sure it continues.

I just did a deploy and am having the same issue again.

TCP works but UDP doesn’t.

$ dig +tcp +short 1-2-3-4.abcd.u.channelsdvr.net @ipdns1.channelsdvr.net
1.2.3.4

$ dig +short 1-2-3-4.abcd.u.channelsdvr.net @ipdns1.channelsdvr.net
;; connection timed out; no servers could be reached

That UDP dig is working from where I am. Will you hit https://debug.fly.dev and tell us what the Fly-Region header says?

Fly-Region: lax

I tried both of my Fly IPs and got the same result over UDP:

; <<>> DiG 9.10.6 <<>> 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.216.24
;; global options: +cmd
;; connection timed out; no servers could be reached


; <<>> DiG 9.10.6 <<>> 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.214.67
;; global options: +cmd
;; connection timed out; no servers could be reached

I tried from a location near IAD and it works there:

% curl -s debug.fly.dev | grep Region
Fly-Region: iad
% dig 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.216.24

; <<>> DiG 9.10.6 <<>> 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.216.24
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29954
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;192-168-1-1.bbb4173bbf21.u.channelsdvr.net. IN A

;; ANSWER SECTION:
192-168-1-1.bbb4173bbf21.u.channelsdvr.net. 604800 IN A	192.168.1.1

;; Query time: 38 msec
;; SERVER: 213.188.216.24#53(213.188.216.24)
;; WHEN: Wed Jan 05 19:34:35 EST 2022
;; MSG SIZE  rcvd: 129

OK, yeah, something is weird here! We’re looking into it. You can see the effect from different regions with this tool: Ping, mtr, dig and TCP port check from multiple locations

Should I try redeploying?

Last time I removed the tcp service and then it started working over UDP reliably.

I don’t want to disrupt your debugging, but this is also affecting our production services now.

Yeah feel free, I’m not sure it’ll help but it won’t hurt what we’re looking at.

FYI it didn’t help. I switched our services over to a backup DNS provider for now.

We think there’s a bug that keeps old VMs in our edge UDP mappings after they’ve gone away. This means that UDP packets are getting sent to now dead VMs based on sort order. This is definitely something we can fix.

Not sure if the bug was fixed already or the stale entries simply expired, but it is working again.

I think your deploy actually fixed it, believe it or not. We’re going to track down this bug, but if you experience it again try doing this:

Run fly status to get a list of VMs. Then run fly vm stop <id> on any of them. There’s some kind of stale data that a deploy seems to flush, and stopping a VM could have the same effect.
