DNS over TCP works, but UDP doesn't

If you get errors from UDP lookups again, will you post back here? I’m glad it’s working now! I’d like to make sure it continues.

I just did a deploy and am having the same issue again.

TCP works but UDP doesn’t.

$ dig +tcp +short 1-2-3-4.abcd.u.channelsdvr.net @ipdns1.channelsdvr.net
1.2.3.4

$ dig +short 1-2-3-4.abcd.u.channelsdvr.net @ipdns1.channelsdvr.net
;; connection timed out; no servers could be reached
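The two dig runs above differ only in transport. A minimal Go sketch of the same comparison follows, assuming Go's built-in resolver is acceptable as a stand-in for dig. The helper name resolverFor and the program name dnscompare are hypothetical, and the timeouts are arbitrary choices.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// resolverFor (hypothetical helper) returns a resolver that sends every
// query to server (host:port) over the given network ("udp" or "tcp"),
// ignoring the system's configured name servers.
func resolverFor(network, server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so Dial is honored
		Dial: func(ctx context.Context, _, _ string) (net.Conn, error) {
			// Override whatever transport and address the resolver asked for.
			d := net.Dialer{Timeout: 3 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: dnscompare <server:port> <name>")
		return
	}
	server, name := os.Args[1], os.Args[2]

	// Ask for A records over each transport and report the result.
	for _, network := range []string{"udp", "tcp"} {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		ips, err := resolverFor(network, server).LookupIP(ctx, "ip4", name)
		cancel()
		if err != nil {
			fmt.Printf("%s: FAILED: %v\n", network, err)
			continue
		}
		fmt.Printf("%s: %v\n", network, ips)
	}
}
```

Run it as, for example, go run . ipdns1.channelsdvr.net:53 1-2-3-4.abcd.u.channelsdvr.net. A failure on the udp line with a success on the tcp line would match the symptom shown above.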

That UDP dig is working from where I am. Will you hit https://debug.fly.dev and tell us what the Fly-Region header says?

Fly-Region: lax

I tried both of my Fly IPs and got the same result over UDP:

; <<>> DiG 9.10.6 <<>> 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.216.24
;; global options: +cmd
;; connection timed out; no servers could be reached


; <<>> DiG 9.10.6 <<>> 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.214.67
;; global options: +cmd
;; connection timed out; no servers could be reached

I tried from a location near IAD and it works there:

% curl -s debug.fly.dev | grep Region
Fly-Region: iad
% dig 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.216.24

; <<>> DiG 9.10.6 <<>> 192-168-1-1.bbb4173bbf21.u.channelsdvr.net @213.188.216.24
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29954
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;192-168-1-1.bbb4173bbf21.u.channelsdvr.net. IN A

;; ANSWER SECTION:
192-168-1-1.bbb4173bbf21.u.channelsdvr.net. 604800 IN A	192.168.1.1

;; Query time: 38 msec
;; SERVER: 213.188.216.24#53(213.188.216.24)
;; WHEN: Wed Jan 05 19:34:35 EST 2022
;; MSG SIZE  rcvd: 129

OK, yeah, something is weird here! We’re looking. You can see the effect from different regions with this tool: Ping, mtr, dig and TCP port check from multiple locations

Should I try redeploying?

Last time I removed the tcp service and then it started working over UDP reliably.

I don’t want to disrupt your debugging, but this is also affecting our production services now.

Yeah feel free, I’m not sure it’ll help but it won’t hurt what we’re looking at.

FYI it didn’t help. I switched over our services to a backup dns provider for now.

We think there’s a bug that keeps old VMs in our edge UDP mappings after they’ve gone away. This means that UDP packets are getting sent to now dead VMs based on sort order. This is definitely something we can fix.

Not sure if the bug was fixed already, or the stale entries simply expired, but it is working again.

I think your deploy actually fixed it, believe it or not. We’re going to track down this bug, but if you experience it again try doing this:

Run fly status to get a list of VMs. Then run fly vm stop <id> on any of them. There’s some kind of stale data that a deploy seems to flush; stopping a VM could have the same effect.

Before trying CoreDNS, I got the simple example Trivial TCP/UDP Echo Service working in FRA with the following modification in main.go:

//      port int = 5000
        port int = 53

and using this fly.toml:

# fly.toml file generated for os1 on 2022-05-01T11:15:54+02:00

app = "os1"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  ECHO_PORT = 53

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 53
  protocol = "udp"

  [[services.ports]]
    port = "53"

[[services]]
  internal_port = 53
  protocol = "tcp"

  [[services.ports]]
    port = "53"

Tested with netcat from an OpenBSD host. IPv4 TCP and UDP and IPv6 TCP work, but IPv6 UDP fails (this is documented elsewhere as still pending):

[rs@gate:~]$ nc -t -4 os1.fly.dev 53 
qwe
qwe
123
123
^C
[rs@gate:~]$ nc -u -4 os1.fly.dev 53 
sdf
sdf
xvcb
xvcb
^C
[rs@gate:~]$ nc -t -6 os1.fly.dev 53 
yxc
yxc
^C
[rs@gate:~]$ nc -u -6 os1.fly.dev 53 
qwert
^C
[rs@gate:~]$

My services stopped responding today. A restart fixed it.

App = channelsdvrnet-dns

All of them, or just the UDP part?

I’m not sure. I should have tried dig +tcp before restarting.

I’m wondering if there’s a way to set up a health check via fly.toml that would verify a UDP DNS response.

Edit: issue started around 10:25am PST and lasted until I restarted at 2:15pm

(1) There might not be one right now (I’ll go check).

(2) I’m kind of kicking myself for not thinking of having a DNS health check, and thanks for bringing that up.

Most of our health checks are run through Consul, which runs health checks locally, so there might be limited value in the simplest DNS health checks we can do, but there’s probably an “off-net” thing we could do here. I can’t promise a timeline (we already do off-net monitoring for UDP on our platform, but it’s not as specific as particular DNS queries for particular apps), but I think this might be worth investigating.
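For anyone who wants an external check in the meantime, here is a minimal Go sketch of a UDP-only DNS probe that could be run from outside the platform (for example from cron or an external monitor). It is an illustration under stated assumptions, not a Fly.io feature: the function names buildQuery, checkResponse and probeUDP, the fixed transaction ID, and the two-second timeout are all hypothetical choices. It hand-encodes a single A query and only inspects the reply header.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// buildQuery encodes a minimal DNS query for one A record (class IN).
// Flags are left at zero: a standard query without recursion desired.
func buildQuery(id uint16, name string) ([]byte, error) {
	msg := make([]byte, 12, 64)              // 12-byte header, zeroed
	binary.BigEndian.PutUint16(msg[0:2], id) // transaction ID
	binary.BigEndian.PutUint16(msg[4:6], 1)  // QDCOUNT = 1
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		if len(label) == 0 || len(label) > 63 {
			return nil, fmt.Errorf("invalid label %q", label)
		}
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0)          // root label terminates the name
	msg = append(msg, 0, 1, 0, 1) // QTYPE = A, QCLASS = IN
	return msg, nil
}

// checkResponse inspects only the reply header: matching ID, QR bit set,
// RCODE 0 (NOERROR), and at least one answer record.
func checkResponse(id uint16, resp []byte) error {
	if len(resp) < 12 {
		return errors.New("short response")
	}
	if binary.BigEndian.Uint16(resp[0:2]) != id {
		return errors.New("transaction ID mismatch")
	}
	if resp[2]&0x80 == 0 {
		return errors.New("not a response")
	}
	if rcode := resp[3] & 0x0f; rcode != 0 {
		return fmt.Errorf("rcode %d", rcode)
	}
	if binary.BigEndian.Uint16(resp[6:8]) == 0 {
		return errors.New("no answers")
	}
	return nil
}

// probeUDP sends one query over UDP and waits up to timeout for a reply.
func probeUDP(server, name string, timeout time.Duration) error {
	const id = 0x1234 // arbitrary fixed ID, fine for a one-shot probe
	q, err := buildQuery(id, name)
	if err != nil {
		return err
	}
	conn, err := net.DialTimeout("udp", server, timeout)
	if err != nil {
		return err
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	if _, err := conn.Write(q); err != nil {
		return err
	}
	buf := make([]byte, 4096)
	n, err := conn.Read(buf)
	if err != nil {
		return err // a timeout here is the failure mode seen in this thread
	}
	return checkResponse(id, buf[:n])
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: dnsprobe <server:port> <name>")
		return
	}
	if err := probeUDP(os.Args[1], os.Args[2], 2*time.Second); err != nil {
		fmt.Println("unhealthy:", err)
		os.Exit(1)
	}
	fmt.Println("healthy")
}
```

The non-zero exit status on failure is what a monitor would key on. Because the probe runs off-net, it exercises the edge UDP path as well as the app itself, which a check running inside the VM would not.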

Locally (in vm)? Ref: Healthchecks and private networks - #2 by kurt