DNS over TCP works, but UDP doesn't

Before trying CoreDNS, I got the simple Trivial TCP/UDP Echo Service example working in FRA with the following modification in main.go

//      port int = 5000
        port int = 53

and with this fly.toml:

# fly.toml file generated for os1 on 2022-05-01T11:15:54+02:00

app = "os1"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  ECHO_PORT = 53

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 53
  protocol = "udp"

  [[services.ports]]
    port = "53"

[[services]]
  internal_port = 53
  protocol = "tcp"

  [[services.ports]]
    port = "53"

Tested with netcat from an OpenBSD host:
IPv4 TCP & UDP and IPv6 TCP work, but IPv6 UDP does not (the latter is documented elsewhere as still pending):

[rs@gate:~]$ nc -t -4 os1.fly.dev 53 
qwe
qwe
123
123
^C
[rs@gate:~]$ nc -u -4 os1.fly.dev 53 
sdf
sdf
xvcb
xvcb
^C
[rs@gate:~]$ nc -t -6 os1.fly.dev 53 
yxc
yxc
^C
[rs@gate:~]$ nc -u -6 os1.fly.dev 53 
qwert
^C
[rs@gate:~]$

My services stopped responding today. A restart fixed it.

App = channelsdvrnet-dns

All of them, or just the UDP part?

I’m not sure; I should have tried dig +tcp before restarting.

I’m wondering if there’s a way to set up a health check via fly.toml that would verify a UDP DNS response.

Edit: issue started around 10:25am PST and lasted until I restarted at 2:15pm

(1) There might not be right now (I’ll go check).

(2) I’m kind of kicking myself for not thinking of a DNS health check, and thanks for bringing that up.

Most of our health checks run through Consul, which performs them locally, so the simplest DNS health checks we could do might have limited value; there’s probably an “off-net” check we could do here instead. I can’t promise a timeline (we already do off-net UDP monitoring for the platform, but it isn’t as particular as specific DNS queries against specific apps), but I think this is worth investigating.


Locally (in vm)? Ref: Healthchecks and private networks - #2 by kurt

Hi, we did a deploy today and now this is happening again.

I can do TCP lookups, but UDP is not working.

TCP works:

$ dig +tcp 1-1-1-1.deadbeef.u.channelsdvr.net @ipdns2.channelsdvr.net

;; ANSWER SECTION:
1-1-1-1.deadbeef.u.channelsdvr.net. 604800 IN A	1.1.1.1

;; Query time: 122 msec
;; SERVER: 213.188.216.24#53(213.188.216.24)

UDP no response:

$ dig 1-1-1-1.deadbeef.u.channelsdvr.net @ipdns2.channelsdvr.net

; <<>> DiG 9.10.6 <<>> 1-1-1-1.deadbeef.u.channelsdvr.net @ipdns2.channelsdvr.net
;; global options: +cmd
;; connection timed out; no servers could be reached

After last time we set up a script check, and it is currently still passing when invoking dig against 127.0.0.1. So something about UDP routing into our instance is broken.

UDP is tricky on Fly. In particular, pay attention to the four quirks mentioned in the docs (if you weren’t already):

But before we get started, there are four gotchas you need to know about.

  • The UDP side of your application needs to bind to the special fly-global-services address. But the TCP side of your application can’t; it should bind to 0.0.0.0.
  • The UDP side of your application needs to bind to the same port that is used externally. Fly will not rewrite the port; Fly only rewrites the IP address for UDP packets.
  • We support IPv6 for TCP, but not for UDP.
  • We swipe a couple dozen bytes from your MTU for UDP, which usually doesn’t matter, but in rare cases might.

Thanks. I did a rollback and it’s working again. Will try to figure out what changed.


Well, things are working again. I isolated each change and deployed it separately, then eventually deployed all the same changes together, and it’s still working. I’m pretty confident it’s nothing on my side.

This matches my experience in the past, where a new container deployed after a while breaks, and a couple more deploys/restarts then magically fixes things. I guess the bug referenced above is still present.
