(1) there might not be right now (I'll go check)
(2) I'm kind of kicking myself for not thinking of having a DNS health check, and thanks for bringing that up.
Most of our health checks run through Consul, which performs them locally, so the simplest DNS health checks we could add might have limited value, but there's probably an "off-net" check we could do here. I can't promise a timeline (we already do off-net monitoring for UDP on our platform, but those checks aren't as specific as particular DNS queries against particular apps), but I think this is worth investigating.
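For what it's worth, an off-net probe for this wouldn't have to be elaborate. Here's a rough sketch in Go (not something we run today; the record name and server are just the ones from the dig example below, and the timeouts are arbitrary) of a check that forces a lookup through the public server over UDP and exits non-zero on failure, so it could be scheduled from a host outside our network:

package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Placeholders borrowed from the dig example in this thread; substitute
	// whatever record and server the real check should hit.
	const server = "ipdns2.channelsdvr.net:53"
	const name = "1-1-1-1.deadbeef.u.channelsdvr.net"

	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			// Ignore the system resolver config and send the query to the
			// server under test over UDP, mirroring what dig does.
			d := net.Dialer{Timeout: 3 * time.Second}
			return d.DialContext(ctx, "udp", server)
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, name)
	if err != nil {
		fmt.Fprintf(os.Stderr, "off-net DNS probe failed: %v\n", err)
		os.Exit(2) // non-zero exit marks the probe as failing
	}
	fmt.Printf("off-net DNS probe ok: %s -> %v\n", name, addrs)
}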
$ dig 1-1-1-1.deadbeef.u.channelsdvr.net @ipdns2.channelsdvr.net
; <<>> DiG 9.10.6 <<>> 1-1-1-1.deadbeef.u.channelsdvr.net @ipdns2.channelsdvr.net
;; global options: +cmd
;; connection timed out; no servers could be reached
After last time, we set up a script check, and it is currently still passing when invoking dig against 127.0.0.1. So there's something about UDP routing into our instance that's broken.
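For context, the script check is nothing fancy; something roughly like this (a sketch only: the record name is a placeholder, and Consul treats any non-zero exit as a failing check) wraps the same dig invocation against the local resolver:

package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Run dig against the local resolver; dig exits non-zero when it gets
	// no reply, which the script check treats as a failure. The record
	// name here is only a placeholder.
	cmd := exec.Command("dig", "+time=2", "+tries=1", "@127.0.0.1", "healthcheck.example.internal")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Printf("local dig check failed: %v", err)
		os.Exit(2)
	}
}

Because this runs inside the instance, it only proves the DNS process itself is answering; it says nothing about whether UDP is actually reaching the instance from outside, which is why the off-net idea above is appealing.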
But before we get started, there are four gotchas you need to know about.
The UDP side of your application needs to bind to the special fly-global-services address. But the TCP side of your application can't; it should bind to 0.0.0.0 (there's a rough sketch of this after the gotchas).
The UDP side of your application needs to bind to the same port that is used externally. Fly will not rewrite the port; Fly only rewrites the IP address for UDP packets.
We support IPv6 for TCP, but not for UDP.
We swipe a couple dozen bytes from your MTU for UDP, which usually doesn't matter, but in rare cases might.
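To make the first two gotchas concrete, here's a minimal Go sketch (not anyone's actual app; the echo behavior and port 5000 are placeholders) where the UDP listener binds to fly-global-services on the same port that's exposed externally, while the TCP listener binds to 0.0.0.0:

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// UDP: bind to the special fly-global-services address, on the exact
	// port exposed externally (Fly rewrites the IP, not the port).
	udpAddr, err := net.ResolveUDPAddr("udp", "fly-global-services:5000")
	if err != nil {
		log.Fatalf("resolve fly-global-services: %v", err)
	}
	udpConn, err := net.ListenUDP("udp", udpAddr)
	if err != nil {
		log.Fatalf("listen udp: %v", err)
	}
	defer udpConn.Close()

	go func() {
		buf := make([]byte, 2048) // keep datagrams well under the reduced MTU
		for {
			n, addr, err := udpConn.ReadFromUDP(buf)
			if err != nil {
				log.Printf("udp read: %v", err)
				continue
			}
			// Placeholder behavior: echo the datagram back to the sender.
			udpConn.WriteToUDP(buf[:n], addr)
		}
	}()

	// TCP: bind to 0.0.0.0 (Fly can map the external TCP port to this one).
	ln, err := net.Listen("tcp", "0.0.0.0:8080")
	if err != nil {
		log.Fatalf("listen tcp: %v", err)
	}
	defer ln.Close()
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatalf("accept: %v", err)
		}
		fmt.Fprintln(conn, "hello over tcp")
		conn.Close()
	}
}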
Well, things are working again. I isolated each change and deployed it separately, then eventually deployed all the same changes together, and it's still working. I'm pretty confident it's nothing on my side.
This matches my past experience where a fresh deploy after some time breaks things, and then a couple more deploys/restarts magically fix them. I guess the bug referenced above is still present.