Fly.io operates two DNS services, one internal (for your apps) and one public (for your users). This post is about the latter, our public DNS, and some recent work we have been doing to tidy it up.
We recently made a change to consolidate our DNS and remove the external provider that we used for ACME DNS-01 challenges. (These challenges are used for DNS validation, where we get you to set a CNAME to example.com.xxxxxx.flydns.net to prove to Let’s Encrypt that you control your domain.)
This was largely an operational change that shouldn’t be noticed, and should in theory improve our certificate provisioning by removing an external provider’s uptime from the equation. To make this change, though, our public DNS service had to become authoritative for flydns.net as a whole, which mostly went off without a hitch.
The one symptom we started seeing was reports from customers that they were having trouble resolving their fly.dev domains… sometimes. When we asked them who their ISP was, the answer was always AT&T. So we knew something was up with us, or them, or us and them in combination. After a few days and lots of debugging, we tracked it down to errant NXDOMAIN responses from our DNS.
When our public DNS received a query for our hardcoded records, if it didn’t find an answer it would return NXDOMAIN, informing you that the domain you asked about does not exist. This isn’t good when the domain does, in fact, exist. Most painfully for us, if you asked for the AAAA records for ns1.flydns.net we would confidently reply that ns1.flydns.net is not real.
This wasn’t a big deal in practice, since asking for the A records got you an answer. But it was a big deal if the recursive DNS provider you talked to cached the non-existence of the domain across all query types, and thus concluded that fly.dev had nameservers that did not exist. Thankfully, this doesn’t seem to be a failure mode with other resolvers, and once we tracked the issue down and changed our responses to empty-but-successful, everything cleared up.
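To make the difference between the two responses concrete, here is a minimal sketch of the lookup logic in Python. It is not our actual DNS server code: the ZONE table, the address in it, and the answer function are all made up for illustration, and it ignores wildcards and empty non-terminals. The point is only that "the name exists but has no records of this type" should come back as NOERROR with an empty answer, and NXDOMAIN should be reserved for names that don’t exist at all.

```python
# Hypothetical in-memory zone. The name is from the post; the address is a
# documentation-range placeholder, not a real nameserver address.
ZONE = {
    "ns1.flydns.net.": {"A": ["192.0.2.53"]},
}

NOERROR, NXDOMAIN = 0, 3  # RCODE values defined in RFC 1035


def answer(name, rtype):
    """Return (rcode, records) for a query against ZONE."""
    rrsets = ZONE.get(name.lower())
    if rrsets is None:
        # The name itself is unknown: this is the only NXDOMAIN case.
        return NXDOMAIN, []
    # The name exists. If there are no records of this type, the answer is
    # empty but still successful (often called NODATA).
    return NOERROR, rrsets.get(rtype, [])
```

With this shape, answer("ns1.flydns.net.", "AAAA") is an empty NOERROR response, so a resolver that caches negative answers per name has nothing to poison the A lookup with.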
While we were in the area, we also fixed up a slew of other mostly-boring DNS stuff.
Until recently, we wouldn’t respond to DNS queries over TCP. Our public DNS nabs a couple of IP addresses from our anycast ranges, but a missed configuration meant that TCP traffic to those IP addresses was still being routed into the Fly Proxy. Many parts of Fly.io are Fly Apps, but DNS is not one of those parts, so the proxy summarily dropped those connections. This is fixed! You can now dig +tcp straight at our nameservers, if you so choose.
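If you’re curious what changes on the wire when a query goes over TCP: the DNS message is the same, but it is prefixed with a two-byte length field (RFC 1035, section 4.2.2). Here is a rough Python sketch of that framing. The build_query and frame_for_tcp helpers and the query ID are invented for the example, and it only builds the bytes rather than opening a connection.

```python
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query message (qtype 1 is A, class 1 is IN)."""
    # Header: ID, flags (all zero: a plain query with no recursion desired),
    # then QDCOUNT=1 and ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message):
    """Prefix a DNS message with its length as a big-endian 16-bit integer."""
    return struct.pack("!H", len(message)) + message
```

Sending frame_for_tcp(build_query("fly.dev")) over a TCP socket to port 53 is roughly what dig +tcp does for you, minus the EDNS options it adds by default.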
We also improved our compliance with the myriad of DNS RFCs that exist in the world. Most importantly, we now correctly receive and respond to queries that provide EDNS settings, and we behave better when truncating responses.
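As a sketch of how EDNS and truncation interact (an illustration of the RFC rules, not a description of our implementation): without EDNS a UDP response is limited to 512 bytes, and with EDNS the client advertises the payload size it can accept, where values below 512 are treated as 512 (RFC 6891). A response that doesn’t fit gets the TC bit set so the client knows to retry over TCP. The function names below are hypothetical, and a real server may also apply its own upper limit.

```python
CLASSIC_UDP_LIMIT = 512  # maximum UDP message size without EDNS (RFC 1035)


def udp_limit(edns_payload_size=None):
    """Largest UDP response the client says it can accept."""
    if edns_payload_size is None:
        # No OPT record in the query, so the classic limit applies.
        return CLASSIC_UDP_LIMIT
    # RFC 6891: advertised sizes below 512 are treated as 512.
    return max(CLASSIC_UDP_LIMIT, edns_payload_size)


def should_truncate(response_len, edns_payload_size=None):
    """True if a UDP response of this length needs the TC bit set."""
    return response_len > udp_limit(edns_payload_size)
```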
Finally, we improved the setup of our reverse DNS zones. We’re still finishing this up, but soon you’ll be able to dig -x both IPv4 and IPv6 addresses and trace them back to us.
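For reference, dig -x works by turning the address into a name under in-addr.arpa (IPv4) or ip6.arpa (IPv6) and asking for its PTR record. Python’s standard ipaddress module can show what those names look like. The ptr_name wrapper is just for illustration, and the addresses are documentation examples, not ours.

```python
import ipaddress


def ptr_name(address):
    """Return the reverse DNS name that a PTR lookup for this address queries."""
    return ipaddress.ip_address(address).reverse_pointer


print(ptr_name("192.0.2.1"))    # octets reversed, under in-addr.arpa
print(ptr_name("2001:db8::1"))  # 32 nibbles reversed, under ip6.arpa
```

Those names live in the reverse zones, which is why the zones have to be set up properly before dig -x can lead back to us.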
All of these are things you likely never noticed! It’s rare that your computer would ever talk directly to our nameservers. Instead, you’ll talk to a recursive nameserver, and they tend to be forgiving when an upstream DNS server isn’t perfect. But as we have seen, that isn’t always the case. Our bad.