DNS servers come in two flavors, authoritative and recursive, and here at Fly.io we have to run infrastructure for both. We run an Anycast authoritative DNS service for domains we own, which is (mostly) fly.dev. On each server, we also need a local authoritative DNS service for the .internal pseudo-domains used for private networking, plus a recursive DNS server so that machines can reach every other domain on the internet.
.internal resolution is handled by something we call corro-dns, named after Corrosion, our state propagation and service discovery system. But because corro-dns is set as the default resolver for all Fly Machines, it also needs to handle non-.internal domains and resolve them recursively. Recursive resolution is notoriously hard to implement correctly, and we don't want to simply delegate to an existing public resolver like 1.1.1.1 or 8.8.8.8, since we could get rate-limited. Instead, on each server we run an unbound instance configured in recursion mode, and corro-dns simply forwards anything it can't handle to that local unbound instance.
This worked reasonably well for a long time. But as the number of machines we host has grown, we've started running into unbound performance issues from time to time. Every time this happens, we tune something in unbound's myriad performance-related settings. Even with a lot of tuning, though, we've been receiving reports of random DNS timeouts or failures, seemingly due to queuing during request spikes to unbound. And even with RFC 8767 optimizations turned on (serving expired cache entries when an upstream times out), unbound still times out on some queries, especially in regions with higher network latency.
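For context, the recursion and serve-stale behavior described above live behind a handful of unbound.conf options. A minimal sketch of such a loopback-only instance, assuming corro-dns is its sole client (the addresses and TTL values here are illustrative, not our production config):

```
server:
    # Listen only on loopback; the local DNS proxy is the only client.
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow

    # No forward-zone is configured, so unbound performs full recursion itself.

    # RFC 8767 serve-stale: answer with expired cache entries instead of failing.
    serve-expired: yes
    serve-expired-ttl: 3600              # seconds past expiry an entry may still be served
    serve-expired-client-timeout: 1800   # ms to wait on recursion before replying stale
```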
We’re sure there’s some combination of flags that would resolve these random errors, but since we already run our own DNS “proxy” (corro-dns) in front of unbound, we felt it would probably be easier to handle this in a codebase we understand well. As an experiment, we added a simple query deduplication and caching layer to corro-dns, applied before it sends anything to unbound, and enabled it in a couple of regions that generated a lot of alerts or user reports of DNS timeouts.
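To make the idea concrete, here's a minimal sketch of a dedup-plus-cache layer in Rust. This is not corro-dns's actual code; the names (`DedupCache`, `resolve`) and the string-based answers are illustrative. The point is the shape: a burst of identical concurrent queries coalesces so that only one of them reaches unbound, and the answer is cached for a short TTL.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

struct Entry {
    answer: String,
    expires: Instant,
}

#[derive(Default)]
struct Inner {
    cache: HashMap<String, Entry>,
    in_flight: HashSet<String>, // queries currently being resolved upstream
}

struct DedupCache {
    inner: Mutex<Inner>,
    done: Condvar,
    ttl: Duration,
}

impl DedupCache {
    fn new(ttl: Duration) -> Self {
        DedupCache { inner: Mutex::new(Inner::default()), done: Condvar::new(), ttl }
    }

    /// Resolve `name`, calling `upstream` (standing in for unbound) at most
    /// once per cache window, even under a burst of identical queries.
    fn resolve(&self, name: &str, upstream: impl FnOnce() -> String) -> String {
        let mut inner = self.inner.lock().unwrap();
        loop {
            if let Some(e) = inner.cache.get(name) {
                if e.expires > Instant::now() {
                    return e.answer.clone(); // fresh cache hit
                }
            }
            if !inner.in_flight.contains(name) {
                break; // we get to be the one that asks upstream
            }
            // Another thread is already resolving this name; wait for it,
            // then re-check the cache.
            inner = self.done.wait(inner).unwrap();
        }
        inner.in_flight.insert(name.to_string());
        drop(inner); // don't hold the lock across the upstream call

        let answer = upstream();

        let mut inner = self.inner.lock().unwrap();
        inner.cache.insert(
            name.to_string(),
            Entry { answer: answer.clone(), expires: Instant::now() + self.ttl },
        );
        inner.in_flight.remove(name);
        self.done.notify_all(); // wake any waiters for this name
        answer
    }
}

fn main() {
    let cache = DedupCache::new(Duration::from_secs(60));
    let mut upstream_calls = 0;
    let a = cache.resolve("example.com.", || { upstream_calls += 1; "192.0.2.1".into() });
    // A second query within the TTL is served from cache; upstream is untouched.
    let b = cache.resolve("example.com.", || { upstream_calls += 1; "192.0.2.2".into() });
    assert_eq!(a, b);
    assert_eq!(upstream_calls, 1);
    println!("upstream called {upstream_calls} time(s)");
}
```

The `Condvar` wait loop is what turns the cache into a dedup layer: waiters block until the winning thread publishes its answer, rather than each firing their own upstream query.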
That seems to have worked reasonably well and has taken a lot of load off unbound. Our own monitoring no longer complains about error spikes pretty much every single day, and unbound's metrics look significantly happier, with much shorter queue lengths and virtually no timeouts on its side. Since we've also implemented our own RFC 8767 optimization in corro-dns, even if unbound somehow still times out, those timeouts are unlikely to be visible to apps.
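The serve-stale fallback is simple to sketch as well. Again, this is a hypothetical illustration rather than our actual implementation: `answer` and the `Result`-returning `upstream` closure are stand-ins, and the TTL bookkeeping on refresh is elided. The essence is that an upstream failure falls back to an expired cache entry instead of surfacing an error to the client, which is the RFC 8767 behavior.

```rust
use std::collections::HashMap;
use std::time::Instant;

struct Entry {
    answer: String,
    expires: Instant,
}

/// `upstream` stands in for the query to unbound; `cache` may hold entries
/// whose TTL has already lapsed.
fn answer(
    name: &str,
    cache: &mut HashMap<String, Entry>,
    upstream: impl FnOnce() -> Result<String, ()>,
) -> Option<String> {
    if let Some(e) = cache.get(name) {
        if e.expires > Instant::now() {
            return Some(e.answer.clone()); // fresh hit, no upstream needed
        }
    }
    match upstream() {
        Ok(a) => {
            // Refresh the cache on success (real TTL handling elided here).
            cache.insert(name.to_string(), Entry { answer: a.clone(), expires: Instant::now() });
            Some(a)
        }
        // Upstream timed out or failed: fall back to a stale entry if we have one.
        Err(()) => cache.get(name).map(|e| e.answer.clone()),
    }
}

fn main() {
    let mut cache = HashMap::new();
    // An entry that is already expired (expires == insertion time).
    cache.insert("fly.io.".to_string(), Entry { answer: "192.0.2.1".into(), expires: Instant::now() });
    // unbound "times out" (Err), but the stale entry still answers the query.
    let a = answer("fly.io.", &mut cache, || Err(()));
    assert_eq!(a.as_deref(), Some("192.0.2.1"));
}
```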
We're now in the process of rolling this out to all regions, and hopefully you'll soon see improved DNS reliability in your machines as a result! This week (at the time of writing) has turned into an unexpected DNS week, but we're happy to have ticked a lot of issues off our list, for both public DNS and the internal one. As always, stay tuned for more updates from us!