Fly-ifying our internal DNS

Keen readers of this forum probably know that we run a DNS server on each physical host, answering both internal and external DNS requests. The external part isn’t simply delegated to a public resolver like 1.1.1.1: we’d get rate-limited quite fast. Instead, we run a full caching recursive DNS resolver locally on each physical host. This makes us less likely to get rate-limited on someone else’s terms, but a recursive DNS resolver is heavy, and under load it can start dropping requests, manifesting as DNS lookup errors inside machines. Some apps resolve DNS more than others, which means some hosts see DNS lookup errors more than others, with no way to “help” each other.

The internal resolver has a related, though not identical, problem. Since we started regionalizing our internal state, resolving an .internal domain no longer just means querying Corrosion locally: we have to fan out DNS resolution requests to other regions in order to, for example, get the list of machines available in a remote region. For the longest time, this was implemented by simply picking a few hosts at random in the remote region and trying them one by one until we got a successful response. Here’s the problem: the internal DNS resolver, which we call corro-dns, has no idea whether a remote host is even healthy. Sometimes you can technically reach a host, but it’s not in a state where you should expect any good response from it; and even if a host is up, a remote node has no idea whether that host’s local DNS service is. That information is managed by fly-proxy, our load balancer.
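To make the old fan-out concrete, here’s a rough sketch (all names and error handling are illustrative, not our actual code): sample a few hosts at random in the remote region and try each in turn, with no health information whatsoever.

```python
import random

def resolve_in_remote_region(hosts, query, send, tries=3):
    """Sketch of the old fan-out. `send` is a hypothetical function
    that performs one DNS query against a single remote host.
    Note: no notion of host health anywhere in this loop."""
    for host in random.sample(hosts, min(tries, len(hosts))):
        try:
            return send(host, query)
        except OSError:
            # Unreachable or timed out; blindly move on to the next host.
            continue
    raise RuntimeError("no remote host answered the DNS query")
```

The failure mode falls out directly: a host can accept the connection yet be in no state to answer usefully, and this loop can’t tell the difference.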

These two problems used to be somewhat rare, but alerts related to them became more and more common as we grew. The internal resolver was the worst offender: any time we had even the tiniest flappiness in networking between two regions, we’d get a wall of alerts about it misbehaving.

Why didn’t we route everything through fly-proxy in the first place? Because it is meant to serve Fly apps, so for the longest time we couldn’t leverage its load-balancing behavior for anything running outside a Fly app, and that includes anything living directly on hosts. This has in fact changed somewhat recently due to unrelated work (which I’m sure you’ll hear about!), and we can now route basically anything through the fly-proxy LB, with one caveat: it is TCP/HTTP-only. UDP services are still not first-class on our platform: they are implemented differently and there is really no “load balancer” for them (this may change in the future!).

The way we worked around this is that we basically implemented DNS-over-HTTPS internally, just without the SSL part. Why HTTP, and not plain TCP? Because with HTTP we can use cool proxy features such as fly-force-region to specify the region in which a DNS request must be handled. This lets external DNS requests be handled only by the local region (so the request stays local instead of getting resolved on the other side of the world), while internal DNS uses it to make sure it is talking to the correct region. As a bonus, we also get automatic retries, health checks, and fallback routing through a third region in case of network failures: basically all the features a real Fly app gets for free.
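Here’s a rough sketch of what such an internal DNS-over-HTTP request could look like. The wire-format encoding is standard DNS; fly-force-region is the proxy feature mentioned above, while the endpoint path and media type are my illustrative stand-ins, not our actual values.

```python
import struct

def encode_dns_query(name, qtype=1, txid=0x1234):
    """Encode one DNS question in standard wire format -- the same
    payload an HTTP POST body would carry instead of a UDP datagram."""
    # Header: id, flags (recursion desired), qdcount=1, an/ns/ar counts = 0.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode() for label in name.split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", qtype, 1)  # qtype, class IN
    return header + question

def build_request(name, region):
    """Assemble a hypothetical HTTP request: the proxy routes it like any
    other app request, and fly-force-region pins the handling region."""
    return {
        "method": "POST",
        "path": "/dns-query",                           # illustrative path
        "headers": {
            "fly-force-region": region,                 # proxy feature from the post
            "content-type": "application/dns-message",  # assumed media type
        },
        "body": encode_dns_query(name),
    }
```

Because the query rides over HTTP, the proxy can retry it, health-check the backend, or reroute it just like traffic for any other app.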

As of now, this has been deployed and enabled across our fleet. I personally haven’t seen a DNS-related flappy alert in a while, so I’m feeling a lot better about our DNS right now. Of course, there is still a lot planned in terms of rearchitecting our internal services to be more resilient and performant, and I’m sure we’ll write about that too once the time comes!

4 Likes

Interesting, thanks for the updates.

I’d like to bring up a point that I’m not sure is strictly related to DNS. I imagine it’s more related to the load balancer (unless there’s some possibility of doing this behavior via DNS).

Here’s the scenario:

  • Imagine I have an application with 2 machines to balance load/traffic.

  • Let’s say their soft limit is configured to 5 requests.

  • The application is below the soft limit, but there are still 2 machines active.

  • In that case, why not balance traffic between the two active machines instead of redirecting all traffic to only one?

What I observe is that when I have two machines running, even if the load is below the soft limit, all traffic is routed to the same machine, and the second one only starts receiving traffic when concurrency increases.

I would expect something closer to round-robin behavior, distributing requests between the active machines even when the load is below the soft limit.

PS: I know about auto-stop machines, that’s not the point here. I’m wondering whether something like this is on the roadmap to be implemented in the future.

Please start a different topic for this, but the quick answer is that I can’t really say anything based on this description. If your two machines are in the same region, the proxy does already round-robin traffic between them, as long as both are healthy. However, if they are in different regions, the proxy will always prioritize machines in the region closest to the client.
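A toy model of the selection described above (all names are hypothetical, and the real proxy weighs far more signals than this): filter to healthy machines, prefer the client’s region, then round-robin within that set.

```python
from itertools import count

_rr = count()  # round-robin cursor shared across picks

def pick_machine(machines, client_region):
    """Toy selection sketch: healthy machines only, prefer the client's
    region (exact match stands in for 'closest'), round-robin within
    the preferred set, falling back to all healthy machines."""
    healthy = [m for m in machines if m["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy machines")
    nearby = [m for m in healthy if m["region"] == client_region] or healthy
    return nearby[next(_rr) % len(nearby)]
```

With two healthy machines in the client’s region, successive picks alternate between them; an unhealthy machine simply drops out of the rotation.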

If you think that this is happening even with machines in the same region, having your app name and/or machine IDs would be helpful. But again, please start a new topic for this.

2 Likes