We made DNS lookups from within Machines more reliable

DNS servers come in two flavors, authoritative and recursive, and here at Fly.io we run infrastructure for both. We run Anycast authoritative DNS for domains that we own, which is (mostly) fly.dev. On each server we also need a local authoritative DNS for the .internal pseudo-domain used for private networking, plus a recursive DNS server so Machines can reach every other domain on the internet.

.internal resolution is handled by something we call corro-dns, named after Corrosion, our state propagation and service discovery system. But because it's the default DNS server for all Fly Machines, it also needs to handle non-.internal domains and resolve them recursively. Recursive resolution is notoriously hard to implement correctly, and we don't want to simply delegate to an existing public resolver like 1.1.1.1 or 8.8.8.8, since we could get rate-limited. So instead, on each server we run an unbound instance configured in recursion mode, and corro-dns simply forwards anything it can't handle to that local unbound instance.
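The split described above amounts to a simple routing decision at query time. Here's a hypothetical sketch of the idea (the names are invented for illustration; corro-dns is not actually written in Ruby):

```ruby
# Hypothetical sketch of the routing decision described above: queries
# for the .internal pseudo-domain are answered locally from Corrosion
# state, everything else is forwarded to the local unbound recursor.
def route_query(name)
  if name.end_with?(".internal")
    :answer_from_corrosion
  else
    :forward_to_unbound
  end
end
```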

This worked reasonably well for a long time. But as the number of Machines we host has grown, we've started running into unbound performance issues from time to time. Every time this happens, we tune something in unbound's myriad performance-related settings. Even with a lot of tuning, though, we've kept receiving reports of random DNS timeouts or failures, seemingly due to queuing during request spikes to unbound. Even with RFC 8767 optimizations turned on (serving expired cache entries when an upstream query times out), unbound still times out some queries, especially in regions with higher network latency.

We’re sure there’s some combination of flags that would resolve these random errors, but since we’re already running our own DNS “proxy” (corro-dns) in front of unbound, we felt it would probably be easier to handle this in a codebase that we understand well. As an experiment, we added a simple query deduplication and caching layer in corro-dns before it sends anything to unbound, and enabled it in a couple of regions where we’d received a lot of alerts or user reports of DNS timeouts.
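Query deduplication of this kind is usually implemented as a "singleflight": concurrent lookups for the same name share one in-flight upstream request instead of each hitting unbound. A minimal Ruby sketch of the pattern (our own illustration, not corro-dns's actual code):

```ruby
# A minimal "singleflight" deduplicator: the first caller for a given
# key performs the upstream lookup; concurrent callers for the same key
# block and share that result. (Error handling is omitted for brevity.)
class Singleflight
  def initialize
    @mutex    = Mutex.new
    @inflight = {}  # key => [ConditionVariable, result, done?]
  end

  def run(key)
    entry  = nil
    leader = false
    @mutex.synchronize do
      entry = @inflight[key]
      unless entry
        entry = [ConditionVariable.new, nil, false]
        @inflight[key] = entry
        leader = true
      end
    end

    if leader
      result = yield                 # only the leader hits upstream
      @mutex.synchronize do
        @inflight.delete(key)
        entry[1] = result
        entry[2] = true
        entry[0].broadcast           # wake every waiting caller
      end
      result
    else
      @mutex.synchronize do
        entry[0].wait(@mutex) until entry[2]
      end
      entry[1]
    end
  end
end
```

Under a request spike, N simultaneous queries for the same name collapse into a single upstream query, which is where most of the load reduction comes from.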

That seems to have worked reasonably well and has taken a lot of load off unbound. Our own monitoring no longer complains about error spikes pretty much every single day, and unbound’s metrics are looking significantly happier, with much shorter queue lengths and virtually no timeouts on its side. Since we’ve also implemented our own RFC 8767 optimization in corro-dns, even if unbound still times out somehow, those timeouts are unlikely to be visible to apps.
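The RFC 8767 "serve stale" behavior boils down to keeping expired cache entries around and returning them when the upstream resolver fails. A toy version of the idea (an assumed shape for illustration, not the actual corro-dns implementation):

```ruby
# Toy RFC 8767-style cache: answers from fresh cache when possible, and
# falls back to a recently expired ("stale") entry if upstream fails.
class ServeStaleCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(max_stale: 3600)
    @max_stale = max_stale  # how long past expiry an entry stays usable
    @store = {}
  end

  def resolve(name, now: Time.now)
    entry = @store[name]
    return entry.value if entry && now < entry.expires_at  # fresh hit

    begin
      value, ttl = yield                 # ask upstream (e.g. unbound)
      @store[name] = Entry.new(value, now + ttl)
      value
    rescue StandardError
      # Upstream timed out or failed: serve stale if within the window.
      raise unless entry && now < entry.expires_at + @max_stale
      entry.value
    end
  end
end
```

With this in place, an upstream timeout only surfaces to the app when there is no usable stale entry at all, which matches the behavior described above.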

We’re now in the process of rolling this out to all regions, and hopefully you’ll soon see improved DNS reliability in your Machines as a result! This week (at the time of writing) has turned into an unexpected DNS week, but we’re happy to have ticked a lot of issues off our list, for both public DNS and the internal one. As always, stay tuned for more updates from us!


Thanks for sharing this. We started seeing intermittent DNS resolution failures about 5–6 days ago when resolving S3 virtual-hosted endpoints (e.g., bucket.s3.us-east-1.amazonaws.com). The failures manifested as hung requests in our app, which repeatedly exhausted instances and caused restarts. After several days of disruptions and a lot of support load, we mitigated by switching our S3 client to path‑style addressing and removing a custom endpoint override. That immediately stabilized things on our side, after many days of pain and taking stabs in the dark.

Your update about query deduplication/caching and “serve stale” in corro-dns lines up with what we saw, and we appreciate the rollout to improve reliability across regions.

If there’s any recommended best practice for endpoint styles on us-east-1 or guidance around expected resolver behavior, we’re all ears. And if anyone else sees this and can’t understand why external api calls may be erroring out… it could be related to this.

Which region are you seeing this from? This change should not be causing resolution failures (and from what I can see, it actually reduced most of them), unless there’s something very specific about the virtual-hosted-style endpoints from AWS. But then again, if you’re using the same endpoint every time rather than a different one, I don’t know how any change here would affect that. In the absence of process restarts, even a failed upstream DNS query should still allow at least a recently expired cache entry to be returned to your app.

Would you be comfortable sharing the region(s) your app is in and maybe the name of your app, and what’s the exact error you’re seeing? e.g. Is it a SERVFAIL, NXDOMAIN response from DNS, or simply no response at all?

(I also just took a look at what virtual-hosted-style domains from S3 look like; they seem fine, and it doesn’t really make sense why they would cause DNS lookups to hang indefinitely. Could it be that the vhost-style name resolved to a different IP, and the TCP connection to that IP is what’s actually hung here? It does seem like the vhost-style domain consistently resolves to a different set of IPs than the bare S3 domain.)

Thanks for the follow-up. Details below.

App/region

  • App: carebility-prod

  • Primary region: IAD (with some Machines in BOS as well). Not understanding the problem at first, we also scaled up in DFW for a while and scaled down IAD/BOS for a few days, but eventually switched back to splitting traffic between BOS and IAD.

What we saw

  • Started ~Sep 19–20 (5–6 days ago).

  • Intermittent resolution failures only on S3 virtual-hosted endpoints: carebility-assets-prod.s3.us-east-1.amazonaws.com.

  • App-level error surfaced as Excon::Error::Socket with nested Resolv::ResolvError: “no address for carebility-assets-prod.s3.us-east-1.amazonaws.com”.

  • We didn’t capture the raw DNS RCODE (SERVFAIL/NXDOMAIN). From the Ruby side it was a resolution failure (“no address”) rather than a rejection we could classify. In some bursts, we also observed hung upstream requests leading to Puma thread exhaustion and failing health checks, which triggered repeated restarts.

Why not everything failed

  • Imgix-backed image URLs were unaffected (no S3 DNS at render).

  • Some code paths used different hosts (global/path-style) and kept working, which made the incident look intermittent.

Mitigation (effective)

  • We changed our S3 client config to use path-style addressing and removed a custom endpoint override (which had worked for a year and a half previously). After that, errors stopped completely in our testing, and in production traffic after rollout. Honestly, a huge relief.

Your point about vhost-style resolving to different IPs than the global S3 host matches our observations: the failures only occurred on the vhost-style name. Path-style (global host) has been consistently healthy for us. That said, our app logs only showed the Ruby resolver error, not the DNS RCODE, so it’s possible part of what we experienced were downstream TCP hangs to a particular IP set as you suggested.

Happy to share more if needed. If helpful, we can temporarily flip a staging Machine back to vhost-style to capture exact DNS responses (including RCODEs) and packet captures for comparison.

Example error line from our logs:

Excon::Error::Socket (no address for carebility-assets-prod.s3.us-east-1.amazonaws.com (Resolv::ResolvError))
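For anyone else wanting to capture the actual RCODE instead of Ruby's generic "no address" error, the stdlib resolv library can build and parse raw DNS messages. A sketch (the nameserver address is an example; point it at whatever resolver you want to test):

```ruby
require "resolv"
require "socket"
require "timeout"

# Extract the RCODE (0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN) from a
# raw DNS response, so failures can be classified instead of surfacing
# as a generic Resolv::ResolvError.
def parse_rcode(data)
  Resolv::DNS::Message.decode(data).rcode
end

# Send a raw A query over UDP and return the response's RCODE.
def query_rcode(name, nameserver: "8.8.8.8", wait: 3)
  msg = Resolv::DNS::Message.new
  msg.rd = 1  # request recursion
  msg.add_question(name, Resolv::DNS::Resource::IN::A)

  sock = UDPSocket.new
  sock.send(msg.encode, 0, nameserver, 53)
  parse_rcode(Timeout.timeout(wait) { sock.recvfrom(4096).first })
ensure
  sock&.close
end
```

A `Timeout::Error` here would correspond to "no response at all", as distinct from a SERVFAIL or NXDOMAIN answer.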

Appreciate the recent corro-dns improvements and “serve stale” rollout you announced; our timeline aligns with that change window, and our path-style switch has fully stabilized things on our side.

The changes mentioned in this thread were actually not activated in iad during the time window mentioned. I can take another look tomorrow, but it does feel like something else was going wrong here. From the Ruby error it does look like something happened with DNS resolution, but I don’t think DNS failures would normally cause hung requests – they’d cause requests to fail outright. Otherwise it sounds like a bug in the framework you’re using.

It would definitely help if you can flip a staging app’s Machine back to the vhost-style URLs, and we’ll be able to see from our side what’s happening with the DNS.

I don’t know how odd this is, but can we continue over email? I don’t mind coming back here and providing documentation/resolution feedback afterward for anyone else who ends up following along.

  1. It was definitely DNS resolution issues, and it happened on more than one of our servers.
  2. Users started noticing slowdowns/intermittent errors Monday morning, not the weekend (we were wrong on the days, and Sentry confirms).
  3. We hadn’t deployed or updated any of our Ruby gems for several days prior, so something external happened, and we were scouring the web trying to figure out why.
  4. This post from you originally seemed to line up with what we were seeing (and similar-ish timing).

I found one possible explanation: it seems that our unbound recursor has occasionally been returning TTLs of 0 when resolving S3 domains. I’m not sure why this wasn’t a problem before – it’s probably not specific to the change mentioned in this post – but I observed it happen at least once. The fix is currently rolling out.
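A TTL of 0 effectively disables caching for that name – every lookup goes back upstream – which is why a burst of TTL-0 answers can surface as intermittent failures under load. A common fix is to clamp TTLs to a floor before caching; a sketch with illustrative bounds (we don't know the exact values used in the actual fix):

```ruby
# Clamp a record's TTL into [min, max] seconds before caching it, so a
# TTL of 0 from an upstream answer doesn't disable caching entirely.
def clamp_ttl(ttl, min: 5, max: 86_400)
  ttl.clamp(min, max)
end
```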

Hi,

I’m seeing Net::OpenTimeout / Faraday::ConnectionFailed errors when connecting from Fly (dfw) to login.salesforce.com:443.

  • It worked fine last night and still works locally.

  • Inside a Fly VM:

    • curl -4 (IPv4) succeeds.

    • curl -6 (IPv6) fails.

  • Ruby/Net::HTTP uses the order returned by DNS. If AAAA is returned first, the request hangs until timeout.

Is this related?

Thanks

No, this is likely unrelated. We cannot affect whether AAAA or A records are returned or attempted first – that depends entirely on your client and its DNS configuration.

Your Net::OpenTimeout error is also not DNS-related; that’s likely a TCP issue. If you have a support plan, reaching out to support is usually the fastest option; otherwise, please open a new topic on the community forum with additional details, such as which region you’re seeing this from.
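As a client-side workaround when the IPv6 path to a particular host is broken, you can resolve an A record yourself and hand the literal address to Net::HTTP via its `ipaddr=` accessor, while the hostname is still used for SNI and certificate verification. A sketch (the host is an example, and this assumes the IPv4 path is healthy):

```ruby
require "net/http"
require "resolv"

# Work around a broken IPv6 path by connecting over IPv4 explicitly:
# resolve an A record ourselves and give the literal address to
# Net::HTTP with #ipaddr=. TLS verification and SNI still use the
# original hostname passed to Net::HTTP.new.
def get_over_ipv4(host, path = "/")
  ipv4 = Resolv::DNS.open do |dns|
    dns.getresource(host, Resolv::DNS::Resource::IN::A).address.to_s
  end

  http = Net::HTTP.new(host, 443)
  http.use_ssl = true
  http.ipaddr = ipv4        # TCP connects here instead of re-resolving
  http.start { |h| h.get(path) }
end
```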

Just to provide an update here: the AWS S3 resolution failures have also been fixed. This isn’t really related to what was done in the Fresh Produce; it’s a long-standing bug that got exposed around the same time this new feature was being rolled out. You should no longer be seeing S3 resolution failures.

We probably still have a bunch of DNS standard conformance work to do going forward – stay tuned for more updates!