London (lhr) region network issues?

I’m testing a DNS UDP server in several regions and so far it’s pretty smooth, but I’m getting frequent connection timeouts to the London-based instance, both from Pingdom and from a personal server in Linode’s London data center.

The London instance appears to be healthy otherwise as it’s calling home to the controlling web server on schedule.

Any ideas?

P.S. Eagerly awaiting the opening of tcp/53! :wink:

Taking a look.

It seems like one of our London servers is having some issues.

@jbarham The situation should’ve been resolved about an hour ago. Is it looking better from your end?

This might be a separate issue.

@jerome Unfortunately I’m still seeing connectivity issues to my London instance, even after a restart.

Outbound HTTP connections from my London instance work fine. It just seems that incoming DNS UDP requests/responses are frequently being dropped.

FWIW I also have instances running in iad and lax and Pingdom has not reported any connection problems with those instances.

Hey! I just took a closer look at LHR and UDP there seems to be working fine (I did manage to confuse myself while debugging, and while I was confused I flushed a route cache, so if your app is suddenly working again, that’d be good to know).

Can you tell me more about your app? What’s it named? I can take a closer look at it for you.

(You can hit us up privately if you’d like).

Still seeing the same issue after your route cache flush.

The app name is slickdns. It’s an authoritative DNS server. You can test it by running dig @213.188.193.79 wombatsoftware.com.

The same app is also running in LAX and IAD and Pingdom (checking every minute) has always been able to get a response from the instances in those regions.

Looking at the LHR logs, it looks like the DNS request is always getting through to the instance but the response is frequently getting dropped.

For added context here are the results I get when I test my Fly nameserver from my London Linode VM:

$ ping 213.188.193.79 
PING 213.188.193.79 (213.188.193.79) 56(84) bytes of data.
64 bytes from 213.188.193.79: icmp_seq=1 ttl=57 time=1.47 ms
64 bytes from 213.188.193.79: icmp_seq=2 ttl=57 time=1.46 ms
64 bytes from 213.188.193.79: icmp_seq=3 ttl=57 time=1.54 ms
^C
--- 213.188.193.79 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 1.459/1.487/1.537/0.047 ms

DNS query:

$ dig @213.188.193.79 wombatsoftware.com
; <<>> DiG 9.11.5-P1-1ubuntu2-Ubuntu <<>> @213.188.193.79 wombatsoftware.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2496
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;wombatsoftware.com.		IN	A

;; ANSWER SECTION:
wombatsoftware.com.	3600	IN	A	151.101.65.195
wombatsoftware.com.	3600	IN	A	151.101.1.195

;; Query time: 3 msec
;; SERVER: 213.188.193.79#53(213.188.193.79)
;; WHEN: Mon Mar 01 08:38:00 UTC 2021
;; MSG SIZE  rcvd: 104
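
For anyone reading the flags line in the dig output above: the following is a minimal sketch of how those header fields decode, assuming the standard 12-byte DNS header layout. The `parse_header` helper and the `FLAG_BITS` table are illustrative names, and the header bytes are rebuilt by hand from the values dig printed (id 2496, flags qr aa rd, 1 question, 2 answers), not captured from the wire. Having rd set without ra is what produces dig's "recursion requested but not available" warning, which is expected from an authoritative-only server.

```python
import struct

# Bit masks for the flags most relevant to the dig output (hypothetical table name).
FLAG_BITS = [
    ("qr", 0x8000),  # message is a response
    ("aa", 0x0400),  # authoritative answer
    ("tc", 0x0200),  # truncated
    ("rd", 0x0100),  # recursion desired (copied from the query)
    ("ra", 0x0080),  # recursion available
]


def parse_header(message: bytes) -> dict:
    """Decode the fixed 12-byte header of a DNS message."""
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {
        "id": ident,
        "flags": [name for name, bit in FLAG_BITS if flags & bit],
        "rcode": flags & 0x000F,  # 0 means NOERROR
        "counts": (qd, an, ns, ar),
    }


# Header rebuilt from the values dig printed, for illustration only.
header = struct.pack("!HHHHHH", 2496, 0x8500, 1, 2, 0, 0)
info = parse_header(header)
print(info["flags"])  # ['qr', 'aa', 'rd']
```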

The very low ping and response times, when I do get a response from my server, suggest that the Linode, Fly, and Pingdom servers are all in the same data center, or very close to each other. So maybe that proximity has some impact on the routing of Fly UDP traffic?

I just wanted to pop in here and say that it’s very cool you’re shipping an authoritative DNS service on Fly. :smiley:

We’ll get the weird timeouts in LHR figured out. What region is that Linode instance in?

Are you still getting pingdom failures? It seems like it’s working now, possibly because your app got a new VM.

Assuming it is working, if it happens again, will you check flyctl status and see if your instance is showing restarts > 0?

My Linode instance is in Linode’s London data center.

Note that currently I’m just testing my authoritative DNS service, I can’t actually ship it until tcp/53 is open. :wink:

Unfortunately I’m still seeing failures even after restart.

Again, it looks like the DNS request is always being received by my server; it’s just the response that is frequently being dropped.

FWIW I am seeing the same for YYZ running Pihole. I see my client request pop up in the Pihole dashboard but more often than not (subjectively) do not get a response back (I just ran a dig 10 times in a row and it worked half the time).
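
The "run dig 10 times and count" check described above can be approximated with a short script. This is only a sketch under a few assumptions: `build_query` and `count_replies` are hypothetical helper names, the query is a bare A-record lookup without the EDNS options dig sends by default, and a reply only counts if its transaction ID matches the query.

```python
import socket
import struct


def build_query(name: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record (no EDNS)."""
    # Header: id, flags (0x0100 = recursion desired, as dig sets by default),
    # then QDCOUNT=1 and zero answer/authority/additional counts.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # QNAME ends with a zero-length label; QTYPE=1 (A), QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def count_replies(server: str, name: str, attempts: int = 10,
                  timeout: float = 2.0, port: int = 53) -> int:
    """Send `attempts` UDP queries and return how many got a matching reply."""
    answered = 0
    for i in range(attempts):
        query_id = 0x1000 + i
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name, query_id), (server, port))
            try:
                data, _ = sock.recvfrom(4096)
            except OSError:
                continue  # timeout or network error: treat as no reply
            # Only count replies whose transaction ID matches the query.
            if len(data) >= 2 and struct.unpack("!H", data[:2])[0] == query_id:
                answered += 1
    return answered
```

Calling count_replies("213.188.193.79", "wombatsoftware.com") would return the number of answered queries out of ten, which gives a rough drop rate to compare between regions.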

Just to make sure, I’ve tried different upstream DNS servers as well as running Unbound on it for recursion, but no luck. I’m happy to help debug during US Eastern time!

EDIT: I am accessing it via public IP right now if that helps, not via Wireguard yet.

Alrighty, we’ve been able to replicate the problem. UDP responses are, indeed, going missing sometimes. We’re working on it!

I think we may have fixed it. Before I say more: did we fix it?

(I can bounce DNS requests off that wombat DNS thingy, from a host in London).

Confirmed! I now get a UDP response from my London DNS server, from my London Linode VM, every time.

Thanks for fixing this.

Just tested and I am still observing the issue in YYZ.

What’s the public IP of your app?

77.83.140.167

Poking at it. It’s not exactly the same case we saw with the LHR host, but it might be related.

(For the record: the case I’m talking about is super dumb; we just had a stray iptables rule that took forever for me to notice, because it only triggered from LHR to LHR.)

Just tested again and it seems to be resolved?

EDIT: Nope, but it was doing well for a while there!