DNS query for regions.<app-name>.internal not always valid

I seem to just be having DNS-related issues lately. Alongside .internal DNS occasionally not resolving for some apps, I am now having issues with the TXT record for regions.<app-name>.internal.

So I have a NATS cluster set up using the NATS guide: Global NATS Cluster · Fly Docs
It was working fine for a while. I have this set up only in dev at the moment, so I’m not sure when it broke, but coming back to it today it was not working.

The logs are spamming these lines constantly:

2022-07-15T09:50:49.785 app[ff913d1e] mia [info] nats-server | [528] [ERR] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 - gid:23267 - Failing connection to gateway "lhr", remote gateway name is "mia"
2022-07-15T09:50:49.785 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 - gid:23267 - Gateway connection closed: Wrong Gateway
2022-07-15T09:50:49.785 app[ff913d1e] mia [info] nats-server | [528] [ERR] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:37170 - gid:23268 - authentication error
2022-07-15T09:50:49.786 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:37170 - gid:23268 - Gateway connection closed: Authentication Failure
2022-07-15T09:50:49.787 app[ff913d1e] mia [info] nats-server | [528] [INF] Connecting to implicit gateway "lhr" (fly-local-6pn:7222) at [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 (attempt 1)
2022-07-15T09:50:49.787 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 - gid:23269 - Creating outbound gateway connection to "lhr"
2022-07-15T09:50:49.788 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:37172 - gid:23270 - Processing inbound gateway connection

I have two regions currently set up (lhr and mia). Restarting instances appears to result in one of two different errors popping up: either the one listed above, or an error from NATS about an empty gateway name.

I have dug through the Go code in the NATS example and found that it looks up the TXT record for regions.<app-name>.internal. Sometimes this lookup returns no results, causing the Go code to get an empty string and generate a malformed NATS config with a blank gateway name, which explains that second error. Other times it returns only a subset of the actual regions.
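
For reference, here is a minimal sketch of that lookup (the lookupRegions name, the comma-splitting, and the empty-result guard are my own; the actual example feeds the raw result straight into the gateway config):

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"strings"
	"time"
)

// lookupRegions resolves the regions.<app-name>.internal TXT record and
// splits the comma-separated region list. The empty-result guard at the
// end is my addition: it's the check whose absence seems to produce the
// blank gateway name in the generated NATS config.
func lookupRegions(ctx context.Context, appName string) ([]string, error) {
	records, err := net.DefaultResolver.LookupTXT(ctx, "regions."+appName+".internal")
	if err != nil {
		return nil, err
	}
	var regions []string
	for _, record := range records {
		for _, region := range strings.Split(record, ",") {
			if region = strings.TrimSpace(region); region != "" {
				regions = append(regions, region)
			}
		}
	}
	if len(regions) == 0 {
		return nil, fmt.Errorf("regions TXT record returned no regions")
	}
	return regions, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	regions, err := lookupRegions(ctx, "aircast-nats-dev")
	if err != nil {
		log.Fatalf("region lookup failed: %v", err)
	}
	fmt.Println("regions:", regions)
}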

Example of the (seemingly malformed?) response to a TXT record query while NATS is reporting the blank gateway name error:

$ dig txt regions.aircast-nats-dev.internal       
;; Warning: Message parser reports malformed message packet.

; <<>> DiG 9.16.27-Debian <<>> txt regions.aircast-nats-dev.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6403
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; WARNING: Message has 23 extra bytes at end

;; QUESTION SECTION:
;regions.aircast-nats-dev.internal. IN  TXT

;; Query time: 4 msec
;; SERVER: fdaa::3#53(fdaa::3)
;; WHEN: Fri Jul 15 09:35:21 UTC 2022
;; MSG SIZE  rcvd: 86

Example of what’s returned (SSHed into the mia instance) when the mia instance is struggling to connect to lhr:

$ dig txt regions.aircast-nats-dev.internal

; <<>> DiG 9.16.27-Debian <<>> txt regions.aircast-nats-dev.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41926
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 9253152c41b5ee43 (echoed)
;; QUESTION SECTION:
;regions.aircast-nats-dev.internal. IN  TXT

;; ANSWER SECTION:
regions.aircast-nats-dev.internal. 5 IN TXT     "mia"

;; Query time: 4 msec
;; SERVER: fdaa::3#53(fdaa::3)
;; WHEN: Fri Jul 15 09:46:10 UTC 2022
;; MSG SIZE  rcvd: 90

For whatever reason, the regions.<app-name>.internal record sometimes doesn’t report all regions, and sometimes doesn’t even report the region it is queried from. And the NATS cluster example really doesn’t cope when this record is wrong.
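
If the empty responses are only transient, a retry wrapper around the lookupRegions sketch above would at least avoid writing a config with a blank gateway name (entirely my own workaround idea, not something in the example):

// retryRegions polls the TXT record until it returns a non-empty region
// list, rather than trusting the first (possibly empty) response. The
// attempt count and 2-second backoff are arbitrary choices.
func retryRegions(ctx context.Context, appName string, attempts int) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		regions, err := lookupRegions(ctx, appName)
		if err == nil {
			return regions, nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
	return nil, lastErr
}

Though that still wouldn’t help with the subset case, since a non-empty-but-incomplete answer looks exactly like a valid one.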

I should note that queries of regions.<app-name>.internal performed from my machine, connected via WireGuard, show all the regions correctly.

I’m starting to think lhr is just cursed

This is super helpful information, thank you.

This is basically the #1 thing we’re trying to suss out right now.

Do you happen to know which VM IDs you were connected to when you ran your dig commands?

I’ve been messing around with the servers for a while trying to get something to work, so I don’t know what the IDs were for those commands. However, I’m still seeing similar behaviour in the latest instances of these VMs, so here is the current state:

I currently have three regions: SYD (dc3f2f51), DFW (4ec30252), and LHR (454c8bc0).
They seem to be in a state where they can all see SYD and DFW, but none of them can see LHR (LHR can’t even see itself). By "see" I mean that LHR doesn’t appear in the regions list and isn’t accessible via any of the internal DNS addresses; it is, however, accessible using its IPv6 address directly.
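
In case it helps, these are roughly the checks I mean, using the per-region <region>.<app-name>.internal AAAA names from Fly’s private DNS. When LHR is "invisible", the lhr lookup returns nothing even though the instance still responds on its IPv6 address directly:

$ dig +short txt regions.aircast-nats-dev.internal
$ dig +short aaaa lhr.aircast-nats-dev.internal
$ dig +short aaaa syd.aircast-nats-dev.internal
$ dig +short aaaa dfw.aircast-nats-dev.internal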

Ok great, this is enough for at least 2 hours of debugging, which I will go do now.

Are you ok leaving those running for a while?

Yep, that’s okay.

Ok I think we found the issue. We had two London hosts with conflicting identifiers. This was causing other hosts to overwrite DNS data for apps, or miss it entirely.

This definitely means London was cursed. It should be working smoothly now.

I really appreciate you debugging this with us (and leaving the weird stuff in place so we could look at it). Definitely post if things haven’t improved, or if you see any other weirdness.

Brilliant, looks like the nodes can all see each other now! Thanks a lot for fixing this.