I seem to be having nothing but DNS-related issues lately. Alongside ".internal DNS occasionally stops working for some apps?", I am now having issues with the TXT record for regions.<app-name>.internal.
I have a NATS cluster set up using the NATS guide (Global NATS Cluster · Fly Docs), which was working fine for a while. I only have this set up in dev at the moment, so I'm not sure when it broke, but when I came back to it today it was not working.
The logs are spamming these lines constantly:
2022-07-15T09:50:49.785 app[ff913d1e] mia [info] nats-server | [528] [ERR] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 - gid:23267 - Failing connection to gateway "lhr", remote gateway name is "mia"
2022-07-15T09:50:49.785 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 - gid:23267 - Gateway connection closed: Wrong Gateway
2022-07-15T09:50:49.785 app[ff913d1e] mia [info] nats-server | [528] [ERR] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:37170 - gid:23268 - authentication error
2022-07-15T09:50:49.786 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:37170 - gid:23268 - Gateway connection closed: Authentication Failure
2022-07-15T09:50:49.787 app[ff913d1e] mia [info] nats-server | [528] [INF] Connecting to implicit gateway "lhr" (fly-local-6pn:7222) at [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 (attempt 1)
2022-07-15T09:50:49.787 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:7222 - gid:23269 - Creating outbound gateway connection to "lhr"
2022-07-15T09:50:49.788 app[ff913d1e] mia [info] nats-server | [528] [INF] [fdaa:0:6f00:a7b:2c00:ff91:3d1e:2]:37172 - gid:23270 - Processing inbound gateway connection
I have two regions currently set up (lhr and mia). Restarting instances appears to result in one of two different errors popping up. Either the one listed above, or an error from nats about an empty gateway name.
I have dug through the Go code in the NATS example and found that it looks up the TXT record of regions.<app-name>.internal. Sometimes this returns no results, so the Go code gets an empty string and creates a malformed NATS config with a blank gateway name, which explains that error from NATS. Other times it returns only a subset of the actual regions.
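For anyone hitting the same thing, a guard along these lines might avoid the blank gateway name. This is only a sketch, not the code from the NATS example: parseRegions and lookupRegions are hypothetical helpers, and it assumes the TXT record holds a comma-separated list of region codes and that FLY_APP_NAME and FLY_REGION are set on the instance. It retries on an empty answer and always includes the instance's own region, but it cannot tell when the answer is missing a remote region.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"sort"
	"strings"
	"time"
)

// parseRegions turns the TXT strings into a sorted, de-duplicated region list.
// It returns an error when no region is present, so the caller can retry
// instead of writing a config with a blank gateway name.
// self is this instance's own region; it is added if the record omits it.
func parseRegions(txts []string, self string) ([]string, error) {
	seen := map[string]bool{}
	for _, txt := range txts {
		// Assumption: regions are comma-separated inside each TXT string.
		for _, r := range strings.Split(txt, ",") {
			if r = strings.TrimSpace(r); r != "" {
				seen[r] = true
			}
		}
	}
	if len(seen) == 0 {
		return nil, errors.New("TXT record contained no regions")
	}
	if self != "" {
		seen[self] = true
	}
	regions := make([]string, 0, len(seen))
	for r := range seen {
		regions = append(regions, r)
	}
	sort.Strings(regions)
	return regions, nil
}

// lookupRegions queries regions.<app>.internal and retries up to `attempts`
// times, one second apart, when the lookup fails or comes back empty.
func lookupRegions(ctx context.Context, app, self string, attempts int) ([]string, error) {
	name := fmt.Sprintf("regions.%s.internal", app)
	lastErr := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		txts, err := net.DefaultResolver.LookupTXT(ctx, name)
		if err == nil {
			var regions []string
			if regions, err = parseRegions(txts, self); err == nil {
				return regions, nil
			}
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	return nil, fmt.Errorf("lookup %s: %w", name, lastErr)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	regions, err := lookupRegions(ctx, os.Getenv("FLY_APP_NAME"), os.Getenv("FLY_REGION"), 5)
	if err != nil {
		// Exit rather than generate a NATS config from an empty region list.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(strings.Join(regions, ","))
}
```

Exiting on failure lets the platform restart the process and try again, rather than leaving NATS running with a malformed gateway block.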
Example of the (seemingly malformed?) response to a TXT record query when nats is reporting the gateway name error:
$ dig txt regions.aircast-nats-dev.internal
;; Warning: Message parser reports malformed message packet.
; <<>> DiG 9.16.27-Debian <<>> txt regions.aircast-nats-dev.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6403
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; WARNING: Message has 23 extra bytes at end
;; QUESTION SECTION:
;regions.aircast-nats-dev.internal. IN TXT
;; Query time: 4 msec
;; SERVER: fdaa::3#53(fdaa::3)
;; WHEN: Fri Jul 15 09:35:21 UTC 2022
;; MSG SIZE rcvd: 86
Example of what's returned (SSHed into the mia instance) when the mia instance is struggling to connect to lhr:
$ dig txt regions.aircast-nats-dev.internal
; <<>> DiG 9.16.27-Debian <<>> txt regions.aircast-nats-dev.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41926
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 9253152c41b5ee43 (echoed)
;; QUESTION SECTION:
;regions.aircast-nats-dev.internal. IN TXT
;; ANSWER SECTION:
regions.aircast-nats-dev.internal. 5 IN TXT "mia"
;; Query time: 4 msec
;; SERVER: fdaa::3#53(fdaa::3)
;; WHEN: Fri Jul 15 09:46:10 UTC 2022
;; MSG SIZE rcvd: 90
For whatever reason the regions.<app-name>.internal record is sometimes not reporting all regions and sometimes not even reporting itself. And the NATS cluster example really doesn't seem to cope when this is messed up.
I should note that queries of regions.<app-name>.internal performed from my machine, connected via WireGuard, show all the regions correctly.