Great feature, by the way
So it seems that when using the canary release strategy, the canary-VM (a short-lived machine that stops as soon as its health checks pass) is also discovered and included in the DNS response for the web process:
dig +short AAAA web.process.<region>.<app-name>.internal
[machine-xxxxx-ipv6]
[machine-xxxxx-ipv6]
[canary-vm-ipv6]
Considering there is a small delay in propagation (which is okay, and not the issue here), listing the canary-VM causes problems: by the time another app tries to use it, it is guaranteed not to exist anymore, since these machines are short-lived by definition.
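As a client-side mitigation while the record is stale, an app can try each returned address in turn and skip the ones that no longer accept connections. A minimal sketch (the function name and the idea of passing a pre-resolved address list are my own, not anything Fly.io provides):

```python
import socket

def first_reachable(addresses, port, timeout=1.0):
    """Try each resolved address in turn and return the first one that
    accepts a TCP connection, skipping stale entries such as a
    canary-VM that has already stopped."""
    for addr in addresses:
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                return addr
        except OSError:
            continue  # stale or unreachable; try the next address
    raise ConnectionError(f"no reachable instance among {addresses!r}")
```

This only papers over the symptom, of course; the stale canary address still costs a connection timeout before the fallback kicks in.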
I was not expecting the canary-VM to be listed in the web process DNS response, as these machines are temporary and part of the Fly.io release process rather than actual instances of the app. By the time they show up in the DNS query they are already in the process of stopping, so it's essentially an invalid address.
Of course, the current behaviour would make sense if, instead of stopping the newly created canary-VM, the old machines running the previous version were stopped/swapped. However, this is not the case: the canary-VM is only used to ensure that the health checks pass before proceeding with the rolling release, and is then stopped.
So maybe they shouldn’t be included in the group-aware internal DNS responses.
Not related, but there are bug reports here:
- Process group-aware internal DNS: route between processes with ease! - #5 by containerops
- Querying Instance(s) from Specific Process Group in fly.toml and .internal DNS - #2 by containerops
web.process.<app>.internal only returns one machine:
dig +short AAAA <app>.internal
[ip1]
[ip2]
dig +short AAAA <group-name>.process.<app>.internal
[ip2]
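To make the discrepancy concrete, a tiny sketch that diffs the two answers above (the address lists stand in for the dig output; nothing here calls Fly.io):

```python
def missing_from_group(app_records, group_records):
    """Return addresses present in the app-wide DNS answer but absent
    from the group-aware answer; a non-empty result reproduces the bug."""
    return sorted(set(app_records) - set(group_records))

# With the answers from the dig queries above:
missing_from_group(["ip1", "ip2"], ["ip2"])  # → ["ip1"]
```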
Including the region, <group-name>.process.<region>.<app-name>.internal, is a workaround:
dig +short AAAA <app>.internal
[ip1]
[ip2]
dig +short AAAA <group-name>.process.<region>.<app-name>.internal
[ip1]
[ip2]
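If the app spans several regions, the workaround means querying the region-qualified name once per region and merging the answers. A sketch under the assumption that the deployed regions are known up front; resolve is a hypothetical stand-in for an AAAA lookup:

```python
def group_addresses(resolve, group, app, regions):
    """Work around the group-aware lookup bug by querying the
    region-qualified name for every known region and merging results."""
    addrs = set()
    for region in regions:
        name = f"{group}.process.{region}.{app}.internal"
        addrs.update(resolve(name))
    return sorted(addrs)
```

For example, with a stub resolver returning one machine per region, group_addresses(resolver, "web", "myapp", ["ams", "fra"]) yields the union of both answers.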