Process group-aware internal DNS: route between processes with ease!

TL;DR

You can now use <groupname>.process.<appname>.internal as a hostname, and our internal DNS will resolve it to the machines in a specific process group of an app.

What’s this for?

Sometimes, you need to run multiple processes (for example, Sidekiq alongside a Rails web server). Process groups let you do this, and they work pretty well.

One thing they haven’t been particularly great at, though, is communication. Processes can talk to the app as a whole, but there was no way to address a specific process group. In particular, we’ve seen many cases recently where people have apps with a frontend component, a backend component, and a requirement to update them together. Usually we’d suggest keeping these as separate apps, but if they have to update together, that’s not a good fit.

Now, you can deploy an app with a frontend and a backend process group, and the frontend process can communicate with the backend process via backend.process.<appname>.internal.
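To make the naming scheme concrete, here’s a small sketch (the helper and app/group names are my own, not part of the announcement) that builds these hostnames:

```python
def process_group_host(group: str, app: str) -> str:
    """Build the internal DNS name for a Fly.io process group:
    <group>.process.<app>.internal
    """
    return f"{group}.process.{app}.internal"

# For the frontend/backend example above:
# process_group_host("backend", "myapp") -> "backend.process.myapp.internal"
```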

Demonstration


Let’s take a look at this in action.

I have a simple fly.toml:

app = "ali-process-dns-example"
primary_region = "atl"

[processes]
  a = "sleep inf"
  b = "sleep inf"
[build]
  image = "ubuntu"

After running fly launch --now --ha=false, I should have one “a” and one “b” machine.

❯ fly m list
2 machines have been retrieved from app ali-process-dns-example.
View them in the UI here

ali-process-dns-example
ID            	NAME              	STATE  	REGION	IMAGE                	IP ADDRESS                    	VOLUME	CREATED             	LAST UPDATED        	APP PLATFORM	PROCESS GROUP	SIZE
148edd40fed989	delicate-dawn-5150	started	atl   	library/ubuntu:latest	fdaa:1:a82a:a7b:e6:3aa1:9d6b:2	      	2023-05-23T19:52:13Z	2023-05-23T19:52:14Z	v2          	a            	shared-cpu-1x:256MB
e2865502c4d286	withered-pond-973 	started	atl   	library/ubuntu:latest	fdaa:1:a82a:a7b:e5:ff66:ea2:2 	      	2023-05-23T19:52:26Z	2023-05-23T19:52:26Z	v2          	b            	shared-cpu-1x:256MB

I’ll hop into one of them (let’s go with the “a” machine) and install dig, a tool for performing manual DNS lookups.

❯ fly console --machine 148edd40fed989
Connecting to fdaa:1:a82a:a7b:e6:3aa1:9d6b:2... complete
root@148edd40fed989:/# apt update && apt install -y dnsutils

Now, I should be able to look up “b”'s IP.

root@148edd40fed989:/# dig +short AAAA b.process.ali-process-dns-example.internal
fdaa:1:a82a:a7b:e5:ff66:ea2:2

Lo and behold, it’s the same IP that fly m list gave us!

Of course, you don’t have to use dig to do this. It’s just a hostname, and our DNS is the part doing the magic :slight_smile:
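In application code, the same lookup is just an ordinary DNS query. As a hedged sketch, here’s how a Python service might gather every address for a name using only the standard library (the commented hostname is the example app’s; nothing Fly-specific is required):

```python
import socket

def resolve_all(hostname: str, family: int = socket.AF_UNSPEC) -> list[str]:
    """Return every address the system resolver knows for hostname.

    For .internal names on Fly.io you'd pass family=socket.AF_INET6,
    since they resolve to private IPv6 addresses.
    """
    infos = socket.getaddrinfo(hostname, None, family, socket.SOCK_STREAM)
    # Entry 4 of each result tuple is the sockaddr; its first element is the address.
    return sorted({info[4][0] for info in infos})

# Inside the "a" machine, this would return b's IP:
# resolve_all("b.process.ali-process-dns-example.internal", socket.AF_INET6)
```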

In the real world, you probably won’t be spinning up machines to run dig (not judging if that’s your hobby), but this is a nice building block for connecting pieces of a larger application together!


Sorry, this is somewhat related but not really on topic. Feel free to split it into a different topic.
I’ve been trying to get an answer from email support with no luck.

I’ve been seeing issues related to the internal DNS resolver when used together with internal apps that have standby machines. The internal DNS name <appname>.internal seems to return both the running machine and the stopped (standby) machine. Could this be a regression related to new features added to the resolver?

I can safely say they’re unrelated, this was deployed only like an hour before my post went live.

I’m sorry that you haven’t had a great experience with support - what I can say is that they’ve raised the issue internally, and it’s been identified as a bug. I don’t have any more information yet, but support should keep you posted as we try to work out a fix.

Thanks @allison. Glad it’s been identified as a bug already!

I’m having some issues with this feature :thinking:

It looks like only one machine is returned:

dig +short AAAA <app>.internal
[ip1]
[ip2]

dig +short AAAA web.process.<app>.internal
[ip2]

When <region> is present in the DNS query (<group-name>.process.<region>.<app-name>.internal), the result includes all machines in the region :ok_hand:

Unfortunately, when using the canary release strategy, the canary machines show up in the query results for a few seconds.

Unfortunately, when using the canary release strategy, the canary machines show up in the query results for a few seconds.

@containerops Can you explain a bit more? What are you seeing? What do you expect to see?


Great feature, by the way :purple_heart:


So it seems that when using the canary release strategy, the canary-VM (a short-lived machine that will stop as soon as the health checks pass) will also be discovered and included in the DNS response for the web process:

dig +short AAAA web.process.<region>.<app-name>.internal
[machine-xxxxx-ipv6]
[machine-xxxxx-ipv6]
[canary-vm-ipv6]

Considering there is a small delay in propagation (which is okay, not the issue here), the listing of the canary-VM causes some issues because when another app tries to use it, it is guaranteed not to exist anymore, as these machines are short-lived by definition.

I was not expecting to see the canary-VM being listed in the web process DNS response, as these machines are temporary and part of the Fly.io release process, rather than being actual instances of the app. These machines would already be in the process of stopping by the time they are shown in the DNS query, so it’s essentially an invalid address.

Of course, the current behaviour would make sense if, instead of stopping the newly created canary-VM, the old machines running the previous version were stopped/swapped. However, this is not the case - the canary-VM is only used to ensure that the health checks pass before proceeding with the rolling release, and then stopped.

So maybe they shouldn’t be included in the group-aware internal DNS responses.


Not related, but there is a bug report here:

  1. Process group-aware internal DNS: route between processes with ease! - #5 by containerops
  2. Querying Instance(s) from Specific Process Group in fly.toml and .internal DNS - #2 by containerops

web.process.<app>.internal only returns one machine:

dig +short AAAA <app>.internal
[ip1]
[ip2]

dig +short AAAA <group-name>.process.<app>.internal
[ip2]

Including the region, as in <group-name>.process.<region>.<app-name>.internal, is a workaround:

dig +short AAAA <app>.internal
[ip1]
[ip2]

dig +short AAAA <group-name>.process.<region>.<app-name>.internal
[ip1]
[ip2]
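As a sketch of the workaround above, the region-qualified form can be built the same way as the plain process-group name (a hypothetical helper; the group/region/app names are illustrative):

```python
def region_process_host(group: str, region: str, app: str) -> str:
    """Build the region-qualified process-group name:
    <group>.process.<region>.<app>.internal
    """
    return f"{group}.process.{region}.{app}.internal"

# region_process_host("web", "atl", "myapp")
# -> "web.process.atl.myapp.internal"
```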

So it seems that when using the [canary release strategy], the canary-VM (a short-lived machine that will stop as soon as the health checks pass) will also be discovered and included in the DNS response for the web process:

Thx for the thorough explanation. We’ll take a look at that behaviour.

Not related, but there is a bug report here:

Thx for flagging! Will find out if it’s intended behaviour and release a fix if not.

Your bug is actually three bugs standing on top of each other in a trench coat. Here’s a status update:

  1. Our DNS resolver doesn’t support multiple chained queries (e.g. condition-1.condition-2.<app>.internal). It should probably return an error for this query.

  2. Process group filtering should return multiple entries. For now, you can use <group>.fly_process_group.kv._metadata.<appname>.internal. It’s a bit verbose and theoretically a tiny bit slower, but it does the same thing and this query does return multiple entries.

  3. Canary machines shouldn’t register DNS entries. This is a flyctl bug, with a fix already implemented and shipping in the next version.
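For the workaround in item 2, the verbose metadata-based query name can be constructed like this (a sketch; the function name and example names are mine):

```python
def metadata_process_host(group: str, app: str) -> str:
    """Workaround name that filters machines by the fly_process_group
    metadata key: <group>.fly_process_group.kv._metadata.<app>.internal
    """
    return f"{group}.fly_process_group.kv._metadata.{app}.internal"

# metadata_process_host("web", "myapp")
# -> "web.fly_process_group.kv._metadata.myapp.internal"
```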

Thanks for the detailed bug report information!
