Dynamic machine metadata

root@91857779a76648:/# curl --unix-socket /.fly/api "http://localhost/v1/apps/dov-testing-stuff/machines/91857779a76648/metadata"
root@91857779a76648:/# curl --unix-socket /.fly/api "http://localhost/v1/apps/dov-testing-stuff/machines/91857779a76648/metadata/random_key" -d '{"value":"random"}'
root@91857779a76648:/# curl --unix-socket /.fly/api "http://localhost/v1/apps/dov-testing-stuff/machines/91857779a76648/metadata"

There are a bunch of new things here! Let’s break it down.

  1. A new API to update machine metadata at runtime, without updating or restarting the machine.
GET /v1/apps/:app-name/machines/:id/metadata
POST /v1/apps/:app-name/machines/:id/metadata/:key {"value":"the value"}
DELETE /v1/apps/:app-name/machines/:id/metadata/:key
  2. An authenticated proxy to access parts of the Machines API from inside the VM

Machines now have a Unix socket at /.fly/api (old machines will need to be restarted to get access). For now it only grants access to the metadata parts of the API, and only for machines in the same app. The hostname part of the URL doesn’t matter, as it will be replaced. That means:

root@91857779a76648:/# curl --unix-socket /.fly/api "http://localhost/v1/apps/dov-testing-stuff/machines/91857779a76648/metadata"
root@91857779a76648:/# # is functionally the same as
root@91857779a76648:/# curl --unix-socket /.fly/api "http://verynotrealdomain.faketld/v1/apps/dov-testing-stuff/machines/91857779a76648/metadata"
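
If you end up calling these endpoints a lot, a few small shell wrappers can save some typing. This is only a rough sketch built on the three routes and the socket path described above; the function names and the FLY_API_SOCKET variable are made up for illustration and are not part of the platform.

```shell
#!/bin/sh
# Sketch only: thin wrappers around the metadata endpoints described above.
# FLY_API_SOCKET and all function names are hypothetical, chosen for this example.
FLY_API_SOCKET="${FLY_API_SOCKET:-/.fly/api}"

# Build the request path for a machine's metadata.
# $1 = app name, $2 = machine id, $3 = optional metadata key
metadata_path() {
  mpath="/v1/apps/$1/machines/$2/metadata"
  if [ -n "${3:-}" ]; then
    mpath="$mpath/$3"
  fi
  printf '%s\n' "$mpath"
}

# GET all metadata for a machine.
metadata_list() {
  curl --silent --unix-socket "$FLY_API_SOCKET" \
    "http://localhost$(metadata_path "$1" "$2")"
}

# POST one key. $3 = key, $4 = value.
# The value is pasted into the JSON body as-is, so it must not contain
# double quotes or backslashes.
metadata_set() {
  curl --silent --unix-socket "$FLY_API_SOCKET" \
    "http://localhost$(metadata_path "$1" "$2" "$3")" \
    -d "{\"value\":\"$4\"}"
}

# DELETE one key. $3 = key.
metadata_delete() {
  curl --silent --unix-socket "$FLY_API_SOCKET" -X DELETE \
    "http://localhost$(metadata_path "$1" "$2" "$3")"
}
```

With those loaded, `metadata_set dov-testing-stuff 91857779a76648 random_key random` should be equivalent to the POST shown at the top of this post, and `metadata_delete dov-testing-stuff 91857779a76648 random_key` would remove the key again.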

An important consequence: anything that’s root in a machine can read and write metadata for any machine in its app. If you are using machines to run user workloads, keep this in mind. You should probably be using separate apps per user anyway.

There is a ton of “magic” that happens behind the scenes to make this feature possible and we go into that a bit more in the other post for those of you interested in the nitty gritty details.

Why we need this

This came from the open source database team at Fly.io. Our remit is to improve the platform to make it better for running open source databases.

When we first built our Postgres launcher, we wanted to hook apps up with zero config. Apps generally expect to connect to a writable instance. Sadly, Postgres drivers don’t reliably solve that problem. So we crammed an HAProxy process into each VM to route connections to the primary node.

This worked great for initial setup. HAProxy is a slick piece of kit. Over time, we noticed some edge case issues. First, using DNS to route to individual Postgres VMs caused problems when individual Postgres VMs shit themselves. We run a pretty wicked global proxy already, so we figured: why not route all Postgres connections through a thing that knows if a VM is in a bad state?

We now have two proxies in front of Postgres. Debugging network timeouts is a tremendous pain. It gets worse when you layer proxies. If you search Discourse, you’ll find lots of people surprised that their DB connection timed out. Removing HAProxy is a step towards making this problem go away.

Which gets us back to Machine metadata. If we’re going to make our own proxy route to a writable Postgres Machine, it needs to know which Machine qualifies. And, more generally, it’s useful to teach our proxy to route to Machines in different states.

So, step one: let Machines tell us about their state. We have a notion of Machine metadata already. Giving Machines a mechanism to update their own metadata is a nice, general-purpose platform function we can use to solve Postgres routing.

So, tldr: we needed our proxy to route to writable Postgres. We didn’t want it to just route to Postgres, so we scope creeped this problem until we found a path to letting y’all use the plumbing for things we haven’t yet thought of. And now you get a feature along the way.

Use cases

With that in mind, here is a sneak peek of things to come using these new features. We’d love to hear what you’ll build as well.

  1. Replacing HAProxy in our Postgres images by using dynamic metadata to force routing a service to a specific Machine. This is valuable for DB reliability, because it strips out a whole intermediary layer between your database client and your database. Currently, with a modern Fly Postgres setup, your connection goes through two proxies: client → fly-proxy → HAProxy → PG node. This causes problems like hard-to-debug TCP timeouts (proxies over proxies are hard). With dynamic metadata plus a metadata-aware fly-proxy, we can teach fly-proxy to route directly to the primary without having to go through HAProxy.
  2. Querying machines in DNS via metadata: key.value.metadata.appname.internal
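
To make the DNS form in the second item a bit more concrete, here is a rough sketch of how a lookup might be scripted. It is listed as an upcoming feature, so treat everything here as an assumption: the function names are hypothetical, the `role`/`primary` pair is an invented example, and the sketch guesses that these names resolve to AAAA records.

```shell
#!/bin/sh
# Sketch only: builds names of the form key.value.metadata.appname.internal.
# Function names are hypothetical; the feature itself is described as upcoming.

# $1 = metadata key, $2 = metadata value, $3 = app name
metadata_dns_name() {
  printf '%s.%s.metadata.%s.internal\n' "$1" "$2" "$3"
}

# Resolve the name with dig. AAAA is an assumption about the record type.
metadata_lookup() {
  dig +short AAAA "$(metadata_dns_name "$1" "$2" "$3")"
}
```

For example, `metadata_dns_name role primary dov-testing-stuff` prints `role.primary.metadata.dov-testing-stuff.internal`, which you could then pass to any resolver from inside the app’s private network.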

If you’re interested, we wrote a follow-up post on the technical details.