Machines + builder + private docker registry?

I’d like to use fly machines, with remote builder and a private docker registry. Is this possible already? Can I use remote builder to push into a private registry and then define the pull secret when configuring a machine? Is there a way to use multiple remote builders? Do you see an issue when I “deploy” lots of machines in all regions? I guess machines are, just like apps, accessible from the internal network from within the org, right?

We don’t support private registries for machines yet. You’ll need to push to our registry to launch machines.

Are you trying to do customer builds with remote builders? You can’t use our remote builders to do that directly, but you can run machines that do the builds yourself. Our builders are just this, running in machines: superfly/rchab, the Fly.io Remote Builder (Remote Controlled Hot Air Balloon).

Machines are accessible on their internal network. We are shipping network isolation features soon.

It sounds like you might be trying to isolate individual customer builds/registry access/networking, right?

We’re looking for a solution to offer a hosted version of WunderGraph.
The architecture is described here. What we’re looking at is running the “WunderGraph Server” as a multi-tenant application, as one app on fly. Our users can customize their “Gateway behaviour” by writing TypeScript. We’d like to run this “user code” securely in Machines, side by side, with the Gateway in front. When a user makes a change to their codebase, we’d like to kick off a “build process” which checks out their repo and builds the gateway config. From there, we’d make an API call to our backend to say “here’s the new gateway config” and automatically deploy their TypeScript extensions into machines.

I guess it could be quite efficient to run one “builder machine” per user, as I want deployments to be blazing fast. If we allocate a dedicated CPU and a volume per customer, builds should be cached and super fast, I guess? This should still be cheap, as we can immediately kill the builder machine once it’s done.

We’ll secure traffic at the HTTP layer, so it’s ok to expose these customer functions on the public internet via HTTPS. Ideally, we could disable private networking for customer functions, or at least disallow access to other resources within the org if possible.

If we push images to your registry with rchab, can we ensure that nobody except us can pull these images? I could probably create one org per customer, but then I’d need to be able to manage billing across multiple orgs automatically.

Stupid hack maybe, but could I create one org per customer, then run flyctl from within my own CI pipeline? That should automatically deploy a builder for each customer, right? Or maybe that’s not what you want?

So tldr, I’d like to deploy one multi tenant app in our org, and then per customer, multiple apps to run their “extensions” in a secure and cheap way using machines.

Please suggest the ideal architecture that makes sense on fly and doesn’t work “against your roadmap”.

I’ve added some thoughts above. Hoping to get some input.

@kurt I hope to get some answers so we can start a POC, thanks.

@jens yes this will work great! We built Fly Machines for this exact problem.

Our builder / registry setup isn’t quite right for what you’re doing, though. Our registry isn’t designed so that an untrusted user environment can auth and push to it. We expect the person doing the pushing to have permissions for the app they’re pushing to.

What I would do is create a small Docker registry proxy that knows about your users’ auth, and pushes builds to registry.fly.io with your Fly.io credentials. We might have a better option for this in the future, “storing users’ builds” is something we haven’t fully solved yet.

Once you have Docker images ready, running the whole setup is simple. Create a gateway app, and then create a fly app per user. This will give you the most flexibility.

There are two undocumented features you’ll need, though:

  • Your gateway app can respond with fly-replay: app=<customer-app-name> after it accepts a request. This will reroute the request to your user’s nearest machine, start it up if needed, and handle the response.
  • Create each app with a unique network name. Pass network: "<customer-app-name>" through the POST /v1/apps call and their machines will run on isolated private networks.
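
The two features above can be sketched roughly as follows. This is only an illustration, not Fly.io's documented API: the function names and the `app_name` and `org_slug` payload fields are assumptions, and only the `fly-replay: app=<customer-app-name>` header and the `network` field come from the reply above. The sketch builds the header and the request body without making any network calls.

```python
import json
from typing import Dict


def replay_headers(customer_app: str) -> Dict[str, str]:
    """Headers the gateway returns to hand a request over to a customer app."""
    if not customer_app:
        raise ValueError("customer_app must not be empty")
    return {"fly-replay": f"app={customer_app}"}


def create_app_payload(customer_app: str, org_slug: str) -> str:
    """JSON body for POST /v1/apps with a per-customer network name.

    `app_name` and `org_slug` are assumed field names; `network` is the
    field named in the reply above.
    """
    return json.dumps({
        "app_name": customer_app,
        "org_slug": org_slug,
        "network": customer_app,  # unique network name per customer app
    })
```

The gateway would attach `replay_headers("customer-a")` to its response, and the provisioning code would send `create_app_payload("customer-a", "my-org")` once per customer.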

Does that help?

We’d always run the CI on our end, so we can push the images ourselves. That’ll work fine.

Following up, let’s imagine we provide this FaaS service in all 21 locations. This means that on each customer deployment, we’ll build and push an image, and then make 21 API calls to create/update 21 machines, one per region, for every customer app. Are there limitations we might run into?

Additionally, I’ve observed that the REST API doesn’t really reflect the latest state of the VM. If I kill the machine with the REST API, it takes a few seconds until the REST API confirms this state. I’ve seen the same thing when restarting a machine by making an HTTP call: it takes a few seconds until the REST API reports that the service is running.

If I want to build a state machine around this API, it’ll help a lot if it wasn’t eventually consistent.

Btw. we’ve only tested with the http handler, but does waking a machine also work with the TCP handler?

That said, with the info we have, we’ll start our POC, thanks for your help.

If you want customer machines in every region, 21 API calls is the way to go. That’s probably not what you want though! What I would do is track metrics from your gateway app and create machines where you need them.
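
As a rough sketch of "create machines where you need them": given per-region request counts collected by the gateway, pick the regions that have enough traffic but no machine yet. The function name, the threshold, and the shape of the metrics are all assumptions made for illustration.

```python
from typing import Dict, List


def regions_needing_machines(
    requests_per_region: Dict[str, int],
    existing_regions: List[str],
    min_requests: int = 100,  # hypothetical threshold, tune to your traffic
) -> List[str]:
    """Return regions with enough traffic that do not have a machine yet."""
    existing = set(existing_regions)
    return sorted(
        region
        for region, count in requests_per_region.items()
        if count >= min_requests and region not in existing
    )
```

For example, with traffic in `fra`, `ams`, and `syd` above the threshold and a machine already running in `fra`, the function returns `["ams", "syd"]`.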

But updating 21 different machines in sequence is totally fine. There are no known limitations you’ll run into. If you do hit issues, let us know. This is actually how fly deploy will work soon, flyctl will make a bunch of requests to the API to do a rolling deploy.
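
Updating machines in sequence could look something like this. It is a minimal sketch, not how flyctl implements it: `update_machine` is a hypothetical callback that would wrap the actual API call and return whether the update succeeded, and the rollout stops at the first failure so the remaining machines keep running the old image.

```python
from typing import Callable, Dict, List


def rolling_update(
    machine_ids: List[str],
    update_machine: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Update machines one at a time and stop at the first failure."""
    updated: List[str] = []
    for index, machine_id in enumerate(machine_ids):
        if not update_machine(machine_id):
            return {
                "updated": updated,
                "failed": [machine_id],
                "pending": machine_ids[index + 1:],  # untouched machines
            }
        updated.append(machine_id)
    return {"updated": updated, "failed": [], "pending": []}
```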

The REST API is strongly consistent for an individual machine status like GET /v1/apps/<app>/machines/<id>. It’s eventually consistent for an index, though. We have some improvements to make the index refresh faster, but you should use the individual machine endpoint for anything that needs strong consistency (the listing can’t be strongly consistent, really).
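
For a state machine built on top of this, one option is to poll the individual machine endpoint rather than the listing. The sketch below assumes a hypothetical `get_state` callback that would wrap `GET /v1/apps/<app>/machines/<id>` and return the machine's state as a string; the state names used in the usage note are illustrative only.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll one machine's state until it equals `target` or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_state() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

After asking the API to stop a machine, the caller would run something like `wait_for_state(fetch_state, "stopped")` before moving to the next transition, where `fetch_state` is the caller's own wrapper around the single-machine endpoint.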

Waking a machine might work with TCP, but it’s not supported yet. This is very much designed for HTTP. fly-replay makes this simple to build for HTTP apps.

Well, Thomas wrote here that waking a machine up should work even with UDP!

Also, we were using Machines with TCP and it worked just fine for 15 hours or so… until all VMs went unresponsive and never recovered. Not sure if the issue was with TCP+Machines or generally Machines.