Machine questions

Hey, we’re looking to build a FaaS-type service on top of Fly, and ‘scale-to-zero’ is a requirement for us, so Machines seem to be the best/only option. We’ve been doing a bunch of research on these forums and the docs, but there are a few unknowns we can’t seem to find a solid answer to.

1. Fly Apps v2 + Machines

With Fly Apps we get features such as automatic scaling, easily adding regions etc. With Fly Apps v2 running on machines, will we eventually get these features but with the “scale to zero” ability? Currently it seems there’s no way to easily scale machines based on demand - you need to ensure there are enough machine instances ‘available’ to handle expected traffic. I’m not sure how this model works with traffic spikes?

2. Cloning machines

You can clone a machine to different regions, but can you clone a machine multiple times in one region (e.g. London)? If so, will traffic be load balanced between those?

3. Scale plan + machine limits

Are there any limits (if we’re on the Scale plan) to how many machines (and I guess apps) you can create on a single org?

We’re thinking about a system to handle preview builds, where we create a machine (and app) per deployment, and delete them after say 14 days. These machines would scale to zero, but it could potentially mean creating a lot of them - are there any restrictions per org to be aware of? I can’t see any in the docs.
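To make the cleanup side of this concrete, here is a minimal sketch of how the 14-day expiry could be decided. It only picks which preview apps are due for deletion; actually deleting them would go through the Machines API. The names `PREVIEW_TTL` and `expired_previews`, and the idea of tracking creation times ourselves, are assumptions for illustration rather than anything Fly provides.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumption: previews live for 14 days, as described above.
PREVIEW_TTL = timedelta(days=14)


def expired_previews(created_at: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of preview apps older than PREVIEW_TTL, oldest first.

    `created_at` maps a (hypothetical) preview app name to the time we
    recorded when creating it.
    """
    old = [name for name, ts in created_at.items() if now - ts >= PREVIEW_TTL]
    return sorted(old, key=lambda name: created_at[name])
```

A periodic job could call this and then delete each returned app and its machines.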

4. Updating machines

What’s the best practice for updating the image on machine(s)? Fly Apps seems to create new instances alongside the old ones, wait for the health check to be successful and eventually remove the old ones.

Do we need to manage this ourselves with Machines to prevent downtime? E.g. for each machine, create a new one with the new image, wait for it to come online and delete the old one. The docs mention you can update a machine, but I assume this will have downtime when it restarts?

5. “Indisposed hardware”

The blog post mentions you should prepare for hardware being indisposed. We’re not planning to run 64GB RAM machines, but if we successfully create a machine and leave it inactive (waiting for a request), does this also mean there is a chance that it won’t be able to start (and thus requests hang)?

If so, how would we go about handling this? I assume the machine will error in some form, but it’s unclear how you handle that situation automatically (e.g. attempt to create another machine in a different region). How do we know it failed to start?


Thanks - hope these questions make sense. It’s really cool to see the API has support for pretty much all Fly operations to make this possible; these questions are more around how to tie all of that together.

Lots to cover here!

Keep in mind that I am not the final say on product direction, I’m along for the ride as well, so some answers related to features are really “This is what I think will happen”.

I also don’t dev on machines, but I’m here before others this morning, so let’s start with having me let you know as much as I know :smiley:

1. Fly apps v2 + machines

Apps v2 (running on machines) will eventually be on par with Apps v1. Currently you need to scale manually yourself. I’ve seen discussions internally about figuring out how auto-scaling will work - it’s being thought about.

2. Cloning

The Fly Proxy will load balance traffic across all Machine instances of an application within a region.

Let us know if you have more specific questions here, there are details to this but they aren’t worth going into yet.

3. Scale plan + machine limits

There are machine limits. I’m being intentionally vague about what they are as they are related to both limiting abuse and ensuring no one person can eat up entire hosts by generating too many machines.

We can lift these on request, but you need to talk to support.

You’ll start to see 422 HTTP responses when creating a machine if you hit limits. The JSON body response will let you know you’ve hit a limit.
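A minimal sketch of how a client might tell a limit error apart from other failures, assuming only what is said above: a 422 status and a JSON body describing the problem. The shape of that body is not specified here, so the `error` field name and the `CreateResult` type are assumptions for illustration. A 422 may also be returned for other invalid requests, so the sketch only flags a possible limit and surfaces the message.

```python
import json
from dataclasses import dataclass


@dataclass
class CreateResult:
    ok: bool
    maybe_limit: bool  # True when the status was 422
    detail: str        # message from the body, if one could be extracted


def classify_create_response(status: int, body: str) -> CreateResult:
    """Interpret the response to a machine-create request.

    Assumption: the JSON body carries a human-readable message under
    "error" (hypothetical field name). If it does not, the raw body is
    returned as the detail instead.
    """
    if 200 <= status < 300:
        return CreateResult(ok=True, maybe_limit=False, detail="")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    detail = payload.get("error", body) if isinstance(payload, dict) else body
    return CreateResult(ok=False, maybe_limit=(status == 422), detail=str(detail))
```

When `maybe_limit` is set, the sensible reaction is to stop creating machines and contact support rather than retry in a loop.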

4. Updating Machines

Since machines are a lower-level “primitive”, updating a machine is not directly analogous to deploying a new version of a v1 app.

Updating a machine (by changing some parameter such as the image it’s based on, or scaling instance size) creates a new machine (replacing the old machine) and therefore can cause downtime during that period.

I believe apps v2 will handle that so fly deploy on V2 will do rolling updates, etc.

5. “Indisposed hardware”

There is a chance a machine won’t be able to start. I think (but am not 100% sure) that you’ll get an HTTP error response if there is no capacity available when starting an existing machine (immediate feedback, vs something async).

Handling that scenario is currently a manual exercise. For example: create a new machine instead of starting an existing one. Perhaps create a new machine and delete the old one that failed, thus keeping a “stable” number of total machines in existence (something like that!)
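As a rough sketch of that fallback: try to start the existing machine, and if the start fails, create a replacement and delete the failed one so the total count stays the same. The `start_machine`, `create_machine` and `delete_machine` callables and the `StartError` exception are hypothetical stand-ins for whatever wraps the Machines API in your code; they are not real Fly functions.

```python
from typing import Callable


class StartError(Exception):
    """Raised by the hypothetical start_machine callable when a start fails."""


def ensure_running(
    machine_id: str,
    start_machine: Callable[[str], None],
    create_machine: Callable[[], str],
    delete_machine: Callable[[str], None],
) -> str:
    """Return the ID of a running machine, replacing machine_id if needed."""
    try:
        start_machine(machine_id)
        return machine_id
    except StartError:
        # The start failed (for example, no capacity on the host). Create
        # the replacement first, then delete the failed machine so the
        # total number of machines stays the same.
        new_id = create_machine()
        delete_machine(machine_id)
        return new_id
```

If `create_machine` itself fails, the exception propagates and the old machine is left in place, which seems like the safer default.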

Is this for certain, now? See: Rolling my own autoscaling for Fly Machines - #4 by kurt

I don’t think I see a contradiction there - the fly proxy load balances between machines in an app, but doesn’t autoscale (which seems to be what the linked topic is more about).

It will, however, start a stopped machine. It just doesn’t change the number of machines available to load balance between.

Thanks for the replies!

Apps v2 (running on machines) will eventually be on par with Apps v1. Currently you need to scale manually yourself. I’ve seen discussions internally about figuring out how auto-scaling will work - it’s being thought on.

That’s good to know - I assume that’ll be a while before we get to that position though?

The Fly Proxy will load balance traffic across all Machine instances of an application within a region.

Is this for certain, now? See: Rolling my own autoscaling for Fly Machines - #4 by kurt

Yes, in fact this was one of the questions which triggered mine. So if 3 are within a region, it will load balance between them (presumably as evenly as it can, although I don’t have too much knowledge here), just not create new ones.

Updating a machine (by changing some parameter such as the image its based on, or scaling instance size) creates a new machine (replacing the old machine) and therefore can cause downtime during that period.

Thanks, thought as much. So I guess it is possible if you bring up a new machine, wait for it to become live and bring down the old ones. This obviously adds a bunch of work as it needs to handle failures, track the state of machines and implement rolling update logic.
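A minimal sketch of what that rolling logic might look like, one machine at a time. The `create_machine`, `wait_until_live` and `delete_machine` callables are hypothetical wrappers around the Machines API (not real Fly functions), and a real version would also need timeouts and retries.

```python
from typing import Callable, List


def rolling_replace(
    old_ids: List[str],
    create_machine: Callable[[str], str],
    wait_until_live: Callable[[str], bool],
    delete_machine: Callable[[str], None],
    image: str,
) -> List[str]:
    """Replace each old machine with one running `image`, one at a time.

    Stops at the first replacement that fails to become live, leaving the
    remaining old machines untouched so traffic still has somewhere to go.
    """
    new_ids: List[str] = []
    for old_id in old_ids:
        new_id = create_machine(image)
        if not wait_until_live(new_id):
            delete_machine(new_id)  # clean up the failed replacement
            raise RuntimeError(f"replacement for {old_id} never became live")
        delete_machine(old_id)  # remove the old one only once the new one is up
        new_ids.append(new_id)
    return new_ids
```

Even in this small form it needs somewhere to record which machines belong to which deployment, which is the state tracking mentioned above.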

There is a chance a machine won’t be able to start. I think (but am not 100% sure) that you’ll get an HTTP error response if there is no capacity available when starting an existing machine (immediate feedback, vs something async).

Overall I think this is my biggest concern. Without Fly managing autoscaling (by resource), Machines don’t seem to be usable for running unknown user code. If it were my own project with an insight into traffic and general resource requirements, I could provision as many machines as I think I’ll need. However, a FaaS model involves running users’ application code - their project could demand a lot of resources. The only ‘solution’ to this is to create lots of machines which can scale to zero. However, even then it still might not be enough (or it might be overkill for small projects).

Create a new machine instead of start an existing one

Doing this automatically doesn’t seem possible - I’d need to keep track of resources / machine states etc (at this point you’re starting to build your own orchestration service around Fly).


It seems like this project is only going to be viable on “Fly Apps v2 with scale-to-zero and autoscaling”.