Hey, we’re looking to build a FaaS-type service on top of Fly, and ‘scale-to-zero’ is a requirement for us, so Machines seem to be the best (or only) option. We’ve been doing a bunch of research on these forums and in the docs, but there are a few unknowns we can’t find a solid answer to.
1. Fly Apps v2 + Machines
With Fly Apps we get features such as automatic scaling, easily adding regions, etc. With Fly Apps v2 running on Machines, will we eventually get these features together with the “scale to zero” ability? Currently there seems to be no easy way to scale machines based on demand: you need to make sure enough machine instances are ‘available’ to handle the expected traffic. I’m not sure how this model copes with traffic spikes.
2. Cloning machines
You can clone a machine to different regions, but can you clone a machine multiple times in one region (e.g. London)? If so, will traffic be load balanced between the clones?
3. Scale plan + machine limits
Are there any limits (if we’re on the Scale plan) to how many machines (and I guess apps) you can create on a single org?
We’re thinking about a system to handle preview builds, where we create a machine (and app) per deployment and delete them after, say, 14 days. These machines would scale to zero, but it could potentially mean creating a lot of them - are there any restrictions per org to be aware of? I can’t see any in the docs.
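For context, the cleanup side is simple on our end. Here is a rough Python sketch of how we'd pick which preview deployments to delete. The `created_at` mapping and the app names are made up for illustration, and the actual deletion would go through whichever API call turns out to be appropriate.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_previews(
    created_at: Dict[str, datetime],
    now: datetime,
    ttl: timedelta = timedelta(days=14),
) -> List[str]:
    """Return the names of preview deployments that are at least `ttl` old.

    `created_at` maps a (hypothetical) preview app name to its creation time.
    """
    return sorted(
        name for name, created in created_at.items() if now - created >= ttl
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    previews = {
        "preview-old": now - timedelta(days=20),
        "preview-new": now - timedelta(days=2),
    }
    # Only the 20-day-old preview is past the 14-day window.
    print(expired_previews(previews, now))
```

So at any time we'd hold roughly 14 days' worth of deployments, which is where the question about per-org limits comes from.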
4. Updating machines
What’s the best practice for updating the image on machine(s)? Fly Apps seems to create new instances alongside the old ones, wait for their health checks to pass, and then remove the old ones.
Do we need to manage this ourselves with Machines to prevent downtime? E.g. for each machine, create a new one with the new image, wait for it to come online and delete the old one. The docs mention you can update a machine, but I assume this will have downtime when it restarts?
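To make that concrete, this is roughly the flow we imagine having to write ourselves, sketched in Python. The `create`, `wait_healthy` and `destroy` callables are placeholders for whatever the right Machines API calls are; they are our assumption, not real Fly client functions.

```python
from typing import Callable, Iterable, List


def rolling_replace(
    old_ids: Iterable[str],
    create: Callable[[], str],
    wait_healthy: Callable[[str], bool],
    destroy: Callable[[str], None],
) -> List[str]:
    """Replace machines one at a time.

    An old machine is only removed once its replacement reports healthy,
    so there is always something available to serve traffic.
    """
    new_ids: List[str] = []
    for old_id in old_ids:
        new_id = create()  # placeholder: create a machine from the new image
        if not wait_healthy(new_id):  # placeholder: poll until healthy or timed out
            destroy(new_id)  # clean up the failed replacement, keep the old machine
            raise RuntimeError(f"replacement for {old_id} never became healthy")
        destroy(old_id)
        new_ids.append(new_id)
    return new_ids
```

If the platform already does something like this for Machines, we'd much rather rely on that than maintain our own version.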
5. “Indisposed hardware”
The blog post mentions you should prepare for hardware being indisposed. We’re not planning to run 64GB RAM machines, but if we successfully create a machine and leave it inactive (waiting for a request), is there still a chance that it won’t be able to start (and thus requests hang)?
If so, how would we go about handling this? I assume the machine will error in some form, but it’s unclear how you’d handle that situation automatically (e.g. by attempting to create another machine in a different region). How do we know it failed to start?
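The kind of fallback we have in mind looks something like this Python sketch. `try_start` is a placeholder for "create or start a machine in this region and raise if it fails"; how that failure is actually surfaced is exactly the part we don't know, so treat the exception handling as an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def start_with_fallback(
    regions: Sequence[str],
    try_start: Callable[[str], None],
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each region in order and stop at the first one that starts.

    Returns the region that worked (or None if all failed) together with
    the (region, error message) pairs for every failed attempt.
    """
    failures: List[Tuple[str, str]] = []
    for region in regions:
        try:
            try_start(region)  # placeholder: raises if the machine cannot start
            return region, failures
        except Exception as exc:  # assumption: a failed start surfaces as an error
            failures.append((region, str(exc)))
    return None, failures
```

For example, we'd call it with an ordered list such as `["lhr", "ams", "fra"]` and log the returned failures.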
Thanks - hope these questions make sense. It’s really cool to see that the API supports pretty much all Fly operations to make this possible; these questions are more about how to tie all of that together.