Hi, I’m new to fly.io. Is there any way to ensure that there is never more than one machine? We have a SaaS that needs to perform exactly-once operations. Therefore, we need to ensure that only one server is active.
Thanks, Max
Hi,
As long as you don’t enable auto-scaling (I think that’s off by default) then you decide how many machines to create.
When you first create an app it may default to creating two machines (since generally most people would want that for redundancy). If so, simply scale down to 1.
Check out this part of the docs about minimums and maximums:
When you fly launch or fly deploy a Fly app, you can specify --ha=false, which will prevent it from creating more than one machine. If you don’t do anything to set up scaling beyond this, it should never have more than one machine.
Just to be clear, we provision 2 machines by default for availability, rather than scalability. Even if your app doesn’t serve a lot of requests, it is better to have more than one machine if you don’t want your website to go down.
Thanks a lot to all for the answers.
We have a background job that needs exactly-once semantics. Distributed locks are really challenging (at least for me). An advisory lock on Postgres could be used. But the most robust way would be if the platform could guarantee that in any situation there is at most one server running. However, for most other scenarios I think choosing availability over consistency (CAP) makes more sense. What does Fly pick?
Don’t reinvent the wheel, use the db to handle the distributed locks. And don’t write your own job orchestration… use something like Temporal, which has dedupe IDs and uses its underlying db for its lock.
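As a rough illustration of the "let the db handle the lock" idea, here is a minimal sketch built on Postgres advisory locks (pg_try_advisory_lock and pg_advisory_unlock are real Postgres functions). The helper names advisory_key and exclusive_job are made up for this example, and conn is assumed to be any DB-API 2.0 connection to Postgres. Note that the lock only stops two servers from running the job at the same time; for exactly-once you would still need to record in the database that the job has completed.

```python
import hashlib
from contextlib import contextmanager


def advisory_key(job_name: str) -> int:
    """Map a job name to a signed 64-bit integer, the key type advisory locks take."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


@contextmanager
def exclusive_job(conn, job_name: str):
    """Yield True if this session acquired the lock for job_name, else False.

    `conn` is assumed to be a DB-API 2.0 connection to Postgres.
    The lock is session-level, so it is also released if the connection dies.
    """
    key = advisory_key(job_name)
    cur = conn.cursor()
    # Non-blocking attempt: returns true only for the single session that wins.
    cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    acquired = bool(cur.fetchone()[0])
    try:
        yield acquired
    finally:
        if acquired:
            cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
        cur.close()
```

A caller would wrap the job in `with exclusive_job(conn, "nightly-report") as acquired:` and simply skip the work when acquired is False.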
I just want to avoid distributed systems completely. A good old (real) server with a monolith would fulfill my requirement with ease.
However, for our SaaS I would like to have one server / monolith per customer (that scales to zero if it’s not used). The further development of our current multi-tenant (distributed) system becomes ever more complex and risky.
Multi-tenancy is a giant source of complexity (Multi-tenancy is what’s hard about scaling web services).
It’s a good idea to keep things simple. Are your customers okay when their app goes down due to regional outages? You shouldn’t be afraid of using distributed systems, just don’t try to create your own.
We have been running our system in a single zone on Google Cloud for 8 years, and Google has never had a single outage in all that time. Therefore I would rather save the money and complexity of deploying everything in multiple zones or even regions. Our customers would be okay with short outages; it’s a social media tool, so nothing vital for life.
Wouldn’t your systems be so much simpler if you could run everything on a single machine? Distributed systems, aka network calls, introduce so many new error types to a system.
Just in case, Google Cloud does live migration of VMs, whereas Fly doesn’t. It is not apples-to-apples.
Regarding distributed systems in general, Marc Brooker has written Not Just Scale recently. This could be an interesting read. Distributed systems are complex, but there are reasons people (including ourselves) make distributed systems.
Yes, Google Cloud’s VM live migration mitigates many failures. I once observed an uptime of 624 days on one of our VMs. However, the live migration is an excellent example of how to protect the developer from the underlying unreliable hardware. I’m also genuinely interested in studying and building distributed systems, and I agree with all the points in the article you shared. But I also try to follow this:
The first rule of distributed systems is don’t distribute your system until you have an observable reason to.
No question, Fly needs to be a distributed system. Even our small SaaS benefits from some parts being distributed (Google Cloud Storage, database). But maybe there are ways to push the distributed parts to the edges of the system?
The DHH article and the strikingly simple deployment model of Once.com got me wondering whether I can reduce the complexity of our SaaS by having one “server” (container) per customer. The option for a customer server to be offline would even help during database migrations.
Fly would be a perfect platform for a Once.com-like deployment model. You could even offer customers the option to place their container in a nearby Fly region. But for us as a SaaS with 3130 paying customers, it would make sense to have more control over how many containers are packed onto a Fly hardware server, to make better trade-offs regarding costs and cold-start time for our use case.
My experience is that we could massively reduce the complexity of our web app by getting rid of its multi-tenancy features and letting it serve only one customer (team). In this new system, we even have one logical database per customer. We are only three developers and we definitely cannot afford to spend months or even years on database optimizations. This example of a 20+ team spending more than two years at Figma to restructure their Postgres got me thinking. Their sharding schema basically ended up being one shard per customer (team).
FWIW, I came to a similar conclusion: blue print, slides. That does mean that you are responsible for your own backups and disaster recovery, but like you I found that to be the right trade-off. I figure that a machine becoming inaccessible for any reason is a rare event, and I can redeploy a customer to a new machine with their data in a matter of minutes. If that were to become a problem, I could reduce that to seconds, but it hasn’t been a problem.
I gather that you are considering one app per customer, each with one machine. I went with one app, with one machine per group of events, where a customer may have multiple events. When a request comes in to a wrong machine, I either replay or reverse proxy it to the right machine.
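To make the "replay it to the right machine" step a bit more concrete, here is a minimal sketch of the routing decision. It relies on Fly's fly-replay response header with an instance= target; everything else (the EVENT_TO_MACHINE table, the event names, the machine IDs, and the replay_headers function) is hypothetical and only for illustration, not taken from the showcase app.

```python
from typing import Dict, Optional

# Hypothetical mapping from event name to the ID of the Fly Machine hosting it.
EVENT_TO_MACHINE: Dict[str, str] = {
    "harrisburg-2025": "machine-a",
    "raleigh-2025": "machine-b",
}


def replay_headers(event: str, current_machine: str) -> Optional[Dict[str, str]]:
    """Decide whether this machine should serve a request for `event`.

    Returns None when the request is already on the right machine, otherwise
    the response headers that ask Fly's proxy to replay it elsewhere.
    Raises KeyError for an event that is not in the table.
    """
    target = EVENT_TO_MACHINE[event]
    if target == current_machine:
        return None  # serve locally
    return {"fly-replay": f"instance={target}"}
```

The web framework's request handler would call this first and, when it gets headers back, return an empty response carrying them instead of handling the request itself.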
I limit myself to one machine per region with the following simple but effective code: showcase/bin/deploy at 1f9eeb3a857e5797fab3b7164307e7c514e89e24 · rubys/showcase · GitHub
In my case, it is one machine per region. That’s because I’m an order of magnitude smaller than you, and do host a small number of geographically close customers (each with potentially multiple events) together on one machine, with one running instance of my app per event.
But you could do the same thing with Dynamic Machine Metadata. Or, if you go with one app per customer, you won’t have to worry about metadata or even regions.
Thanks a lot for sharing. I’m glad to hear that I’m not the only one who came to these conclusions. I was already afraid that something was fundamentally wrong with the approach.
It is great that you have documented your experience report in the blog post and the slides. Was your corresponding talk recorded as a video? I would love to watch it.
It’s awesome that your app is open-source. In your slides, you wrote:
Each machine shuts down when not in use; restarts on demand
I guess the apps (for the dance events) on the (Fly) machine(s) are running all the time while the machine is running? And then the machine scales to zero if there are no more incoming requests? My current architecture has one VM with a reverse proxy receiving the traffic. It forwards it to a few VMs running Docker containers (one for each customer). To save costs, a customer’s container is stopped when it is no longer in use. Therefore, I needed to write a custom reverse proxy (the Google Cloud Load Balancer only handles HTTP and HTTPS) to intercept the first request and get the chance to start the container before forwarding it.
However, all this is challenging to implement. Therefore, I wanted to check if Fly could help in this regard. The Firecracker VM suspend and resume could reduce the cold start of the customer apps.
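The start-on-first-request and stop-when-idle bookkeeping described above can be sketched roughly as follows. This is not the actual proxy; the ContainerGate class and the start/stop callables are hypothetical stand-ins for whatever starts and stops a customer's container, and the sketch is single-threaded, so a real proxy would need locking around it.

```python
import time
from typing import Callable, Dict, List


class ContainerGate:
    """Tracks which customer containers are running and when each was last used."""

    def __init__(
        self,
        start: Callable[[str], None],
        stop: Callable[[str], None],
        idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._start = start
        self._stop = stop
        self._idle_seconds = idle_seconds
        self._clock = clock
        # Only containers currently considered running appear here.
        self._last_used: Dict[str, float] = {}

    def before_forward(self, customer: str) -> bool:
        """Call before proxying a request. Returns True if this was a cold start."""
        cold = customer not in self._last_used
        if cold:
            # Assumed to block until the container is ready for traffic.
            self._start(customer)
        self._last_used[customer] = self._clock()
        return cold

    def stop_idle(self) -> List[str]:
        """Stop every container unused for at least idle_seconds; return their names."""
        now = self._clock()
        idle = [c for c, t in self._last_used.items() if now - t >= self._idle_seconds]
        for customer in idle:
            self._stop(customer)
            del self._last_used[customer]
        return idle
```

The proxy would call before_forward on every request and run stop_idle periodically, for example from a timer.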
I believe that when the video is ready it will be posted to https://www.youtube.com/@CarolinaCodeConference
Basically the other way around for my app. The machines never scale to zero, but a machine can run multiple apps and they each can scale to zero on that machine.
In concrete terms, I host Harrisburg PA’s events in IAD. They had one each in 2022, 2023, and 2024. In 2025, they will likely have another event, and that will be the one that will be most active. The others are still available, but scaled to zero.
The reasons why I don’t scale machines to zero likely don’t apply to you; it mostly comes down to my backup strategy and the fact that I have two orders of magnitude fewer machines than you plan to have.
Our proxy handles all of that. Again, I’m not currently using suspend, but there is no reason why you shouldn’t.
Feel free to ask questions. And there are people available who can review your architecture and make suggestions – and if I can help, I can join.
Hi @rubys while you’re on the topic of suspend, can you have someone on the team look at suspend? When my machine suspends, the initial request responds within 100-200ms, but any event after it wakes up gets blocked for a few seconds by:
ERROR stdout to vsock zero copy err: Socket not connected (os error 107)
So it kinda defeats the purpose of suspend.
Hi! I see that the discussion here has taken another turn, but I logged in to the forum with exactly the same question.
I have a GitHub pull request workflow that creates a database per branch (to be able to run e2e tests). I use the following CLI command:
flyctl deploy --app <appname> --ha=false --remote-only --regions fra -c fly.pull-request.toml db
I can confirm that SOMEHOW this command ended up creating 4 machines for me. I have no idea how, and I unfortunately have no trace at the moment since I solved the situation by destroying the app and then running the workflow again (and this time it worked, creating just one). I don’t know what edge case I stumbled upon. But I swear that I have not used any command other than the one above in the GitHub workflow, and that I still somehow ended up with 4 machines (which caused many hours of debugging weird random e2e test errors…).
You experts/fly-people, do you have any hint/clue on what could allow this given what I have specified?
I now experienced it again live!
And in the logs I can see
--regions filter applied, deploying to 0/1 machines
Process groups have changed. This will:
* create 1 "app" machine
> Launching new machine
No machines in group app, launching a new machine
I see that when the workflow was running again, it incremented the machine count by 1. I have been on parental leave for 6 months so I assume something has changed during this period. Previously redeploys of the same app did not just add another machine (we would have noticed!).
I cannot understand why I cannot just add a -N 1 flag when deploying. Oh well, I thought that --ha=false was supposed to do that, but alas.
Oh, I now found that others have had similar issues, so at least I am not alone. Bug? Deploy of multi-process app reactivates process groups that had been set to 0 - #4 by allison
I’ll try setting --update-only as suggested here, but I really would like better ergonomics here.
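Until the ergonomics improve, one possible guard is to have the workflow check the machine count after each deploy and fail loudly if it is not exactly one. The sketch below assumes the listing comes from fly machines list --json and that it is a JSON array of objects with "id" and "state" fields; that shape is an assumption, so check it against the real output before relying on it. The function names are made up for this example.

```python
import json
from typing import List


def machine_ids(listing_json: str) -> List[str]:
    """Return the IDs of all machines in the listing that are not destroyed.

    Assumes a JSON array of objects with "id" and "state" keys.
    """
    machines = json.loads(listing_json)
    return [m["id"] for m in machines if m.get("state") != "destroyed"]


def assert_single_machine(listing_json: str) -> str:
    """Return the only machine's ID, or raise if the app has zero or several."""
    ids = machine_ids(listing_json)
    if len(ids) != 1:
        raise RuntimeError(f"expected exactly 1 machine, found {len(ids)}: {ids}")
    return ids[0]
```

In a workflow step, the output of the list command for the app would be passed to assert_single_machine, so that an unexpected second machine fails the run before the e2e tests start.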