We now support setting a minimum number of machines to keep running when using the automatic start/stop feature for Apps v2. This prevents the specified number of machines from being stopped. Update your flyctl to the latest version and then set min_machines_running in your fly.toml.
If instances of your application take a while to start and that is unacceptable for your use case, you will benefit from keeping at least one instance always running (min_machines_running = 1). When a new request comes in, instead of having to wait for the app to start up after being scaled down completely (i.e. the cold start problem), it can respond immediately.
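As a rough sketch (the app name, port, and region below are placeholders; check the fly.toml reference for your flyctl version), the relevant settings sit alongside your service definition:

```toml
# fly.toml (illustrative values)
app = "my-app"            # placeholder app name
primary_region = "ams"    # machines are only kept running in this region

[http_service]
  internal_port = 8080    # placeholder port
  force_https = true
  auto_stop_machines = true   # let the proxy stop idle machines
  auto_start_machines = true  # let the proxy start them again on demand
  min_machines_running = 1    # keep at least one machine running in the primary region
```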
What you need to know
The most important thing to know is that we only keep instances running in the primary region of your app. All other regions will still get scaled down to zero. As an example, if min_machines_running = 3, then you’ll need at least 3 instances in your primary region.
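For example, if your primary region is ams, something along these lines should get you there (flag names can vary between flyctl versions, so check `fly scale count --help`):

```sh
# Make sure the primary region has at least 3 machines
fly scale count 3 --region ams
```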
Some other things to know:
* The maximum number of machines we can scale up to is implicitly defined by the number of machines your app has. We will scale your app all the way up if demand requires it and scale back down to the specified minimum.
* The default minimum is 0.
* This does not solve the cold start problem entirely. When a request comes in and the proxy decides to start a new instance, that request waits for the new instance to start. We don’t start a new instance while servicing the current request with an already-running instance. So while you may not run into a cold start for your first instance, the request that causes us to start a second one will. We’re giving some thought to how to solve this and, as always, will post here once we’ve got a solution for you.
I have two machines (I cloned one of them just recently).
According to the monitoring and the logs, both machines got scaled down:
2023-05-12T19:05:56.552 proxy [6e82d956a79408] ams [info] Downscaling app peter-kuhmann-website in region ams. Automatically stopping machine 6e82d956a79408. 2 instances are running, 0 are at soft limit, we only need 1 running
2023-05-12T19:05:56.558 app[6e82d956a79408] ams [info] Sending signal SIGINT to main child process w/ PID 513
2023-05-12T19:05:56.746 app[6e82d956a79408] ams [info] Starting clean up.
2023-05-12T19:05:57.746 app[6e82d956a79408] ams [info] [ 405.553727] reboot: Restarting system
2023-05-12T19:07:18.119 proxy [5683d920b1618e] ams [info] Downscaling app peter-kuhmann-website in region ams. Automatically stopping machine 5683d920b1618e. 1 instance is running but has no load
2023-05-12T19:07:18.122 app[5683d920b1618e] ams [info] Sending signal SIGINT to main child process w/ PID 513
2023-05-12T19:07:18.628 app[5683d920b1618e] ams [info] Starting clean up.
2023-05-12T19:07:19.630 app[5683d920b1618e] ams [info] [ 503.747677] reboot: Restarting system
Interesting: the first downscale says “2 instances are running, 0 are at soft limit, we only need 1 running”, so it seems to “know” the min setting.
But the second check doesn’t seem to take it into account: “1 instance is running but has no load”.
Did I miss a specific configuration or precondition?
This is awesome! I literally made a post about this pain point a few days ago, and came to the forums for another reason only to see that it’s implemented as a feature!
I looked at your app and both instances of your app are running in the iad region. However, your primary_region is set to ewr. Autostop only keeps machines running in the primary region of your application. Did you do anything that caused your machines to deploy in iad? If not, then it’s an issue on our side.
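A quick way to double-check this yourself (the placeholders are just examples):

```sh
# List your machines and the region each one is in
fly machines list -a <your-app>

# Remove a machine that ended up in the wrong region
fly machine destroy <machine-id>
```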
Hi Senyo, thanks for helping look into this. I destroyed the iad machines and deployed from my CI build automation, and scaling appears to be working great now. My CI deployment scripts are all configured to deploy to ewr.
What I think happened is that when I was setting things up originally, running commands manually on my desktop several weeks ago to debug various issues, I must have made a typo once and deployed a machine to iad by accident. After that the machine stuck around, and I didn’t notice the region was wrong since my deployments were being done with the CI automation, which didn’t blow away the extra iad machines.
So thank you for helping identify the issue and pointing it out.
To reduce this type of error in the future, I wonder if there is a way to have the fly.toml file be more explicit about the final state of the deployment, so that it could act as a single source of truth?
Your service has no publicly exposed ports. The autostart/autostop feature is driven by our internal proxy, which only knows about your application if it has a service with exposed ports.
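For reference, that means the service section needs a ports block, roughly like this (the ports and values here are only an example):

```toml
# A service the proxy can see: it has publicly exposed ports (illustrative)
[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1

  [[services.ports]]
    port = 80
    handlers = ["http"]

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
```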
But each of my services has min_machines_running = 1 set.
When I have only 2 VMs in the app, this shows:
ord [info] Downscaling app brick-drop-co in region ord from 2 machines to 1 machines. Automatically stopping machine 080e442c5405d8
As you can see, it’s my primary region, and it has ports. I know I can connect to it externally.
Luckily the one it shuts down is the one that starts up fast. I’m not sure why it picks web as the process group that shuts down. Maybe it’s deploy order?
But the minimum isn’t respected, and I’m thinking about setting it to 2.
I’m about to clone and set up HA, and add another region, so I haven’t tried this yet, but I wanted to report the bug for you all.
Thank you for this, it revealed a bug on our side. We just shipped a fix for this that should be rolled out. Let me know if you’re still having issues.
I have a different issue that I’m trying to fix.
I’ve googled, read the docs and the threads here, but without success.
I have a staging environment to test my application, where I want to stop all machines when there is no load (for cost reduction).
I have the .toml file below. It sets min_machines_running = 0 and destroys idle machines. However, since this staging environment is used by only 1 or 2 devs (usually not even simultaneously), machines seem to be destroyed even before I can send a second request to my HTTP service. It seems the idle delay before shutting down the machines is too short. For apps in a production environment with actual concurrent users, that may work fine: when there are no requests, machines can be destroyed immediately. But for staging environments, there should be a minimum idle timeout before machines are destroyed.
Sometimes I log in and, when I try to click some link in the returned web page, I get disconnected afterwards. My app uses session authentication, so it seems the VMs are destroyed and the session is cleared.
I’ve tried changing the concurrency type from “requests” to “connections”, but without success.
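For context, the block I’ve been changing is the service concurrency section, which looks roughly like this (the limits shown are placeholders, not my exact values):

```toml
# Concurrency settings on the service (placeholder limits)
[http_service.concurrency]
  type = "requests"   # also tried "connections"
  soft_limit = 20
  hard_limit = 25
```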