Automatically starting/stopping Apps v2 instances

Built-in health checks ([[services]]) shouldn’t wake Machines up, but they did, at least for us. So we had to remove them: github/serverless-dns/pull/148

Custom health checks ([checks]), however, haven’t been waking up Machines (onboarded onto Apps V2).
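
For anyone else hitting this, here’s roughly how the two kinds of checks look in a fly.toml (ports, paths, and intervals are just illustrative):

```toml
# Built-in service checks live under [[services]]; these are the ones
# that were waking our Machines up.
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"

# Custom top-level checks live under [checks]; these haven't been
# waking Machines up for us.
[checks]
  [checks.alive]
    type = "http"
    port = 8080
    path = "/health"
    interval = "15s"
    timeout = "2s"
```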

1 Like

Those didn’t wake machines up; they prevented the in-VM proxy from shutting down. When our proxy starts/stops things, it isn’t even aware of health checks.

1 Like

That’s part of our efforts to improve Fly apps’ availability. See Increasing Apps V2 availability.

Awesome, great to hear. There was one other small fix that we deployed to get it working consistently. If you come across any issues again, please let us know!

2 Likes

I’ve migrated an internal low-volume app from v1 to v2 where a single node is enough, but now if that host goes down, my service becomes entirely unavailable. For that reason I wanted to add another “standby” node with auto-start/stop, but it’s an internal service that doesn’t even have a [[services]] section (it’s being called from a different app using top1...)

I presume it’s not possible (at least at the moment) to do auto-start/stop with internal network services, right?

2 Likes

Yep, unfortunately this won’t work. You’d have to start/stop the machine manually via the API.

That’s unfortunate. Any plans to support this in the future?

What would be the recommendation for handling cases where you have one gateway Fly app exposed to the internet, with auto_{stop,start} enabled, routing requests to backend apps that are internal services?
Perhaps one way to handle this would be some sort of notification mechanism, similar to AWS spot interruption notices, that an app process could poll continuously and use to trigger stopping of dependent apps?

1 Like

Not at the moment. It is something we’ve thought about before but it’s quite complex to do and we just haven’t found the time to dedicate to solving it yet.

If you want to take advantage of the autostart/autostop feature directly and you’re fine with defining [[services]] for your internal apps, you could do that and then ensure all the internal services have a Flycast IP and no public IPs. Communicating over Flycast will make this feature available to you.
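
As a rough sketch (ports and values are illustrative, and this assumes you’ve allocated a Flycast address, e.g. with fly ips allocate-v6 --private, and released any public IPs), the internal app’s fly.toml could look something like:

```toml
# fly.toml for an internal app reached only over its Flycast address
[[services]]
  internal_port = 8080
  protocol = "tcp"
  auto_stop_machines = true      # proxy stops idle Machines
  auto_start_machines = true     # proxy starts them on incoming traffic
  min_machines_per_region = 0    # allow scaling all the way to zero

  [[services.ports]]
    port = 80
    handlers = ["http"]
```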

Alternatively, you’d have to implement the start/stop functionality in your own system. One way you could do that is by having your app start the “standby” if it fails to connect to the primary machine. There are likely other topologies that would make sense depending on how your system is put together.

3 Likes

Thanks @senyo. I’m not against using Flycast, as that should bring most of the features to internal apps/services. Are there any downsides to going that route?

If you need control over routing, i.e. exactly which machine a request is sent to, you lose that control using Flycast (unless you use fly-replay). Otherwise, there’s no downside to using Flycast.

1 Like

It works perfectly now, thanks!

1 Like

Is anyone having trouble with auto_stop_machines today in AMS?
Yesterday the proxy started my app on demand, but today it doesn’t work anymore. The machine stays suspended and doesn’t receive any signal to start again.

auto_stop_machines works great and downscales everything.

Got:

Failed to proxy HTTP request (error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? if not, this could be a delayed state issue)). Retrying in 947 ms (attempt 90)

I don’t have any volumes attached to this app, and the proxy has shut down the app. No ongoing deployment.
Yesterday it worked fine.

1 Like

We deployed a change yesterday that caused this regression. We’re reverting it at the moment; it should start working again soon.

1 Like

It worked again less than an hour after your message.

1 Like

Love this feature. I have a perfect use case for it: an instance of an image proxy, which is only needed on demand. The stopped machine seems to get started within 0.1-0.5 seconds, which is fine for me.

6 Likes

Is the kill_timeout setting taken into account now?

Not yet, but thanks for the reminder. I’ll look into it!

1 Like

If the proxy respected kill_timeout and kill_signal, that’d be nice. Any timelines?
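
For context, I mean these top-level fly.toml settings (values illustrative):

```toml
# How a Machine's main process is asked to shut down
kill_signal = "SIGINT"   # default is SIGINT
kill_timeout = 30        # seconds to wait before a hard kill; default is 5
```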

Also:

Does the above condition hold when auto_start_machines and auto_stop_machines are not used? This use case was unsupported before [0]. From my experience, multiple Machines in the same region, once spun up, never went idle: if two Machines in region xyz were spun up, both would be sent incoming connections despite both being well below their soft_limits. Ideally, I’d expect Fly-Proxy to pick one Machine over the other until its soft_limit was breached.
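
(For reference, by soft_limit I mean the per-service concurrency settings, roughly like this; the numbers are illustrative:)

```toml
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    type = "connections"   # or "requests"
    soft_limit = 20        # proxy prefers Machines below this
    hard_limit = 25        # proxy stops sending new work past this
```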

Or should I use the --ha=false flag, as mentioned here? fly migrate-to-v2 - Automatic migration to Apps V2 - #45 by JP_Phillips


[0]

flyd now respects those two configuration options on a machine.

I can’t speak to the load balancing decisions of the proxy with respect to soft_limits.

1 Like

I was excited to use this feature to automatically start and stop new web servers as needed for my small Mastodon instance, since traffic varies a lot.

Strangely, it often downscales a machine and then immediately restarts that machine despite no increase in requests. As a result, I often have two machines running (with one of them frequently restarting) even when the request load is well below the configured soft limit.

Here’s what I see in the logs when this happens:

2023-05-26T22:40:41Z proxy [e2865756b1e486] sea [info]Downscaling app pie-gd-mastodon-v2 in region sea. Automatically stopping machine e2865756b1e486. 2 instances are running, 0 are at soft limit, we only need 1 running
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]metrics   | Interrupting...
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]rails     | Interrupting...
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]streaming | Interrupting...
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]caddy     | Interrupting...
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]metrics   | ts=2023-05-26T22:40:41.238Z caller=main.go:542 level=info msg="Received os signal, exiting" signal=interrupt
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]streaming | WARN Worker 1 exiting
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]metrics   | signal: interrupt
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]caddy     | signal: interrupt
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]streaming | WARN Worker 1 exiting
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]rails     | - Gracefully stopping, waiting for requests to finish
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]rails     | === puma shutdown: 2023-05-26 22:40:41 +0000 ===
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]rails     | - Goodbye!
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]rails     | Exiting
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]Sending signal SIGINT to main child process w/ PID 513
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]streaming | signal: interrupt
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]streaming |
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]rails     | signal: interrupt
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]Starting clean up.
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]hallpass exited, pid: 514, status: signal: 15
2023-05-26T22:40:41Z app[e2865756b1e486] sea [info]2023/05/26 22:40:41 listening on [fdaa:0:d7b2:a7b:124:eace:e441:2]:22 (DNS: [fdaa::3]:53)
2023-05-26T22:40:42Z app[e2865756b1e486] sea [info][  355.537693] reboot: Restarting system
2023-05-26T22:40:52Z proxy[e2865756b1e486] sea [info]Starting machine
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]Starting init (commit: 9bb7ee8)...
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]Preparing to run: `hivemind Procfile.mastodon` as mastodon
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]2023/05/26 22:40:53 listening on [fdaa:0:d7b2:a7b:124:eace:e441:2]:22 (DNS: [fdaa::3]:53)
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]rails     | Running...
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]streaming | Running...
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]caddy     | Running...
2023-05-26T22:40:53Z app[e2865756b1e486] sea [info]metrics   | Running...
2023-05-26T22:40:53Z proxy[e2865756b1e486] sea [info]machine started in 278.142947ms

This app uses Hivemind to start a few processes (Rails, Node.js, Caddy, and a statsd exporter), but other than that it’s not doing anything special.

All the config files I use for this server are public, and you can see them here. Have I misconfigured something, or is this possibly a bug?

Unfortunately, not at the moment. The proxy team’s capacity is spread thin right now. We will likely have time once a good portion of the Apps v2 migration is completed.

This behaviour is still the same: requests are load-balanced across all your running machines. It’s effectively a round-robin load balancing approach.

1 Like