migrate-to-v2 Now Supports Nomad Autoscaling Apps

We’ve been working hard to get all of our apps onto Apps V2. While lots of apps have been converted without a hitch, we found some that were incompatible for one reason or another. One of those reasons was that Apps V1 supported autoscaling, meaning that allocs could automatically be created and destroyed based on incoming traffic. We don’t have that exact same feature ready for Apps V2, but we have a pretty similar one called autostart/stop. To make sure that autoscaled apps will still scale up and down, we create autoscale_max - autoscale_min machines configured to autostart and autostop, plus autoscale_min machines that always stay running.
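As a concrete illustration (the numbers here are made up): an app migrated with autoscale_min = 2 and autoscale_max = 5 ends up with 5 machines, 3 of which autostart/autostop, roughly equivalent to a service configured like this:

[[services]]
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 2  # the old autoscale_min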

This new migrate-to-v2 feature should be available in v0.1.29.
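If you’re on an older flyctl, re-running the install script is one way to pick up the new version (the same script handles upgrades):

curl -L https://fly.io/install.sh | sh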

7 Likes

Okay, how do I set this parameter for v2 apps?

Also, autostart/stop not respecting kill_signal and kill_timeout (defined in fly.toml) is a deal breaker, as our code relies on them for graceful exits.

To do something like this in Apps V2, you’d want to set the following options for each service you have:

[[services]]
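  # stop idle machines and start them again on demand,
  # always keeping at least <min_machines> running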
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = <min_machines>

Also, where did you see that autostart/stop doesn’t respect kill signal and kill timeout? Is it just something you observed?

1 Like

Thanks. Is max_machines_running settable, too?

Here:

So there isn’t a way to set it in fly.toml that I’m aware of, but with autostart and autostop set in fly.toml, you can use fly scale count or fly machine clone to create as many standby machines as you’d like.
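For example (the app name and counts are illustrative):

# run 6 machines total; with autostop on, anything beyond
# min_machines_running acts as on-demand standby capacity
fly scale count 6 -a my-app

# or clone an existing machine
fly machine clone <machine-id> -a my-app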

EDIT: The section below is wrong, actually. We’re working on fixing this.
As for kill_timeout and kill_signal being respected in fly.toml, it looks like that should now be the case (from my testing).

1 Like

Great news! Autostop should now respect kill_timeout and kill_signal.
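For reference, those are top-level fly.toml settings, e.g. (values illustrative):

kill_signal = "SIGINT"  # signal sent to your process on stop
kill_timeout = 5        # seconds to wait before a hard kill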

1 Like

Is it possible to define the maximum number of machines per region (equivalent to --max-per-region for nomad instances)?

Oh! I already found in the documentation how to do that using machines.
Sorry!
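In case it helps anyone else: if I’m reading the docs right, fly scale count takes a --max-per-region flag, e.g.:

fly scale count 10 --max-per-region 2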

3 Likes

I moved my apps to this. It works! But some of the logs look concerning, and I’m not sure whether they’re benign:

# sigint on auto-exit
2023-06-13T07:52:56Z app[9e784e671b4283] bom [info]2023-06-13T07:52:56.385991Z  INFO init::process: Sending signal SIGINT to main child process w/ PID 513

2023-06-13T07:52:56Z app[9e784e671b4283] bom [info]stopping proc, times-up

2023-06-13T07:52:56Z app[9e784e671b4283] bom [info]2023-06-13T07:52:56.679927Z  INFO libinit::linux::proc: Main child exited normally with code: 0

2023-06-13T07:52:56Z app[9e784e671b4283] bom [info]2023-06-13T07:52:56.682032Z  WARN init::hallpass: hallpass exited, pid: 514, status: signal: 15 (SIGTERM)

# all good so far, the machine has quit gracefully

...

# but then... restarting system?

2023-06-13T07:52:57Z app[9e784e671b4283] bom [info][  341.536563] reboot: Restarting system

# health check failures?
2023-06-13T07:53:05Z health[9e784e671b4283] bom [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [443, 8080] will have intermittent failures until the health check passes.

The restart policy (rdns-dev) is left at the default (from flyctl m status 9e784e671b4283 -d): "restart": {}

The reboot is a Firecracker implementation detail: you can shut down a Firecracker instance by issuing a reboot, which is what we do. It’s benign, and we should probably hide it to limit any confusion.

As for the health check: in this context it’s failing because your machine is stopped.

1 Like

Thanks @senyo

After moving to auto-start/auto-stop, I often see a machine wind down in the logs, immediately followed by the same machine starting back up:

# quitting because there's 2 machines?
2023-06-13T15:20:15Z proxy [e148e452addd89] yyz [info]Downscaling app udns in region yyz. Automatically stopping machine e148e452addd89. 2 instances are running, 0 are at soft limit, we only need 1 running

# starting back up after 5s ...
2023-06-13T15:20:20Z app[e148e452addd89] yyz [info]2023-06-13T15:20:20.240Z I NodeJs http-check listening on: [::]:8888
2023-06-13T15:20:20Z app[e148e452addd89] yyz [info]2023-06-13T15:20:20.241Z I NodeJs DoT listening on: [::]:10000
2023-06-13T15:20:20Z app[e148e452addd89] yyz [info]2023-06-13T15:20:20.241Z I NodeJs DoH listening on: [::]:8080
2023-06-13T15:20:20Z proxy[e148e452addd89] yyz [info]machine became reachable in 618.289451ms

Curiously, there’s only one machine in yyz. Btw, this happens in most other regions too. Here’s the scale characteristic for the udns app:

➜ fly scale show -a udns  
VM Resources for app: udns

Groups
NAME	COUNT	KIND  	CPUS	MEMORY	REGIONS                                                                                                                                                
app 	39   	shared	1   	256 MB	ams(2),arn,atl,bog,bom(2),bos,cdg,den,dfw,ewr,eze,fra(2),gdl,gig,gru,hkg,iad,jnb,lax,lhr(2),mad,mia,nrt,ord,otp,phx,qro,scl,sea,sin(2),sjc,syd,yul,yyz

Only ams, bom, fra, and sin have 2 machines; the rest have 1. Am I hitting a bug here, with Fly expecting at least 2 machines per region for all regions?

This shouldn’t be the case; it means there’s a bug on our side.

Fwiw, you’re not the first person to report this, so we’ll give it a look. Are you by any chance using websockets? Those have seemed to cause trouble in the past.

No websockets, but raw TCP.

Hopefully this gets fixed soon. Thanks!

1 Like

We’re rolling out a fix for this now; it should be complete in the next couple of minutes. Let me know if you’re still seeing the issue.

1 Like