Autoscaling is not triggered on a pure websocket application

Yep, scaling manually works:

$ fly scale count 3
Count changed to 3
$ fly status
App
  Name     = xxx
  Owner    = jitter
  Version  = 37
  Status   = running
  Hostname = xxx

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED
68853dcd        app     37      cdg     run     running 1 total                 0               9s ago
01fdbc50        app     37      lax     run     running 1 total, 1 passing      0               55m3s ago
c1105a0d        app     37      cdg     run     running 1 total, 1 passing      1               55m23s ago

And I configured 2 regions in the pool:

$ fly regions list
Region Pool:
cdg
lax
Backup Region:

Hello! We just tracked down an internal bug in the autoscaling service that was causing scaling events to not get triggered properly; sorry for the inconvenience. We’ve deployed a fix, so it should be working correctly now; please give it another try!


Yep all good now, thanks :slight_smile:

The autoscaling worked perfectly for a day or two, but now the app is stuck on 1 instance again.

We’ve been investigating the scaling behavior in your websocket application since yesterday.
We identified a bug in the fly_app_concurrency metric (which is used to make autoscaling decisions) that incorrectly lowers the value to zero if there are no changes in concurrency for 60 (edit: 30) seconds. Any app that maintains many existing, long-lived connections without frequent connects/disconnects (which seems to be the case for your application’s persistent websocket connections) will hit this edge case, causing scaling to not occur as expected.
We’re working on a fix and I will let you know when it’s deployed.

Ok cool. FYI the app is seeing quite a few connections all the time, so the concurrency is not constant as you describe:

2022-07-28T20:23:09Z app[f6dfa2e3] lax [info]20:23:09.664 [info] CONNECTED TO JitterBackendWeb.UserSocket in 55µs
2022-07-28T20:23:09Z app[f6dfa2e3] lax [info]  Transport: :websocket
2022-07-28T20:23:09Z app[f6dfa2e3] lax [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:09Z app[f6dfa2e3] lax [info]  Parameters: %{"user_id" => "99ef97e0-0e97-11ed-80ec-8d47cb9069cc", "vsn" => "2.0.0"}
2022-07-28T20:23:09Z app[9a0129e3] cdg [info]20:23:09.935 [info] CONNECTED TO JitterBackendWeb.UserSocket in 80µs
2022-07-28T20:23:09Z app[9a0129e3] cdg [info]  Transport: :websocket
2022-07-28T20:23:09Z app[9a0129e3] cdg [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:09Z app[9a0129e3] cdg [info]  Parameters: %{"user_id" => "6e173430-d68d-11ec-ae95-f9189a6be20d", "vsn" => "2.0.0"}
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]20:23:10.173 [info] CONNECTED TO JitterBackendWeb.UserSocket in 71µs
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]  Transport: :websocket
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]  Parameters: %{"user_id" => "03e93b30-0c7f-11ed-bef6-0327b45a4e9c", "vsn" => "2.0.0"}
2022-07-28T20:23:10Z app[9a0129e3] cdg [info]20:23:10.617 [info] CONNECTED TO JitterBackendWeb.UserSocket in 49µs
2022-07-28T20:23:10Z app[9a0129e3] cdg [info]  Transport: :websocket
2022-07-28T20:23:10Z app[9a0129e3] cdg [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:10Z app[9a0129e3] cdg [info]  Parameters: %{"user_id" => "bc2a66a0-0d7a-11ed-b652-63f3f415c476", "vsn" => "2.0.0"}
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]20:23:10.859 [info] CONNECTED TO JitterBackendWeb.UserSocket in 57µs
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]  Transport: :websocket
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:10Z app[f6dfa2e3] lax [info]  Parameters: %{"user_id" => "28fc3ec0-eb21-11eb-a86a-d955fad77a38", "vsn" => "2.0.0"}
2022-07-28T20:23:12Z app[f6dfa2e3] lax [info]20:23:12.511 [info] JOINED project:BlmpqXs5hEfmiqRsw15ZF in 29µs
2022-07-28T20:23:12Z app[f6dfa2e3] lax [info]  Parameters: %{}
2022-07-28T20:23:12Z app[9a0129e3] cdg [info]20:23:12.894 [info] CONNECTED TO JitterBackendWeb.UserSocket in 79µs
2022-07-28T20:23:12Z app[9a0129e3] cdg [info]  Transport: :websocket
2022-07-28T20:23:12Z app[9a0129e3] cdg [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:12Z app[9a0129e3] cdg [info]  Parameters: %{"user_id" => "6e173430-d68d-11ec-ae95-f9189a6be20d", "vsn" => "2.0.0"}
2022-07-28T20:23:13Z app[f6dfa2e3] lax [info]20:23:13.481 [info] CONNECTED TO JitterBackendWeb.UserSocket in 54µs
2022-07-28T20:23:13Z app[f6dfa2e3] lax [info]  Transport: :websocket
2022-07-28T20:23:13Z app[f6dfa2e3] lax [info]  Serializer: Phoenix.Socket.V2.JSONSerializer
2022-07-28T20:23:13Z app[f6dfa2e3] lax [info]  Parameters: %{"user_id" => "99ef97e0-0e97-11ed-80ec-8d47cb9069cc", "vsn" => "2.0.0"}

Please let me know if I can help.

I double-checked, and the metric timeout that triggers the scaling bug is actually 30 seconds, not 60.

As long as the app has a steady stream of connects/disconnects the concurrency metric will be accurate, but any gap longer than 30 seconds causes the metric to reset to zero and remain incorrectly low.

Though your app does generally have quite a few connections, the metrics show a few brief gaps (for instance, 2022-07-27 from 04:51:42 → 04:52:20 UTC) that caused the concurrency metric to drop lower than it should have, keeping your app scaled at 1 instance.
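
If you want to keep an eye on this yourself, fly_app_concurrency is queryable from the app’s Prometheus metrics (for example in the managed Grafana dashboards). A rough query sketch, assuming the usual app/region labels and with the app name as a placeholder:

sum by (region) (fly_app_concurrency{app="<your-app-name>"})

A dip toward zero while connections are clearly still open is this bug showing up.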


Is this using soft-limit or hard-limit? I’m trying to size things correctly myself.

It’s using both: soft_limit=400 and hard_limit=500, so autoscaling kicks in before the connection limit is reached.
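
For reference, that comes from the [services.concurrency] section of our fly.toml, roughly like this (the rest of the services block omitted):

  [services.concurrency]
    type = "connections"
    hard_limit = 500
    soft_limit = 400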


Yah, this is working for me pretty nicely now. Now just to figure out how to tune/balance the regions appropriately…

Update: a fix has now been deployed. The fly_app_concurrency metric should now remain accurate for applications with many long-lived connections (like websockets), so autoscaling will trigger more reliably.

Thanks for reporting this issue!


Hello again,

I have been running new experiments to size the memory consumption of our app. Initially hard_limit and soft_limit were too high (500 and 400), and I started getting OOM errors, which made the VMs crash (I think) and in turn made autoscaling go from 2 instances up to the max of 10.

I tried to lower the limits in fly.toml to 250 / 200, then to 100 / 50, each time doing a new flyctl deploy, but that didn’t help. I had to scale the VMs’ memory up to stop them from restarting over and over. Then I retried a deployment and the limits finally seemed to settle at 100 / 50.
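
In case it helps anyone else, bumping the memory is a one-liner with flyctl; something like this (the 1024 MB value here is just an example, not necessarily what we settled on):

$ flyctl scale memory 1024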

Now I would like to push the limits up to 200 / 100 again, but fly deploy doesn’t seem to do anything; the VMs are still on a hard limit of 100:

2022-08-30T14:31:49Z proxy[b5a07233] cdg [warn]Instance reached connections hard limit of 100

Furthermore, the cluster now seems to be stuck on 7 instances.

What should I do to change soft_limit and hard_limit?

The autoscaling seems to be broken again, now capping connections at the soft_limit (250 here).

If connections are being rerouted to other, less-loaded instances once the soft_limit is reached on an instance, that’s exactly how the proxy’s load-balancing behavior is designed to work. Autoscaling adds and removes instances so that there’s enough total capacity for the current number of connections to run within the soft_limit, and it scales up when the total load exceeds that. In other words, when all of the instances reach the soft_limit I would expect a new instance to be added, which is what I’m seeing here, so it seems to be working as expected.
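
To make that concrete with the numbers from this thread (a rough sketch of the arithmetic, not the exact internal thresholds):

With soft_limit = 250 and, say, 3 instances running:
  soft capacity = 3 × 250 = 750 concurrent connections
  total connections climbing past ~750 → a 4th instance is added (up to the max count)
  total connections falling well below 2 × 250 = 500 → an instance can be removed (down to the min count)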


I see; our need is more to start instances where there are more users rather than to balance them globally. That is actually why I chose the “standard” autoscaling mode, thinking “balanced” would do the opposite. Thanks for your explanation, it is very clear now :+1:

I switched to “balanced” yesterday, but the VMs still cap at 250 connections instead of starting a new VM in the same region.

$ flyctl autoscale show
     Scale Mode: Balanced
      Min Count: 3
      Max Count: 15

Ah, I think I understand what you are looking for now (instances scaling up in the specific region where the soft limit has been reached). Indeed, the existing documentation did describe this kind of region-aware behavior:

  • Standard: Instances of the application, up to the minimum count, are evenly distributed among the regions in the pool. They are not relocated in response to traffic. New instances are added where there is demand, up to the maximum count.

I dug into this recently and found that our current (Nomad-based) autoscaler implementation was deployed around September 2021, but this documentation was originally written for an earlier system that had somewhat different behavior. Most notably, there’s only one ‘mode’ in the current autoscaling system (and I believe the scaling action is a simple adjustment of the total instance count), so setting ‘balanced’ mode no longer has any effect.

Apologies for the confusion our quite out-of-date documentation has caused on this. I’ve prepared some updates to the documentation and flyctl that should clarify the current behavior a bit better moving forward.

That said, it does look like the current autoscaler is not quite capable of what you’re looking for. Our current efforts have been focused on adding features to Machines to help manage more advanced scaling requirements like this. Autoscaling apps on Machines isn’t quite ready yet, but it’s something we’re actively working on, and it should eventually be able to handle your use case better.


Ok, I think we will disable autoscaling for launch and revisit the topic later, then. We don’t have much traffic for now, so we can totally live with a fixed set of VMs in specific spots around the globe (the main goal being optimal latency).

However, I’m not sure how to do that. I have this set of regions configured:

Region Pool:
cdg
hkg
lax
maa
scl
syd
Backup Region:
ams
sjc

How can I start a VM in each of these regions? I tried disabling autoscaling and setting the VM count to 6 with flyctl scale count 6, hoping it would fill the spots, but instead it put 4 of them in just lax and hkg:

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS     	RESTARTS	CREATED
6a3dd886	app    	950    	hkg   	run    	pending	                  	0       	8s ago              	
10cbf6c9	app    	950    	lax   	run    	pending	                  	0       	8s ago              	
2c3a79e6	app    	950    	lax   	run    	pending	                  	0       	8s ago              	
dd84e7c3	app    	950    	hkg   	run    	pending	                  	0       	8s ago              	
d24f1a6d	app    	950    	scl   	run    	running	2 total, 2 passing	0       	2h24m ago           	
784474bd	app    	950    	cdg   	run    	running	2 total, 2 passing	0       	2022-10-21T13:20:11Z	

I also tried switching back to autoscaling with a min count of 6 but got the same result.

Fly’s main selling point, for us, is starting VMs close to users; how can I do that? :slight_smile:

The --max-per-region option to flyctl scale count is probably what you’re looking for. For example, flyctl scale count 6 --max-per-region 1 should start a single VM in each of 6 different regions.
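
For completeness, the pool of candidate regions is managed separately with the flyctl regions commands; if I have the syntax right, something like this pins the pool to the six regions you listed (your pool already looks like this, so it’s just for reference):

$ flyctl regions set cdg hkg lax maa scl syd

Combined with --max-per-region 1, the 6 VMs should then end up one per region.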


Thanks, that should work for us. Do you have plans for autoscaling that would better fit our needs? Is there an issue, a changelog, or a Twitter account I can follow to stay updated on this subject?