Hi there - I just wanted to report some weird scaling behavior that I came across today…
Yesterday one of my apps (foundry-imgproxy-v2) started returning errors like the following (I only had one machine running, in LAX, at the time):
could not find a good candidate within 90 attempts at load balancing. last error: unreachable worker host. the host may be unhealthy. this is a Fly issue.
and also
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)
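Side note in case anyone else hits the same errors: as far as I know, the quickest things to check are machine state and recent logs, e.g.
fly status --app foundry-imgproxy-v2
fly logs --app foundry-imgproxy-v2
though I can't say whether either would have explained what happened here.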
Unsure why I was suddenly getting this error, I finally decided to go ahead and scale up another machine in SJC.
I ran the following:
fly scale count 2 --region lax,sjc --max-per-region 1 --app foundry-imgproxy-v2
Which confirmed that it would be scaling up one machine in LAX and one in SJC.
After that, two machines showed as running.
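(The per-region placement should also be visible with
fly machines list --app foundry-imgproxy-v2
if I have the subcommand right.)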
Nice. I thought I fixed it.
Then, as of a few hours ago, the mysteriously dead machine started running again, so I had 2 machines in LAX and 1 in SJC instead of just 1 in each region.
Running the same command again:
fly scale count 2 --region lax,sjc --max-per-region 1 --app foundry-imgproxy-v2
This fixed things, but it's quite unsettling to know that the scaling wasn't following my instructions.
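(If it's useful, my understanding is that
fly scale show --app foundry-imgproxy-v2
reports the count and region settings the app is currently configured with, so that's what I'd compare against what actually ends up running; correct me if that's not the right command.)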
Here you can see this in my Grafana: when the app went down, when I scaled up, and when the ghost machine came back online.
Recap of image:
- 12:00: blue line is my solo LAX instance
- 18:00: app goes down (?)
- 20:00: I run the scale command for LAX and SJC, one machine per region
- (next day) 12:30: the dead machine mysteriously comes back online
- (next day) 1:00pm: I run the scale command again and my extra LAX instance (yellow) is turned off
Hope this helps!