Hi there - I just wanted to report some weird scaling behavior that I came across today…
Yesterday one of my apps (foundry-imgproxy-v2) started returning errors like the following (I only had one machine running, in LAX, at the time):
could not find a good candidate within 90 attempts at load balancing. last error: unreachable worker host. the host may be unhealthy. this is a Fly issue.
and also
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)
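Side note in case anyone else hits the same errors: as far as I know, the quickest things to check are machine state and recent logs, e.g.
fly status --app foundry-imgproxy-v2
fly logs --app foundry-imgproxy-v2
though I can't say whether either would have explained what happened here.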
Unsure why I was suddenly getting this error, I finally decided to go ahead and scale up another machine in SJC.
I ran the following:
fly scale count 2 --region lax,sjc --max-per-region 1 --app foundry-imgproxy-v2
Which confirmed that it would be scaling up one machine in LAX and one in SJC.
After that, two machines showed as running.
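(The per-region placement should also be visible with
fly machines list --app foundry-imgproxy-v2
if I have the subcommand right.)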
Nice. I thought I fixed it.
Then, as of a few hours ago, the mysteriously dead machine started running again, so I had 2 machines in LAX and 1 in SJC instead of just 1 in each region.
Running the same command again:
fly scale count 2 --region lax,sjc --max-per-region 1 --app foundry-imgproxy-v2
This fixed things, but it's quite unsettling to know that the scaling wasn't following my instructions.
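(If it's useful, my understanding is that
fly scale show --app foundry-imgproxy-v2
reports the count and region settings the app is currently configured with, so that's what I'd compare against what actually ends up running; correct me if that's not the right command.)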
Here you can see this in my Grafana: when the app went down, when I scaled up, and when the ghost machine came back online.
Recap of image:
- 12:00: blue line is my solo LAX instance
- 18:00: app goes down (?)
- 20:00: I run the scale command for LAX and SJC, one machine per region
- (next day) 12:30: the dead machine mysteriously comes back online
- (next day) 1:00pm: I run the scale command again and my extra LAX instance (yellow) is turned off
Hope this helps!