Weird machine scaling behavior when instance went down

Hi there - I just wanted to report some weird scaling behavior that I came across today…

Yesterday one of my apps (foundry-imgproxy-v2) started returning errors like the following (I only had one machine running in LAX at the time):

could not find a good candidate within 90 attempts at load balancing. last error: unreachable worker host. the host may be unhealthy. this is a Fly issue.

and also

could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)
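In case anyone else hits this, the hints in that error can be gone through with the usual flyctl commands (as far as I know; nothing here is specific to my app):

# see machine states and whether a deployment is in progress
fly status --app foundry-imgproxy-v2

# tail logs from the region that's erroring
fly logs --app foundry-imgproxy-v2 --region lax

# show the app config, including any concurrency hard_limit the hint mentions
fly config show --app foundry-imgproxy-v2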

Unsure why I was suddenly getting this error, I decided to go ahead and scale up another machine in SJC.

I ran the following:

fly scale count 2 --region lax,sjc --max-per-region 1 --app foundry-imgproxy-v2

This confirmed that it would be scaling up one machine each in LAX and SJC.

After that, two machines showed as running.

Nice. I thought I fixed it.
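For anyone following along, a quick way to sanity-check what is actually placed where is to list the machines directly (I'm going from memory on the exact flags, so double-check against fly help):

# list each machine with its id, region and state
fly machine list --app foundry-imgproxy-v2

# or the configured count per region at a higher level
fly scale show --app foundry-imgproxy-v2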

Then, as of a few hours ago, my mysteriously dead machine started running again, so I had 2 machines in LAX and 1 in SJC instead of just 1 in each region.

Running the same command again:

fly scale count 2 --region lax,sjc --max-per-region 1 --app foundry-imgproxy-v2

That fixed things, but it's quite unsettling to know that scaling wasn't following my instructions.
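In hindsight, instead of re-running the scale command, I think the extra LAX machine could also have been removed directly by id (the id below is just a placeholder, not a real machine):

# stop the surplus machine, then remove it (id is a placeholder)
fly machine stop <machine-id> --app foundry-imgproxy-v2
fly machine destroy <machine-id> --app foundry-imgproxy-v2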

Here you can see this in my Grafana dashboard: when the app went down, when I scaled up, and when the ghost machine came back online.

Recap of image:

  • Blue line at 12:00 is my solo LAX instance
  • The app goes down (?) at 18:00
  • At 20:00 I run the scale command for LAX and SJC, one machine per region
    (next day)
  • At 12:30 the dead machine mysteriously comes back online
  • At 13:00 I run the scale command again, and the extra LAX instance (yellow) is turned off

Hope this helps!
