Autoscaling causing 502s

charsleysa · February 3, 2021, 4:50am

Hi there,

I’ve noticed there is a spike in 502 responses whenever there is a fly.io autoscaling action.

Is this expected? Is there anything we can do to minimise 502 responses?

Cheers,
Stefan

nahtnam · February 3, 2021, 7:40am

As a side note, I’ve noticed that when I run a scale command my site goes down for 10-15 seconds. While that’s reasonable, it would be nice if the old pods could stay up while the new ones get created.

jerome · February 3, 2021, 2:16pm

There’s currently lag replicating our state across our fleet causing this downtime. This shouldn’t be happening when using the “rolling” deployment strategy (the default one, unless you have a volume attached to your VM), but it does happen sometimes due to state replication lag.

We’re working on this right now as one of our top priorities. Apps appearing down is not acceptable.

I’m expecting we’ll launch improvements this week and keep working on it from there.

jerome · February 9, 2021, 5:49pm

@charsleysa @nahtnam We’ve launched some improvements that should help a lot with these errors. Let us know how it goes!

khuezy · October 31, 2023, 2:39pm

Hi Jerome, sorry for reviving this old thread, but I’ve been seeing this behavior over the last couple of days.
I’m new to Fly and deployed a test app last week, and it was working fine with autoscaling to 0. When I visited the site, it would load (after a short boot time), but now when I load the app from a stopped state, it returns 502 for about 30 seconds.

I’m a little concerned about this for a production app. For example, if the app scales from 1 => 2 containers, new requests routed to the new container might get 502s.
I’ve seen issues with the opposite case too: when an app scales from 2 => 1 (e.g. during a bluegreen deployment), the request fails because it’s trying to communicate with the container that was killed.
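
For anyone hitting the same thing from the client side, here is a rough sketch of retrying on 502 with exponential backoff while a machine starts or stops. This is only a workaround under my own assumptions, not something Fly recommends in this thread: the function names, the attempt count, and the delays are all made up for illustration, and it is only safe for idempotent requests.

```python
import time
import urllib.error
import urllib.request


def backoff_delays(count, base=0.5, cap=8.0):
    """Return `count` exponentially growing delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(count)]


def get_with_retry(url, attempts=5, fetch=urllib.request.urlopen, sleep=time.sleep):
    """GET `url`, retrying only on HTTP 502 (hypothetical helper, not a Fly API).

    `fetch` and `sleep` are injectable so the retry logic can be exercised
    without real network calls or real waiting.
    """
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            with fetch(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 502, and give up on the last attempt.
            if err.code != 502 or attempt == attempts - 1:
                raise
        sleep(delays[attempt])
```

With the defaults this waits 0.5s, 1s, 2s, and 4s between the five attempts, so it would not cover the full 30 seconds of 502s described above unless the attempt count or delays are raised.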

Edit:
This was a relatively new error:
instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1

In my Dockerfile, I added:
ENV HOST 0.0.0.0

Now that error no longer appears, and the logs show the following again, as they did last week:
machine became reachable in <TIME>ms
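
To illustrate what that error message is asking for, here is a minimal sketch of a server that takes its bind address from the environment and defaults to all interfaces. It assumes the app reads `HOST` and `PORT` itself; my real app is not this code, and `resolve_bind_address`, `make_server`, and `HealthHandler` are hypothetical names. The 3000 default just mirrors the port in the error above.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def resolve_bind_address(env):
    """Return (host, port) from an environment mapping.

    Defaults to 0.0.0.0 so the server is reachable from outside the machine;
    binding to 127.0.0.1 would only accept connections from inside it.
    """
    host = env.get("HOST", "0.0.0.0")
    port = int(env.get("PORT", "3000"))
    return host, port


class HealthHandler(BaseHTTPRequestHandler):
    """Tiny handler that answers every GET with a plain-text 'ok'."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(env=os.environ):
    """Create (but do not start) an HTTP server bound to the resolved address."""
    return ThreadingHTTPServer(resolve_bind_address(env), HealthHandler)
```

Calling `make_server().serve_forever()` starts it. With `ENV HOST 0.0.0.0` in the Dockerfile, or with `HOST` unset, this listens on 0.0.0.0:3000.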
