Autoscaling causing 502s

Hi there,

I’ve noticed there is a spike in 502 responses whenever there is a fly.io autoscaling action.

Is this expected? Is there anything we can do to minimise 502 responses?

Cheers,
Stefan

As a side note, I’ve noticed that when I run a scale command my site goes down for 10-15 seconds. While it’s reasonable, it would be nice if the old pods could stay up while the new ones are created.

This downtime is caused by lag in replicating our state across our fleet. It shouldn’t happen when using the “rolling” deployment strategy (the default one, unless you have a volume attached to your VM), but replication lag means it sometimes does.

We’re working on this right now as one of our top priorities. Apps appearing down is not acceptable.

I’m expecting we’ll launch improvements this week and keep working on it from there.

@charsleysa @nahtnam We’ve launched some improvements that should help a lot with these errors. Let us know how it goes!

Hi Jerome, sorry for reviving this old thread, but I’ve been seeing this behavior over the last couple of days.
I’m new to fly. I deployed a test app last week and it was working fine with autoscaling to 0: when I visited the site, it would load (after a short bootup time). Now, when I load the app from a stopped state, it returns 502 for about 30 seconds.

I’m a little concerned about this for a production app. For example, if the app scales from 1 => 2 containers, new requests routed to the new container might get 502s.
I’ve also seen the opposite issue: when an app scales from 2 => 1 containers (e.g. during a bluegreen deployment), requests fail because they’re trying to communicate with the container that was killed.
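
For the scale-down case, one thing an app can do on its side is drain in-flight requests before exiting. Below is a minimal sketch in Python, assuming the platform sends the process a signal such as SIGTERM or SIGINT before stopping it (I haven't confirmed which one applies here). The names DrainingServer, OkHandler, request_shutdown and install_graceful_shutdown are hypothetical, made up for illustration, and this is not an official fly.io recipe.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DrainingServer(ThreadingHTTPServer):
    # Non-daemon worker threads are tracked by the server, so
    # server_close() waits for in-flight requests to finish.
    daemon_threads = False


class OkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def request_shutdown(server):
    """Ask serve_forever() to stop, without blocking the caller.

    shutdown() blocks until the serve loop exits, so it runs in a
    helper thread rather than in the thread that is serving.
    """
    t = threading.Thread(target=server.shutdown, daemon=True)
    t.start()
    return t


def install_graceful_shutdown(server, signals=(signal.SIGTERM, signal.SIGINT)):
    """Register handlers so a stop signal triggers a clean shutdown.

    Assumption: the platform delivers one of these signals before
    killing the process. Must be called from the main thread.
    """
    def handle(signum, frame):
        request_shutdown(server)

    for sig in signals:
        signal.signal(sig, handle)
    return handle


def serve(host="0.0.0.0", port=3000):
    server = DrainingServer((host, port), OkHandler)
    install_graceful_shutdown(server)
    try:
        server.serve_forever()   # returns once shutdown() is requested
    finally:
        server.server_close()    # joins threads still handling requests


# To start the server, call serve().
```

This only helps with requests that already reached the instance. Requests the proxy routes to an instance that has already stopped would still fail, and that part has to be handled on the platform side.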

Edit:
This was a relatively new error:
instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1

In my Dockerfile, I added:
ENV HOST 0.0.0.0

That error no longer appears, and the logs now show:
machine became reachable in <TIME>ms
just like they did last week.
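
For anyone else hitting the "instance refused connection" error, here is a rough sketch of what the fix amounts to, written in Python since the thread doesn't say which framework the app uses. It assumes the server reads its bind address from HOST and PORT environment variables (HOST matches the Dockerfile line above; PORT and the function name resolve_bind_address are my own assumptions) and falls back to 0.0.0.0:3000, so it accepts connections on all interfaces and not only on 127.0.0.1.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def resolve_bind_address(env):
    """Return (host, port) read from an environment-like mapping.

    Defaults to 0.0.0.0 so the proxy can reach the server from outside
    the container; binding to 127.0.0.1 only accepts local connections.
    """
    host = env.get("HOST", "0.0.0.0")
    port = int(env.get("PORT", "3000"))
    return host, port


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    host, port = resolve_bind_address(os.environ)
    server = ThreadingHTTPServer((host, port), HelloHandler)
    print(f"listening on {host}:{port}")
    server.serve_forever()


# To start the server, call main().
```

Many frameworks default to 127.0.0.1 in development mode, which is why setting the host explicitly in the Dockerfile made the error go away.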