fly deploy stuck on pushing

Ours went through just now too. FWIW, this incident lasted more than an hour, and we haven’t seen anything on the status page. I first observed the failures at 1:27 PM UK time and saw them continue until 2:29 PM.

Would really like to know:

  • Why this wasn’t mentioned on the status page, even though a fix was apparently pushed
  • What the root cause was here, and what changes are being made to prevent such issues in the future
  • Why this wasn’t caught by automated monitoring, and whether changes are being made to automatically detect these issues in the future

^^ 100%. This is basic incident response, guys. Minimum bar.

Hi folks, here are a few more details I’ve collected on what happened with this issue, to the best of my current knowledge:

We developed an update to fly-proxy (our globally distributed load balancer) to fix a concurrency bug in Dynamic Request Routing that was causing issues when using fly-replay with very large HTTP request bodies.
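In case it helps to picture the mechanism: with fly-replay, the app doesn’t proxy anything itself; it answers with a `fly-replay` response header, and fly-proxy re-sends the entire request, body included, to the named target. That body re-transmission is where very large payloads come into play. A rough sketch, where the 10 MiB cutoff, the `ams` target, and the `handle_upload` helper are all made up for illustration:

```python
# Hypothetical sketch of an app using the fly-replay response header.
# The app does no routing itself: it returns the header, and fly-proxy
# replays the full request (including its body) at the named target.

MAX_LOCAL_BODY = 10 * 1024 * 1024  # made-up 10 MiB cutoff

def handle_upload(region: str, body_size: int) -> dict:
    """Return the response headers for a request of `body_size` bytes."""
    if region != "ams" and body_size > MAX_LOCAL_BODY:
        # Ask the proxy to replay this request in ams; the proxy must
        # buffer and re-send the whole body, which is the path that
        # very large HTTP request bodies exercise.
        return {"fly-replay": "region=ams"}
    return {}  # small enough (or already in ams): handle locally

print(handle_upload("cdg", 20 * 1024 * 1024))  # {'fly-replay': 'region=ams'}
```

The header itself (`fly-replay: region=...`) is the documented mechanism; everything else above is invented for the sketch.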

We deployed the update to a small number of hosts in ams (<10%) at 2024-05-07T13:50:00Z, followed by all of ams at 2024-05-08T13:20:00Z, cdg at 2024-05-10T11:20:00Z, and a broader rollout across all regions at 2024-05-10T12:30:00Z. The update was rolled back everywhere at 2024-05-10T13:30:00Z as soon as we identified the root cause of the issue.

The update had a bug that caused some large HTTP request bodies to get stuck, which mostly impacted large registry-image pushes. According to our deployment metrics, this bug was present on 10% of ams for ~72 hours, all of ams for ~48 hours, cdg for ~2 hours, and globally for ~1 hour before it was rolled back. Because the rollback was deployed immediately after we detected the issue, we didn’t post a status-page update at the time, though I’ve now posted a back-dated incident for historical purposes.
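For anyone double-checking the exposure windows above, they follow directly from the deploy and rollback timestamps in the previous post; a quick sanity check in Python:

```python
from datetime import datetime, timezone

def utc(s: str) -> datetime:
    """Parse a Zulu-suffixed ISO timestamp as an aware UTC datetime."""
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

rollback = utc("2024-05-10T13:30:00Z")

deploys = {
    "10% of ams": utc("2024-05-07T13:50:00Z"),
    "all of ams": utc("2024-05-08T13:20:00Z"),
    "cdg":        utc("2024-05-10T11:20:00Z"),
    "global":     utc("2024-05-10T12:30:00Z"),
}

for scope, deployed in deploys.items():
    hours = (rollback - deployed).total_seconds() / 3600
    print(f"{scope}: ~{hours:.1f}h exposed")
# 10% of ams: ~71.7h, all of ams: ~48.2h, cdg: ~2.2h, global: ~1.0h
```

Which matches the ~72h / ~48h / ~2h / ~1h figures quoted above.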

The bug presented a novel failure mode and impacted such a small fraction of registry requests overall that it didn’t trip any of our existing proxy or registry monitoring, and we didn’t link the builder issues to the proxy update until we rolled the update out more broadly. When we initially investigated the deploy issue in ams, we moved app builders out of the region as a workaround, suspecting upstream network issues with S3; that turned out to be incorrect and unrelated to the root cause.

Moving forward, we’ve identified a metric sensitive enough to detect this particular failure mode in the registry (some of the push failures returned a rare 499 response code), which should help us detect regressions like this one and prevent them in the future.
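To illustrate the idea (this is not our actual monitoring code): a check keyed to a rare status like 499 can fire on a handful of occurrences even when the overall error rate barely moves, which is exactly the gap the broader monitoring missed. The threshold and helper name below are made up:

```python
from collections import Counter

RARE_STATUS = 499  # "client closed request"; rare enough that a burst is a signal
THRESHOLD = 3      # made-up: a few 499s in one window is worth an alert

def should_alert(status_codes: list[int]) -> bool:
    """Flag a window of responses if the rare status shows up repeatedly.

    An overall-error-rate alert would stay quiet here: 3 failures out of
    1003 requests is ~0.3%. Counting the rare code directly still fires.
    """
    return Counter(status_codes)[RARE_STATUS] >= THRESHOLD

print(should_alert([200] * 1000 + [499, 499, 499]))  # True
print(should_alert([200] * 1000 + [500]))            # False
```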

Sorry for the interruption, and I hope these extra details provide helpful context.


Hi, thanks for the constructive feedback and information on this one. It sounds like it’s been a hard time for you and the team, so thanks for pointing all this out and explaining it. :hugs:


Appreciate the update - thank you!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.