[NEW] Could not proxy HTTP request. Retrying in 1000 ms

I was planning to launch a production app on Fly.io; I didn’t want to worry about EKS, Grafana, and so on at this stage.

Let’s wait for the official incident report before deciding. Either way, a contingency plan to launch somewhere else will be needed.

This is what I saw in lax/sea this morning:

Downtime started after I deployed, and stayed down after repeated re-deploys.

I periodically tried deploying or changing region through the downtime. Moving from sea to lax eventually brought it back but required manual intervention.

Thanks for this tip. Changing regions also brought my app back… after 5 hours of downtime. :grimacing:

@bcomnes, is your app multi-region?

No, not yet, but I moved regions to try to bring it back. Still pre-launch.

Hi anyone from Fly.io, can we get a post-mortem of why this happened, and how we can protect ourselves from it happening again?

For example, would being multi-region have saved some of us?
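
In the meantime, one thing that helps regardless of the root cause is an external check that runs right after each deploy, so a broken deploy is noticed in minutes rather than hours. Below is a minimal sketch of that idea, not anything Fly.io provides: `http_probe`, `wait_until_healthy`, and the health URL in the usage note are hypothetical names, and the 1-second delay is an arbitrary default.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError, timeouts and connection resets are all OSError
        # subclasses; a malformed response raises HTTPException.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay_s` between tries.

    Returns True as soon as one probe succeeds, False if all of them fail.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Run from CI or a laptop (somewhere outside the platform being checked), it could look like `wait_until_healthy(lambda: http_probe("https://my-app.example/health"))`. A `False` result is the cue to roll back or try another region by hand; it does not fix anything by itself.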


We’ll get more details once we’ve completely recovered. Everything should be good today, but we’re still digging to make sure we’ve fully diagnosed the problem. At the moment, we’re primarily focusing communications on customers with premium email support.

This specific failure would be hard to prevent. Our gossip-based service discovery had issues propagating information after deploys. Apps that weren’t deployed during the incident continued to work fine, but some percentage of deploys corrupted their service data.

The new Machines-based apps we’re shipping will be more resilient to this kind of problem, since deploys won’t churn service discovery data, but it’s not a complete fix.

