URGENT: Request errors

Hi,
We are continuously getting the following errors:

2021-10-01T14:03:52.990386768Z proxy[19ce0ab4] sin [error] error.code=2 error.message="Internal problem" request.method="POST" request.url="https://[something.com]/fundreq/callbackStarPaymentApi" request.id="01FGY2MH336PZXXK56JM5XXC6V" response.status=500

We do not have any deployment in the SIN zone. Despite that, requests are going to the SIN zone, causing these issues. Please suggest.

This issue started an hour ago when we made a new deployment. Since one of the servers in the SIN zone was causing issues, we stopped the SIN zone completely and deployed to the MAA zone. Still, a lot of requests are going to the SIN zone.

This particular instance was launched in the SIN region. Do you have SIN set as a backup region? Backup regions are a bit hit or miss; our scheduler might put instances there even if the primary regions have enough space.
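If backup regions might be in play, the app's region pool can be inspected and trimmed with flyctl. A rough sketch follows; the app name `my-app` is a placeholder, and the exact subcommands vary between flyctl versions, so verify against `fly regions --help` before running:

```shell
# List the app's current primary and backup regions
fly regions list -a my-app

# Pin the region pool to MAA only (replaces the current list)
fly regions set maa -a my-app

# Replace the backup region list so the scheduler can't fall back to SIN
# (v1/Nomad-era syntax; confirm on your flyctl version)
fly regions backup maa -a my-app
```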

It looks like your instance didn't exit cleanly. In some regions, our "state" takes more time to replicate, and it's possible our proxy will still have your old instance in its state but can't find it on the host. This may last a few seconds, or sometimes a few minutes in regions further away from North America.

We’re actively working on fixing these delays. We’ve already started rolling out a fix for our private DNS server, and the proxy will soon benefit from the same improvements.

Hi Jerome,
There is no backup region now. It's only MAA. Can you do something now to get this working? Lots of our users are facing issues.

I’m looking into this now.

@ponchinchon it seems like your Singapore instances are still registered as available in our internal state. We're troubleshooting this now; normally they clear out after a few minutes (not 1+ hour). You should see the errors subside shortly.

Edit: we're tracking this on our status page. You should not be seeing errors anymore, but we're routing traffic through other regions: Fly.io Status - Rerouting around the SIN region

Things look okay now. Is there something we can do to avoid this in the future? The deployments looked okay, so we really don't know what we could have done differently.

This one's on us. There's nothing you could've done. We've fixed the network routing issues that prevented our internal state from properly replicating.

Our alert for this didn't catch the issue; we're fixing that as well.

This shouldn't happen anymore once we roll out big internal state replication changes in a few weeks. Until then, we'll have the alert fixed so we're notified faster when we need to re-route traffic. These issues should be rare to begin with.
