Intermittent Fly Proxy error causing short periods of downtime on prod

pheuter · September 21, 2022, 11:23pm

Not seeing anything on https://status.flyio.net/, so sharing here. Two separate apps that we have deployed each went down for about 2 minutes: one around midnight last night (EDT) and the other just a few minutes ago, at 7:20pm EDT. This error in the logs is the only indication we have, other than getting a 502 Bad Gateway when navigating to the URLs:

Sep 21 19:19:18 dc753fab vector Error Error: error while making HTTP request to app: connection error: connection reset

I suspect it must have something to do with the fly proxy since I’m seeing event.provider=proxy in the log metadata object:

{
  "event": {
    "provider": "proxy"
  },
  "fly": {
    "app": {
      "instance": "dc753fab",
      "name": "faq"
    },
    "region": "yyz"
  },
  "host": "392f",
  "log": {
    "level": "error"
  }
}

Worth noting that the app is deployed to iad, despite the log showing it came from yyz.
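
For anyone wanting to flag these entries automatically, here is a minimal sketch that checks a log metadata object like the one above. The helper names `is_proxy_event` and `region_mismatch` are made up for illustration, and it assumes the metadata arrives as JSON with the same shape as the sample; it is not part of any Fly.io tooling.

```python
import json


def is_proxy_event(record: dict) -> bool:
    """Return True when the metadata names the proxy as the event provider."""
    return record.get("event", {}).get("provider") == "proxy"


def region_mismatch(record: dict, deployed_region: str) -> bool:
    """Return True when the log's region differs from the app's deployed region."""
    return record.get("fly", {}).get("region") != deployed_region


# Sample copied from the log metadata object in this post.
raw = """
{
  "event": {"provider": "proxy"},
  "fly": {
    "app": {"instance": "dc753fab", "name": "faq"},
    "region": "yyz"
  },
  "host": "392f",
  "log": {"level": "error"}
}
"""

record = json.loads(raw)
print(is_proxy_event(record))          # True
print(region_mismatch(record, "iad"))  # True: logged from yyz, deployed to iad
```

A mismatch like this only says the log line was emitted in a different region than the app runs in; it does not by itself say where the fault is.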

Edit: It may be of interest that there was a pretty big CPU spike around the time of the outage.

tvdfly · September 22, 2022, 9:52pm

We found a memory leak in our proxy in a codepath that recently began getting called more often. This resulted in some instances becoming inaccessible for brief periods when the proxy would run out of memory.

We deployed a fix for the memory leak today, which fixes this issue.

Sorry for the trouble!

ignoramous · October 24, 2022, 10:03pm

I guess the real question is… how are you keeping tabs (real-time?) on merely 2 mins of downtime? Is the app running on a single VM and/or in a single region?

pheuter · October 24, 2022, 10:18pm

30-second health checks using updown.io

Once LiteFS supports WAL mode we may be able to deploy multiple instances; for now we’re limited to one that’s volume-bound.
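
To show how 30-second checks surface an outage of roughly 2 minutes, here is a rough sketch that groups consecutive failed checks into downtime windows. The `downtime_windows` function and the sample data are hypothetical; this is not how updown.io works internally, just the general idea of periodic polling.

```python
from typing import Iterable, List, Tuple

# (timestamp in seconds, did the check succeed?)
Sample = Tuple[float, bool]


def downtime_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Group consecutive failed checks into (first_failure, recovery) windows."""
    windows: List[Tuple[float, float]] = []
    start = None  # timestamp of the first failure in the current outage
    last = None   # timestamp of the most recent sample seen
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                   # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends at first successful check
            start = None
        last = ts
    if start is not None:
        windows.append((start, last))    # still failing at the last sample
    return windows


# Hypothetical checks every 30 seconds: four failures in a row.
samples = [(0, True), (30, False), (60, False), (90, False), (120, False), (150, True)]
print(downtime_windows(samples))  # [(30, 150)]
```

With a 30-second interval the measured window here is 120 seconds, but the real outage could have started or ended up to about one interval away from the recorded edges, so short outages are only caught approximately.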
