Intermittent Fly Proxy error causing short periods of downtime on prod

I’m not seeing anything on https://status.flyio.net/, so I’m sharing here. Two separate apps that we have deployed each went down for about 2 minutes: one around midnight last night (EDT) and the other just a few minutes ago, at 7:20pm EDT. Other than a 502 Bad Gateway when navigating to the URLs, this error in the logs is the only indication we have:

Sep 21 19:19:18 dc753fab vector Error Error: error while making HTTP request to app: connection error: connection reset

I suspect it has something to do with the Fly proxy, since I’m seeing event.provider=proxy in the log metadata object:

{
  "event": {
    "provider": "proxy"
  },
  "fly": {
    "app": {
      "instance": "dc753fab",
      "name": "faq"
    },
    "region": "yyz"
  },
  "host": "392f",
  "log": {
    "level": "error"
  }
}

Worth noting that the app is deployed to iad, despite the log showing it came from yyz.
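
For anyone wanting to sift through exported logs for the same pattern, here is a minimal sketch in Python. It assumes each log entry's metadata has been exported as a JSON object shaped like the one above; the function names `is_proxy_error` and `region_mismatch` are made up for illustration and are not part of any Fly tooling.

```python
import json


def is_proxy_error(meta):
    """True when the metadata marks an error-level event emitted by the proxy."""
    provider = meta.get("event", {}).get("provider")
    level = meta.get("log", {}).get("level")
    return provider == "proxy" and level == "error"


def region_mismatch(meta, deployed_region):
    """True when the region in the log differs from where the app is deployed."""
    return meta.get("fly", {}).get("region") != deployed_region


# The metadata object from the post, used here as sample input.
sample = json.loads("""
{
  "event": {"provider": "proxy"},
  "fly": {
    "app": {"instance": "dc753fab", "name": "faq"},
    "region": "yyz"
  },
  "host": "392f",
  "log": {"level": "error"}
}
""")

if is_proxy_error(sample) and region_mismatch(sample, "iad"):
    print("proxy error logged from a region other than the deployed one")
```

The region check takes the deployed region as an argument because the log metadata alone doesn't say where the app is actually running.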

Edit: It may be of interest that there’s a pretty big CPU spike around the time of the outage.

We found a memory leak in our proxy in a codepath that recently began getting called more often. This resulted in some instances becoming inaccessible for brief periods when the proxy would run out of memory.

We deployed a fix for the memory leak today, which resolves this issue.

Sorry for the trouble!

I guess the real question is… how are you keeping tabs (in real time?) on a mere 2 minutes of downtime? Is the app running on a single VM and/or in a single region?

30-second health checks using updown.io

Once LiteFS supports WAL mode, we may be able to deploy multiple instances; for now we’re limited to one that’s volume-bound.
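
For context on the 30-second checks mentioned above, the idea can be approximated with a small polling loop. This is not updown.io's API or implementation, just an assumed hand-rolled equivalent using Python's standard library; `probe`, `is_healthy`, and `monitor` are hypothetical names. At a 30-second interval, a 2-minute outage would span roughly four consecutive checks, which is why it gets noticed.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Return the HTTP status code, or None if no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 502.
        return exc.code
    except OSError:
        # Covers URLError, timeouts, and connection resets.
        return None


def is_healthy(status):
    """Treat any 2xx or 3xx response as up; everything else as down."""
    return status is not None and 200 <= status < 400


def monitor(url, interval=30.0, max_checks=None,
            probe_fn=probe, sleep_fn=time.sleep):
    """Yield (status, healthy) once per check, sleeping between checks."""
    checks = 0
    while max_checks is None or checks < max_checks:
        status = probe_fn(url)
        yield status, is_healthy(status)
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep_fn(interval)
```

To try it, iterate over `monitor("https://your-app.example/", max_checks=4)` (placeholder URL) and alert whenever `healthy` is false. `probe_fn` and `sleep_fn` are injectable so the loop can be exercised without network access or real delays.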
