Inconsistent server reliability

Hey. I’ve been working on the deployment for an app over the weekend, and I’ve been seeing really weird behaviour and fairly random reliability with it.

The app seems to start timing out pretty frequently. It will be working fine, and then after another deploy I’ll start seeing the timeouts. The root of the app redirects to Google single sign-on, which works, but then the redirect back to the app times out. It never seems to get stuck on that first redirect.

The app is a Python Flask project deployed as a Docker image. I’ve got gunicorn running with two workers on a 1x dedicated CPU instance with 2 GB of RAM.
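
For reference, the gunicorn setup is roughly equivalent to the sketch below. Only the worker count comes from my actual setup; the bind address, timeout, and logging values are assumptions for illustration, not copied from the real config.

```python
# gunicorn.conf.py (illustrative sketch, not the real config file)

# Assumed listen address; the port must match whatever the platform routes to.
bind = "0.0.0.0:8080"

# Two worker processes, as described above.
workers = 2

# Seconds a worker may stay silent before gunicorn restarts it.
# 30 is gunicorn's default; shown here only to make it explicit.
timeout = 30

# Send access and error logs to stdout/stderr so they appear in the
# container logs ("-" is how gunicorn spells the standard streams).
accesslog = "-"
errorlog = "-"
loglevel = "info"
```

With access logging on, a request that times out without ever appearing in the gunicorn log would suggest it never reached the container.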

I’m sure it’s not the code, as it will typically start working fine again after the next deployment.

It’s very confusing and a bit concerning, as I was planning to hand it over to the client but don’t wish to do so until it’s reliable.

Can anyone from the fly.io team look into this for me?

I looked into this a little bit. These kinds of issues are pretty concerning.

I believe your app was running on a node that had some trouble last week, and it looks like we forgot to restart one of the crucial services after we fixed the issue.

I see your instance is now on a different server, so I expect things are better?

I think this server was unreachable from some of our proxies.

At least that’s my current theory.