App down

jascha · July 28, 2022, 2:32am

One of my app instances died randomly. The only errors I am getting in the logs are these:

2022-07-28T02:18:20Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:20Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:35Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:37Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:49Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:19:02Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app

Sure enough, if I check the status, it says that the health check for the iad location is critical:

App
  Name     = my-example-app
  Owner    = my-example-org
  Version  = 12
  Status   = running
  Hostname = my-example-app.fly.dev

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS      	RESTARTS	CREATED
Na4Rxxb6	app    	12     	iad   	run    	running	1 total, 1 critical	0       	2022-07-19T17:17:55Z
xoZ634s5	app    	12     	ewr   	run    	running	1 total, 1 passing 	0       	2022-07-01T19:56:05Z
CQslW63c	app    	12     	yyz   	run    	running	1 total, 1 passing 	0       	2022-06-08T18:16:55Z

And if I check the instance:

Instance
  ID            = Na4Rxxb6
  Process       =
  Version       = 12
  Region        = iad
  Desired       = run
  Status        = running
  Health Checks = 1 total, 1 critical
  Restarts      = 0
  Created       = 2022-07-19T17:17:55Z

Recent Events
TIMESTAMP           	TYPE      	MESSAGE
2022-07-19T17:17:44Z	Received  	Task received by client
2022-07-19T17:18:28Z	Task Setup	Building Task Directory
2022-07-19T17:19:06Z	Started   	Task started by client

Checks
ID                              	SERVICE	STATE   	OUTPUT
219dd48c285e84611f2e717kj	tcp-443	critical	dial tcp 172.19.36.162:443: i/o timeout

This seems very similar to "Any thoughts on why my app randomly died?" (#2 by greg), but this app is only running nginx.

The only thing I am wondering: why did the app not restart automatically when an instance failed? It seems that the default for restart_limit is 0, which means it never restarts. So I will be adding that to my config files. It still seems a bit of an odd choice, as my expectation as a developer would be that a failed instance restarts automatically.

jsierles · July 28, 2022, 7:07am

Is there anything in fly logs? You can also use fly ssh console -s to log in to the specific instance to inspect it.

jascha · July 28, 2022, 7:33am

Unfortunately not, as the logs don’t go far enough back and the container itself just logs to /dev/stdout. A good reminder that I have to set up persistent logging for debugging these cases (would love to see fly.io provide a solution for that built into the platform).

It’s the first time an nginx container has crashed in 6+ months, so I don’t really expect it to happen again. If it does, I have changed the restart_limit to 5, which hopefully restarts the container (this was also how I solved it). I just think 0 is not really a sane default for restart_limit, but maybe I am missing something.
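
For anyone finding this later, here is a rough sketch of where the restart_limit setting mentioned above might sit in fly.toml, assuming the service uses a TCP check. Only restart_limit = 5 comes from this thread. The port is inferred from the tcp-443 check in the status output, and the interval, timeout, and grace period values are illustrative assumptions, so check them against your own config.

```toml
# Fragment of fly.toml, not a complete services section.
[[services]]
  internal_port = 443    # assumption: inferred from the tcp-443 check above
  protocol = "tcp"

  [[services.tcp_checks]]
    interval = "15s"     # assumed value: how often the check runs
    timeout = "2s"       # assumed value: how long to wait for the connection
    grace_period = "1s"  # assumed value: delay before the first check
    restart_limit = 5    # from this thread: 0 means failed checks never trigger a restart
```

As far as I understand it, restart_limit counts consecutive failed checks before a restart is attempted, so a single slow response should not be enough to trigger one.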

jsierles · July 28, 2022, 11:52am

We’ll take a look at the host the VM was on. Did you redeploy the app?

jascha · July 28, 2022, 12:23pm

Yes, I just restarted it via fly restart.

I can send you the exact app name and instance IDs via email if you like

jascha · July 31, 2022, 8:16am

@jsierles one thing that still trips me up thinking about this:

Why does the fly load balancer continue to send traffic to an unhealthy app?

I kind of thought that having my app run in multiple locations also safeguards me against a single instance failing. Or is this not how it’s supposed to work?

jsierles · July 31, 2022, 3:45pm

The errors you see are the health checks timing out - not actual traffic. Does that make sense?

jascha · August 1, 2022, 2:36am

But the instances were being served traffic from the fly proxy despite not responding, which led to actual downtime.

jsierles · August 1, 2022, 10:28am

OK - that’s certainly not the intended behavior. I’ll ask around about this and get back to you.

rahmatjunaid · August 5, 2022, 4:18pm

Hi @jascha, just checking in: are you still seeing this error message with your app?

jascha · August 5, 2022, 5:51pm

Hey @rahmatjunaid, thanks for asking. I am of course no longer seeing it; this is a production app, and any kind of downtime is really bad. That’s why I would like to understand why the fly load balancer kept sending traffic to an unhealthy instance instead of rerouting to a healthy one.
