App down

One of my app instances died randomly, and I'm only getting these errors in the logs:

2022-07-28T02:18:20Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:20Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:35Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:37Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:18:49Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app
2022-07-28T02:19:02Z proxy[Na4Rxxb6] iad [error]Error: timed out while connecting to app

Sure enough, if I check the status it says that the health check for the iad location is critical:

App
  Name     = my-example-app
  Owner    = my-example-org
  Version  = 12
  Status   = running
  Hostname = my-example-app.fly.dev

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS      	RESTARTS	CREATED
Na4Rxxb6	app    	12     	iad   	run    	running	1 total, 1 critical	0       	2022-07-19T17:17:55Z
xoZ634s5	app    	12     	ewr   	run    	running	1 total, 1 passing 	0       	2022-07-01T19:56:05Z
CQslW63c	app    	12     	yyz   	run    	running	1 total, 1 passing 	0       	2022-06-08T18:16:55Z

And if I check the instance:

Instance
  ID            = Na4Rxxb6
  Process       =
  Version       = 12
  Region        = iad
  Desired       = run
  Status        = running
  Health Checks = 1 total, 1 critical
  Restarts      = 0
  Created       = 2022-07-19T17:17:55Z

Recent Events
TIMESTAMP           	TYPE      	MESSAGE
2022-07-19T17:17:44Z	Received  	Task received by client
2022-07-19T17:18:28Z	Task Setup	Building Task Directory
2022-07-19T17:19:06Z	Started   	Task started by client

Checks
ID                              	SERVICE	STATE   	OUTPUT
219dd48c285e84611f2e717kj	tcp-443	critical	dial tcp 172.19.36.162:443: i/o timeout

This seems very similar to "Any thoughts on why my app randomly died? - #2 by greg", but this app is only running nginx.

The only thing I am wondering is why the app did not automatically restart after the instance failed. It seems the default for restart_limit is 0, which means it never restarts, so I will be adding that to my config files. It still seems a bit of an odd choice, as my expectation as a developer would be that a failed instance restarts automatically.
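
For reference, the change is just the restart_limit key on the service's health check in fly.toml. A minimal sketch, assuming an nginx service listening on 443 (the port, interval, and timeout values below are illustrative placeholders, not my exact config):

app = "my-example-app"

[[services]]
  internal_port = 443        # assuming nginx listens on 443 internally, matching the tcp-443 check output
  protocol = "tcp"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  # the tcp health check shown in the status output; restart_limit defaults to 0 (never restart)
  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "1s"
    restart_limit = 5        # restart the instance after 5 consecutive failed checks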

Is there anything in fly logs? You can also use fly ssh console -s to log in to the specific instance to inspect it.

Unfortunately not, as the logs don't go far enough back and the container itself just logs to /dev/stdout. A good reminder that I need to set up persistent logging for debugging these cases (I'd love to see fly.io provide a solution for that built into the platform).

It’s the first time an nginx container has crashed in 6+ months, so I don’t really expect it to happen again. In case it does, I have changed restart_limit to 5, which should hopefully restart the container automatically (this is also how I resolved the incident). I just think 0 is not really a sane default for restart_limit, but maybe I am missing something.

We’ll take a look at the host the VM was on. Did you redeploy the app?

Yes, I just restarted it via fly restart.

I can send you the exact app name and instance IDs via email if you like.

@jsierles one thing that still trips me up thinking about this:

Why does the fly load balancer continue to send traffic to an unhealthy app?

I had kind of assumed that running my app in multiple locations would also safeguard me against a single instance failing. Or is this not how it's supposed to work?

The errors you see are the health checks timing out - not actual traffic. Does that make sense?

But the instance was still being sent traffic by the fly proxy despite not responding, which led to actual downtime.

OK - that’s certainly not the intended behavior. I’ll ask around about this and get back to you.

Hi @jascha, just checking in: are you still seeing this error message with your app?

Hey @rahmatjunaid, thanks for asking. I am of course no longer seeing it; this is a production app, and any kind of downtime is really bad. That's why I would like to understand why the fly load balancer kept sending traffic to an unhealthy instance instead of rerouting to a healthy one.