Traffic not being routed to some regions

We have just started using Fly for one of our apps. The site seems stable, but every now and then traffic stops being routed to one region. We can see the drop-off in our Grafana dashboard with this query:

sum by (region) (rate(fly_app_http_responses_count[$__interval]))

The traffic seems to be automatically routed to other regions, so the only effect on users is a slower response time. However, this is not ideal, especially since we are paying for the machine in that region.
When looking at the status of the app, everything looks fine:

Deployment Status
  ID          = ...
  Version     = v272
  Status      = successful
  Description = Deployment completed successfully
  Instances   = 3 desired, 3 placed, 3 healthy, 0 unhealthy

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS     	RESTARTS	CREATED
827d081b	app    	272    	lhr   	run    	running	2 total, 2 passing	0       	1h54m ago
ea577808	app    	272    	syd   	run    	running	2 total, 2 passing	0       	1h55m ago
b545f3a1	app    	272    	lax   	run    	running	2 total, 2 passing	0       	1h56m ago

Can someone explain to me how the routing works? I can find very little information about it in the docs.

Since it looks like you have 1 VM per region, this behavior could also be caused by a problem with individual instances. At first glance, I’m not seeing any errors with your app’s traffic as it passes through our infra.

You might have already done this, but you can check for restarts with fly status --all and investigate instance logs with fly logs -i. This information might help you narrow things down further!

Thanks for looking into this.
I can’t see anything wrong in the logs, and there were also no restarts for any of the instances.

What are the circumstances that would lead to no more traffic being routed to a region?

What are the circumstances that would lead to no more traffic being routed to a region?

Quite a few! Since it sounds like your instances are healthy, you may want to check if you have a hard_limit defined. If this value is exceeded, then that instance would no longer accept traffic. With one instance per region, this would effectively re-route traffic to a different region.

This should show up in your app’s logs, though. You could rule this out by deploying a second instance in the region where you aren’t seeing any traffic.
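For reference, concurrency limits live in the [services.concurrency] section of fly.toml. The values below are purely illustrative, not a recommendation for your app:

```toml
# Illustrative fly.toml fragment -- tune the limits to your app's capacity.
[services.concurrency]
  type = "connections"   # count concurrent connections ("requests" is the other option)
  soft_limit = 20        # above this, the proxy prefers other instances
  hard_limit = 25        # at this, the instance stops receiving new traffic
```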

Thanks this is helpful.

I have seen some hard limits being hit, but I had assumed that the hard_limit would just throttle the number of requests going to that instance, not stop sending it any traffic at all.
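To make the distinction concrete, here is a rough sketch (with made-up names, not Fly’s actual proxy code) of how a load balancer might treat the two limits: soft_limit only deprioritizes an instance, while hard_limit removes it from consideration entirely, which is why a single saturated instance can look like a whole region has gone dark:

```python
# Hypothetical sketch of soft_limit vs hard_limit routing behavior.
# Names and selection logic are illustrative assumptions, not Fly's code.
from dataclasses import dataclass

@dataclass
class Instance:
    region: str
    active: int       # current in-flight requests
    soft_limit: int
    hard_limit: int

def pick_instance(instances, preferred_region):
    # Instances at or above hard_limit receive no traffic at all.
    eligible = [i for i in instances if i.active < i.hard_limit]
    if not eligible:
        return None
    # Prefer the nearest region while it is still under its soft_limit...
    local = [i for i in eligible
             if i.region == preferred_region and i.active < i.soft_limit]
    if local:
        return local[0]
    # ...otherwise fall back to the least-loaded eligible instance anywhere.
    return min(eligible, key=lambda i: i.active)

fleet = [
    Instance("lhr", active=25, soft_limit=20, hard_limit=25),  # at hard_limit
    Instance("syd", active=3, soft_limit=20, hard_limit=25),
]
print(pick_instance(fleet, "lhr").region)  # lhr is full, so syd gets the request
```

With one instance per region, the hard-limited lhr instance simply drops out of the pool, and all lhr-bound traffic lands elsewhere.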

By the way, we have switched from balanced autoscaling to standard, and since then I have not seen the issue again.
