sum by (region) (rate(fly_app_http_responses_count[$__interval]))
The traffic seems to be automatically routed to other regions, so the only effect on users is a slower response time. However, this is not ideal, especially since we are paying for the machine in that region.
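As a sketch (assuming the same fly_app_http_responses_count metric and Grafana's $__interval variable), dividing the per-region rate by the overall rate turns the same data into a traffic share per region, which makes it obvious when one region drops to zero while the others absorb the load:

sum by (region) (rate(fly_app_http_responses_count[$__interval]))
  / scalar(sum(rate(fly_app_http_responses_count[$__interval])))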
When looking at the status of the app, everything looks fine:
Deployment Status
ID = ...
Version = v272
Status = successful
Description = Deployment completed successfully
Instances = 3 desired, 3 placed, 3 healthy, 0 unhealthy
Instances
ID        PROCESS  VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS       RESTARTS  CREATED
827d081b  app      272      lhr     run      running  2 total, 2 passing  0         1h54m ago
ea577808  app      272      syd     run      running  2 total, 2 passing  0         1h55m ago
b545f3a1  app      272      lax     run      running  2 total, 2 passing  0         1h56m ago
Can someone explain to me how the routing works? I can find very little information about it in the docs.
Since it looks like you have one VM per region, this behavior could also be caused by a problem with individual instances. At first glance, I’m not seeing any errors with your app’s traffic as it passes through our infra.
You might have already done this, but you can check for restarts with fly status --all and investigate instance logs with fly logs -i. This information might help you narrow things down further!
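Concretely, that would look something like this (the instance ID here is just the lhr one from the status output above):

fly status --all
fly logs -i 827d081b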
What are the circumstances that would lead to no more traffic being routed to a region?
Quite a few! Since it sounds like your instances are healthy, you may want to check whether you have a hard_limit defined. If this value is exceeded, then that instance will no longer accept traffic. With one instance per region, this would effectively re-route traffic to a different region.
This should show up in your app’s logs, though. You could rule this out by deploying a second instance in the region where you aren’t seeing any traffic.
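For reference, the concurrency limits live under the services section of fly.toml; the values below are only placeholders, so adjust them to your app:

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20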
I have seen some hard limits being hit, but I was assuming that hard_limit would just throttle the number of requests going to that instance rather than stop sending it any traffic at all.
By the way, we have switched from balanced autoscaling to standard, and since then I have not seen the issue again.
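For anyone wanting to make the same switch, the (legacy) autoscale commands look roughly like this; the min/max counts are just examples:

fly autoscale show
fly autoscale standard min=3 max=6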