Hi, I’m currently seeing a significant increase in the number of requests to my app, and have scaled my app to 50 nodes to handle the throughput.
While scaling, I noticed the following:
ID        REGION  STATUS   HEALTH CHECKS       RESTARTS  CREATED
c23a0196  mia     running  2 total, 2 passing  30        11h56m ago
67276b05  mia     running  2 total, 2 passing  20        11h56m ago
430508b3  bos     running  2 total, 2 passing  0         11h45m ago
ccc4233e  sin     running  2 total, 2 passing  0         11h45m ago
17368be9  fra     running  2 total, 2 passing  0         11h45m ago
...[a number of other instances all with 0 restarts]...
It seems the nodes in the mia region are restarting far more often than the rest. This is a true micro-service with no state held at any node, so I don’t think it is application-related: the deployment should be identical to the other nodes, which are all running without any restarts.
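With 50 instances, the per-region pattern is easier to see when the restart counts are summed by region. The following is a minimal sketch, assuming each row looks like the ones pasted in this post (ID, region, status, "N total, M passing", restarts, age). The regular expression and the function name `restarts_by_region` are illustrative only; rows with a different health check summary would need a looser pattern.

```python
import re
from collections import defaultdict

# Rows copied from the status output pasted in this thread.
STATUS_ROWS = """\
c23a0196 mia running 2 total, 2 passing 30 11h56m ago
67276b05 mia running 2 total, 2 passing 20 11h56m ago
430508b3 bos running 2 total, 2 passing 0 11h45m ago
ccc4233e sin running 2 total, 2 passing 0 11h45m ago
17368be9 fra running 2 total, 2 passing 0 11h45m ago
"""

# Assumed row shape:
#   <id> <region> <status> <n> total, <m> passing <restarts> <age> ago
ROW = re.compile(
    r"^(?P<id>[0-9a-f]{8})\s+(?P<region>[a-z]{3})\s+(?P<status>\w+)\s+"
    r"\d+ total, \d+ passing\s+(?P<restarts>\d+)\s+"
)


def restarts_by_region(text):
    """Sum the RESTARTS column per region; lines that do not match are skipped."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            totals[match.group("region")] += int(match.group("restarts"))
    return dict(totals)


if __name__ == "__main__":
    by_region = restarts_by_region(STATUS_ROWS)
    # Print regions with the most restarts first.
    for region, count in sorted(by_region.items(), key=lambda kv: -kv[1]):
        print(f"{region}: {count}")
```

For the five rows shown, all 50 restarts fall in mia and every other region sums to 0.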
Thanks for replying, Kaz. I’ve deployed since then, so I’m not sure the status will still match. I’ll keep an eye on things and resurrect this thread if it happens again.
It happened again yesterday, only at the mia location, instance 1285de7f. I checked the logs but couldn’t really see anything.
The main concern is that, as I mentioned, this is a true micro-service. There is no state held at each node, there is no database, and there aren’t even any cookies. Every single instance is the same, not only because it is the same Docker image but from a runtime perspective too. The only difference I can see is that the mia region gets more requests overall than other US regions.
Looks like your instance 1285de7f was restarted because a health check failed. Unfortunately we don’t have much visibility into why health checks failed in the past, but looking at your app’s logs I see a lot of messages from our proxy:
could not make HTTP request to instance: connection closed before message completed
and
could not make HTTP request to instance: connection error: timed out
along with health check failures matching the timestamps of when the mia instance was restarted.
The proxy errors are from many regions, not just mia, but you did mention that you get more traffic in that region - is it possible your instances are failing to respond to some HTTP requests at high load?
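One way to test that hypothesis is to send a burst of concurrent requests at the app and count how many fail or time out. The sketch below uses only the Python standard library. The names `http_fetch` and `probe`, the request counts, and the 5 second timeout are all illustrative assumptions, not anything taken from the platform or the app. A probe from a single machine will also not reproduce real regional traffic, so treat the result as a rough signal only.

```python
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def http_fetch(url, timeout=5.0):
    """Fetch url once and return the HTTP status code.

    urllib raises an exception on timeouts, dropped connections,
    and 4xx/5xx responses, so callers should expect exceptions.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe(fetch, total=200, concurrency=50):
    """Call fetch() `total` times from `concurrency` threads and tally outcomes.

    Successful calls are counted under "http_<status>"; failures are
    counted under the name of the exception class that was raised.
    """
    def one(_):
        try:
            return f"http_{fetch()}"
        except Exception as exc:  # tally every failure type, do not abort the run
            return type(exc).__name__

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(one, range(total)))
```

A run could look like `probe(lambda: http_fetch(url), total=200, concurrency=50)`, where `url` is the app's own health check endpoint. If the failure count grows as `concurrency` is raised, that would point toward the instances struggling under load rather than toward something specific to the mia hosts.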