Odd number of restarts in a single region

Hi, I’m currently seeing a significant increase in the number of requests to my app, and have scaled my app to 50 nodes to handle the throughput.

While scaling, I noticed the following:

ID          REGION   HEALTH CHECKS                 RESTARTS    CREATED
c23a0196    mia      running 2 total, 2 passing    30          11h56m ago
67276b05    mia      running 2 total, 2 passing    20          11h56m ago
430508b3    bos      running 2 total, 2 passing    0           11h45m ago
ccc4233e    sin      running 2 total, 2 passing    0           11h45m ago
17368be9    fra      running 2 total, 2 passing    0           11h45m ago
...[a number of other instances all with 0 restarts]...

It seems nodes in the mia region are restarting far too often. This is a true micro-service: no state is held at each node, so I don't think it's anything application-related. The deployment should be identical to the other nodes, which are all running without any restarts.

Is this a known issue in the mia region?

Can you check fly status --all and fly logs? It seems your app's health check failed multiple times.
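For reference, those checks can be run like this (a sketch; flag names are per the flyctl CLI, so verify against fly logs --help on your version):

```shell
# List every instance, including replaced and failed ones,
# so restart counts and unhealthy allocations are visible.
fly status --all

# Tail logs filtered to the region where the restarts happen.
fly logs --region mia
```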

Thanks for replying, Kaz. I've deployed since then, so I'm not sure the status will match now. I'll keep an eye on things and resurrect this thread if it happens again.

Hi Kaz,

It happened again yesterday, only at the mia location, instance 1285de7f. I checked the logs but couldn’t really see anything.

The main concern is that, as I mentioned, this is a true micro-service. There is no state held at each node, there is no database, there aren't even any cookies. Every single instance is identical, not only because they run the same Docker image but also from a runtime perspective. The only difference I can see is that the mia region gets more requests overall than the other US regions.

Nothing in the logs at all? If it's load-related, it could be memory pressure / the OOM killer.

No, nothing that would indicate a reason for restarting. I’m going to keep an eye on it today to try and “catch it in the act”.

I don't think it's an OOM issue; according to the metrics, memory usage is pretty stable. Also, this only happens in the mia region.

Looks like your instance 1285de7f was restarted because a health check failed. Unfortunately we don't have much visibility into why health checks failed in the past :confused: , but looking at your app's logs, I see a lot of messages from our proxy:

could not make HTTP request to instance: connection closed before message completed

and

could not make HTTP request to instance: connection error: timed out

along with health check failures matching the timestamps of when the mia instance was restarted.

The proxy errors come from many regions, not just mia, but you did mention that you get more traffic in that region - is it possible that your instances are failing to respond to some HTTP requests under high load?
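The thread doesn't say what language the app is written in, but here is a minimal sketch in Python of this class of failure and one guard against it: if a server handles connections one at a time, a burst of slow requests can starve the health check endpoint long enough for the proxy's check to time out and the instance to be restarted. A threaded server keeps the health check responsive even while another request is slow (the /health and /work paths and the 0.5s delay are illustrative, not from the original app):

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Respond immediately; never do real work on the health path.
            body = b"ok"
        else:
            # Simulate a slow application request under load.
            time.sleep(0.5)
            body = b"done"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo quiet.
        pass


# Port 0 asks the OS for any free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fire a slow request, then check that /health still answers quickly:
# each connection gets its own thread, so the slow request can't block it.
slow = threading.Thread(
    target=urllib.request.urlopen, args=(f"http://127.0.0.1:{port}/work",)
)
slow.start()
t0 = time.monotonic()
with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
    health_status = resp.status
health_latency = time.monotonic() - t0
slow.join()
server.shutdown()
```

With a single-threaded server (plain HTTPServer) the same health request would wait behind the 0.5s /work request, which is exactly the window in which a proxy-side health check can time out.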

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.