@jerome sorry for my delay here. I moved on for a bit when the issue went away! So as far as I know there shouldn’t be a service I’m using that would be taking a long time to do handshakes or that would be aborting them.
The majority of traffic will come from a hand-rolled health checker written in Python/Tornado. If the above is happening, this seems the most likely culprit.
Is it possible to see the source for the hand-rolled health checker? Or perhaps describe what it does specifically? If it kept the connection open without completing the TLS handshake, that might’ve caused this issue.
We’ve used updown.io in the past and it’s never caused these issues so I’m assuming this is fine.
I think the way we were spawning the asynchronous task was at fault here. I learned there was no way to know if the operation had timed out (despite having logic to that effect). We’ve refactored this whole bit to make it detectable and it should now be fine. We have further optimizations to make concerning TLS handshakes which should also help.
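For anyone who hits the same thing, here is a rough sketch of what "making the timeout detectable" can look like. This is not our actual code: it uses plain asyncio rather than Tornado, and `run_with_timeout` and `fake_check` are made-up names for illustration. The idea is that the caller awaits the check through `asyncio.wait_for`, so a timeout surfaces as an explicit result instead of being lost inside a fire-and-forget task.

```python
import asyncio


async def run_with_timeout(coro, timeout):
    """Await `coro` and report whether it timed out.

    Returns (timed_out, result). On timeout, asyncio.wait_for cancels the
    underlying coroutine, so a half-finished check is not left running.
    """
    try:
        result = await asyncio.wait_for(coro, timeout=timeout)
        return False, result
    except asyncio.TimeoutError:
        return True, None


async def fake_check(delay):
    # Hypothetical stand-in for a real health check (connect, handshake, request).
    await asyncio.sleep(delay)
    return "ok"


def demo():
    async def main():
        fast = await run_with_timeout(fake_check(0.01), timeout=1.0)
        slow = await run_with_timeout(fake_check(1.0), timeout=0.05)
        return fast, slow

    return asyncio.run(main())
```

The important part is that the timeout is awaited by the caller. If the check is spawned as a background task and never awaited, a timeout (or any other failure) can go unnoticed, which is roughly what happened to us.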