App suddenly getting 599s

@jerome sorry for my delay here. I moved on for a bit when the issue went away! So as far as I know there shouldn’t be a service I’m using that would be taking a long time to do handshakes or that would be aborting them.

  1. The majority of traffic will come from a hand-rolled health checker written in Python/Tornado. If the above is happening, this seems the most likely culprit.
  2. updown.io
  3. The actual app. While this is the primary “use-case” and the service itself is critical, the actual traffic from real users is quite low.

If you have any ideas on how to investigate, especially for #1, I’d be curious to take a look.

Is it possible to see the source for the hand-rolled health checker? Or perhaps describe what it does specifically? If it keeps the connection open without completing the TLS handshake, that might have caused this issue.

We’ve used updown.io in the past and it’s never caused these issues so I’m assuming this is fine.

I think the way we were spawning the asynchronous task was at fault here. I learned there was no way to know if the operation had timed out (despite having logic to that effect). We’ve refactored this whole bit to make it detectable and it should now be fine. We have further optimizations to make concerning TLS handshakes which should also help.
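For anyone hitting the same thing, here is a minimal sketch of what "detectable" can look like. It is not the actual checker code: it uses plain `asyncio` rather than Tornado, and `run_check`, `_fast_check`, `_slow_check` and `demo` are hypothetical names made up for illustration. The idea is to await the task under `asyncio.wait_for`, so a timeout surfaces as an exception the caller can see (and the pending task is cancelled), instead of spawning the task and never learning how it ended.

```python
import asyncio


async def run_check(check, timeout):
    """Run one health-check coroutine and report how it ended.

    `check` is any zero-argument coroutine function (hypothetical here).
    """
    try:
        # wait_for cancels the inner task when the timeout expires and
        # raises asyncio.TimeoutError, so the timeout cannot go unnoticed.
        await asyncio.wait_for(check(), timeout=timeout)
    except asyncio.TimeoutError:
        return "timeout"
    except Exception as exc:
        # Any other failure (refused connection, TLS error, ...) is
        # reported rather than swallowed.
        return f"error: {exc!r}"
    return "ok"


async def _fast_check():
    # Stand-in for a check that completes promptly.
    await asyncio.sleep(0)


async def _slow_check():
    # Stand-in for a check stuck on a slow handshake.
    await asyncio.sleep(10)


def demo():
    async def main():
        return [
            await run_check(_fast_check, timeout=0.5),
            await run_check(_slow_check, timeout=0.05),
        ]

    return asyncio.run(main())
```

Calling `demo()` runs one check that finishes and one that exceeds its deadline; the second comes back as `"timeout"` instead of hanging silently. In a real checker the stand-ins would be replaced by the actual request, with the connection closed on every path.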

Sure, I’ll make a slimmed down version and send it over. What’s the best email to use for you?

support @ our domain works :slight_smile: you might get an automated response but we’ll see it!