I have one of my Fly services running a health checker against all my other Fly services. This evening, starting at about 10:51pm, it started getting a bunch of 599s and I got paged. These are multiple separate and mostly independent services, though they do share some code.
At the same time, from what I can tell, a mostly identical health checker running outside of Fly (on Digital Ocean) was not getting 599s, and updown.io reported no outages or 599s on these endpoints during this period.
Is there anything I should take away from this? / Can I help debug? Thanks!
Edit: originally I thought the issue stopped at 11:10pm, but it appears to still be happening as of now (11:26pm Pacific).
The plot thickens! The health checker basically has a list of URLs that it will GET or POST to on some frequency, based on a configuration. If it gets 200s, then all is well, obviously. When it gets any 400s or 500s, it pages me.
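For reference, the core loop is roughly like this. This is a simplified sketch, not the real code; the `CHECKS` config and `page_me()` pager are hypothetical stand-ins:

```python
# Simplified sketch of the health-check loop (not the actual implementation).
# CHECKS and page_me() are hypothetical stand-ins for the real config and pager.
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

CHECKS = [
    {"method": "GET", "url": "https://service-a.example.com/health", "interval_s": 60},
    {"method": "POST", "url": "https://service-b.example.com/ping", "interval_s": 60},
]

def page_me(check, code):
    print(f"PAGE: {check['method']} {check['url']} -> {code}")

async def check_forever(check):
    client = AsyncHTTPClient()
    while True:
        # raise_error=False returns a response object even for 4xx/5xx/599
        response = await client.fetch(
            check["url"],
            method=check["method"],
            body=b"" if check["method"] == "POST" else None,
            raise_error=False,
        )
        if response.code != 200:
            page_me(check, response.code)
        await gen.sleep(check["interval_s"])

if __name__ == "__main__":
    loop = IOLoop.current()
    for check in CHECKS:
        loop.spawn_callback(check_forever, check)
    loop.start()
```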
The servers are running Python / Tornado. Upon looking a bit at the logs, it looks like the 599 is reported by Tornado's HTTP client on the health check server after a timeout, rather than being explicitly returned by Fly. This is the error I'm seeing: "tornado.simple_httpclient.HTTPTimeoutError: Timeout during request"
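For anyone reading along: 599 isn't something that goes over the wire; Tornado's client uses it for client-side failures like timeouts. A quick illustration of how that surfaces (the URL and the deliberately tiny timeout are just for demonstration):

```python
# 599 / HTTPTimeoutError is generated client-side by Tornado's simple_httpclient
# when a request exceeds request_timeout; the server never returns it.
from tornado.httpclient import AsyncHTTPClient, HTTPRequest
from tornado.ioloop import IOLoop
from tornado.simple_httpclient import HTTPTimeoutError

async def main():
    client = AsyncHTTPClient()
    # Hypothetical slow endpoint with an unrealistically small timeout,
    # just to force the failure mode.
    request = HTTPRequest("https://slow-endpoint.example.com/health",
                          request_timeout=0.5)
    try:
        await client.fetch(request)
    except HTTPTimeoutError as err:
        print(err.code, err)  # 599, "HTTP 599: Timeout ..."

IOLoop.current().run_sync(main)
```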
My interpretation right now is that there's some sort of networking issue causing timeouts. I'll also look into whether there's an application error on my end, though the fact that things broke and un-broke simultaneously across services suggests that may not be the cause.
Okay, so there is a theoretical issue where too many simultaneous outgoing HTTP requests could cause queueing of requests and then 599s with Tornado. This would be more likely to happen on Fly than on my other systems because of the constrained instance size I picked. That said, the load on the instance seemed really low, so this doesn't seem particularly likely.
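For context on why this is possible at all: as far as I can tell, Tornado's simple_httpclient only allows `max_clients` requests in flight at once (the default is 10) and queues the rest, and a queued request can still hit its timeout and come back as a 599 without the target server ever seeing it. A sketch of raising that limit (the exact number and where you configure it are illustrative, not my actual setup):

```python
# Raising Tornado's simultaneous-connection cap. By default the
# simple_httpclient allows 10 concurrent requests and queues the rest;
# queued requests can still time out and surface as 599s.
from tornado.httpclient import AsyncHTTPClient

# Configure before the first AsyncHTTPClient() instance is created
# in the process. 50 is an arbitrary example value.
AsyncHTTPClient.configure(None, max_clients=50)

client = AsyncHTTPClient()  # now allows up to 50 in-flight requests
```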
I've upped the simultaneous connection limit and re-deployed. Given that this issue only affected the health checker running on Fly, didn't affect my other health checks, and didn't affect the health of my system overall, I'd be fine running this for a few days and reporting back. I still suspect it's not an issue on my end, but I'm also fine waiting a bit on the investigation given the low impact at present.
Have you seen anything weird in the last few days? When things hit connection limits (internal to the VM), it doesn’t always correlate to extra memory or CPU usage. It’s one of those things you end up needing internal metrics to really nail down if it happens over and over.