599s from Fly to Fly this evening

Hey Fly friends,

I have one of my Fly services running a health checker on all my other Fly services. This evening, starting at about 10:51pm, it started getting a bunch of 599s and I got paged. These are multiple separate and mostly independent services, though they do share some code.

At the same time, from what I can tell, a mostly identical health checker outside of Fly (Digital Ocean) was not getting 599s. And also updown.io reported no outages / 599s during this period on these endpoints.

Is there anything I should take away from this? / Can I help debug? Thanks!

Edit: originally I thought the issue stopped at 11:10pm, but it appears to still be an issue as of now (11:26pm Pacific).

Our proxy does not respond with a 599 status code, currently. We only respond with 502s and sometimes a 503.

This leads me to think the issue is somewhere outside of Fly’s logic.

Can you share more details? What does the health checker check, more precisely? Any other information that might help us help you :slight_smile:

The plot thickens! The health checker basically has a list of URLs that it will GET or POST at some frequency based on a configuration. If it gets 200s, then all is well, obviously. When it gets any 4xx or 5xx, it’ll page me.

The servers are running Python / Tornado. Looking a bit at the logs, it seems the 599 is reported by the Tornado networking library on the health check server after a timeout, and isn’t explicitly returned by Fly. This is the error I’m seeing: “tornado.simple_httpclient.HTTPTimeoutError: Timeout during request”
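Since a 599 here means the client never received an HTTP response, it can help to record it separately from real 4xx/5xx responses. A minimal sketch of that split, where the function name and category labels are hypothetical and not taken from the actual health checker:

```python
# 599 is the code Tornado's HTTP client attaches to errors where no HTTP
# response was received (for example, a timeout). It is not sent by a server.
NO_RESPONSE = 599


def classify(status: int) -> str:
    """Map a status code seen by the health checker to a coarse category."""
    if status == NO_RESPONSE:
        return "no_response"  # timeout or connection failure on the client side
    if status == 200:
        return "ok"
    if 400 <= status < 600:
        return "http_error"  # the server (or a proxy) actually answered
    return "other"
```

With this split, a burst of `no_response` results points at the network or the checker itself, while `http_error` results such as 502 or 503 point at the proxy or the app.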

Two of the fly-hosted URLs affected:

My interpretation right now is there’s some sort of networking issue that is causing timeouts. I’ll also look into there being an application error on my end, though the simultaneity of these breaking and un-breaking suggests that may not be the cause.

p.s. I have one more theory that this may be a tornado bug regarding connection limits. Before you get too deep digging here, let me check that!

Okay, so there is a theoretical issue where too many simultaneous outgoing HTTP requests could cause requests to queue and then produce 599s with Tornado. This would be more likely to happen on Fly than on my other systems because of the constrained instance size I picked. That said, the load on the instance seemed really low, so this doesn’t seem particularly likely.
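The queueing theory can be illustrated with a small standard-library simulation. This is only a sketch of the general effect, not Tornado's implementation: the semaphore stands in for the client's connection limit, `asyncio.sleep` stands in for the HTTP round trip, and all names and timings are made up for the example.

```python
import asyncio

NO_RESPONSE = 599  # used here only as a label for "timed out, no response"


async def fetch(slots, work_seconds, timeout_seconds):
    """Return 200, or 599 if the deadline passes. Queue wait counts toward it."""

    async def queued_request():
        async with slots:  # wait for a free client slot
            await asyncio.sleep(work_seconds)  # stand-in for the HTTP round trip
        return 200

    try:
        return await asyncio.wait_for(queued_request(), timeout_seconds)
    except asyncio.TimeoutError:
        return NO_RESPONSE


async def run_checks(n_requests, max_clients, work_seconds, timeout_seconds):
    slots = asyncio.Semaphore(max_clients)  # hypothetical connection limit
    checks = [fetch(slots, work_seconds, timeout_seconds) for _ in range(n_requests)]
    return await asyncio.gather(*checks)


def simulate(n_requests, max_clients, work_seconds=0.1, timeout_seconds=0.25):
    return asyncio.run(run_checks(n_requests, max_clients, work_seconds, timeout_seconds))
```

With six requests and a limit of two, `simulate(6, 2)` should time out the last two requests even though every endpoint is "healthy", while `simulate(6, 6)` should return all 200s. In Tornado the corresponding knob is the `max_clients` argument, which can be set with `AsyncHTTPClient.configure(None, max_clients=...)`. One caveat: as far as I know, Tornado's simple client reports timeouts that occur while waiting in its queue with a different message ("Timeout in request queue") than the "Timeout during request" seen here, so the logs may help tell the two cases apart.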

I’ve upped the simultaneous connection limit and re-deployed. Given that this issue only affected Fly health checks, didn’t affect my other health checks, and didn’t affect the health of my system overall, I’d be fine running this for a few days and reporting back. I do still suspect it’s not an issue on my end, but I’m also fine waiting a bit on the investigation because of the low impact at present.

Have you seen anything weird in the last few days? When things hit connection limits (internal to the VM), it doesn’t always correlate to extra memory or CPU usage. It’s one of those things you end up needing internal metrics to really nail down if it happens over and over.

I’ve not seen anything in a few days. I’ll post back here if I see it again.