Debugging 502 responses

:wave: Howdy! My app consistently receives 2-4 req/sec. Most requests are handled successfully, but ~5-10 per minute result in unexpected 502 responses.

These 502s show up in the Fly metrics dashboard for my app, and I can see them in the instrumentation of the client connecting to the app, but there is no explanation for them in the app itself or its logs.

At first I wondered whether these were cases where the Fly global load balancer was aborting the request after some timeout, but most of the 502s take less time (1-3 seconds) than many successful requests (up to 10 seconds, the app's own timeout).

Here's an example 502 response from the client's logs, including the relevant Fly request and instance ID headers, in case it helps debug the issue:

[E 210415 19:21:16 qr_urls:90] error: status=502 body='' headers={'date': 'Thu, 15 Apr 2021 19:21:16 GMT', 'fly-request-id': '01F3BFNWD3N58WKSF8F36WNA58', 'content-length': '0', 'via': '1.1 fly.io', 'server': 'Fly/86dfcb7 (2021-04-12)'} elapsed_seconds=3.06265

Does anyone have any tips for how to debug these failed requests?

We're not currently exposing the reasons for these errors. Some of them originate within Fly itself, often from network issues we can't handle gracefully.

I looked up this specific request ID and found that the error was due to an "incomplete" message. That is, there was a race between the use (or reuse) of a connection and that connection being closed.

I'm currently checking whether this involves connections between our own servers, rather than the connection between us and your app.

We should expose this in the logs and the metrics soon!

After investigating, I can confirm this error came from between the host and your instance: it appears the TCP connection was closed before the request/response was complete.

What kind of app are you running? Is it possible it might be closing connections too quickly under certain conditions?

Aha, I think I might know the issue: the app sets a relatively tight deadline for writing the response to the client, so it might close the connection before the client has finished reading it. Let me try adjusting that timeout to see if it has any impact on these 502s. (I should have thought of this before posting here, I’m sorry for the noise if this turns out to be the problem!)

The app is written in Go and “resolves” shortened URLs, primarily from social media.
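
To make the suspicion concrete, here's a minimal sketch of the failure mode I have in mind. This is not the real server code: the handler delay, the 2-second write deadline, and the /resolve route are made-up values for illustration. In Go's net/http, WriteTimeout is reset when a request's headers are read and covers the handler's run time too, so a budget tighter than the slowest resolution means the server closes the TCP connection before the response is written, which a proxy in front would report as a 502.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/resolve", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for following a shortened URL's redirects; in the real
		// app this can take up to 10 seconds (the app's own timeout).
		time.Sleep(5 * time.Second)
		fmt.Fprintln(w, "https://example.com/final-destination")
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// WriteTimeout covers everything after the request headers are
		// read, including the handler itself. A 2s budget against a
		// handler that can run for 10s means the server hangs up
		// mid-response, and the proxy in front reports a 502.
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 2 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```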

Every bug ever is either DNS or a timeout. :smiley:

This was definitely the problem. Can you guess where I started tuning the server timeouts? (The non-502 errors in the graph are expected.)
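
For anyone who finds this thread later, the fix was essentially just giving the write deadline headroom over the app's own 10-second request timeout. Rough sketch below; the exact values and the newServer helper are mine for illustration, not a general recommendation:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// resolveMux stands in for the app's existing handler.
	resolveMux := http.NewServeMux()
	log.Fatal(newServer(resolveMux).ListenAndServe())
}

// newServer gives the write deadline headroom over the app's own
// 10-second per-request timeout, so the handler always finishes
// (or fails) before the server hangs up on the client.
func newServer(handler http.Handler) *http.Server {
	return &http.Server{
		Addr:         ":8080",
		Handler:      handler,
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 15 * time.Second, // previously much tighter than the handler budget
		IdleTimeout:  60 * time.Second,
	}
}
```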

Thanks for the help figuring this out! Though this was definitely my own fault, it would be nice if Fly could surface these kinds of errors to applications, either in the logs or in metrics.

I’m loving the product so far, keep up the good work!

Thank you!

That’s a good change :slight_smile:

Adding logs for this is definitely in the plans. Lots of low-hanging fruit in the logging department.