Hey Fly friends. We have a bit of a mystery here and I could really benefit from your insights once again.
I have a low-intensity but highly important service on Fly.io that has been seeing sporadic timeouts, as measured by updown.io, by my own health checkers, and by actual user complaints. Over the past month I've tried a bunch of things to narrow down the issue, and I think it may be something in Fly.io's routing of requests, but my confidence isn't super high. I'm obviously fallible and my own system may be causing this, but I think it's time to ask for your input here.
A recent failure occurred at 11:02pm Pacific (about two hours ago) on a request for /?health_check=True.
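For context, my own checker is essentially doing something like this (a minimal sketch; the .fly.dev hostname, 30-second timeout, and one-minute interval here are my illustration, not an exact copy of the monitor config):

```python
import time

import requests  # third-party: pip install requests

URL = "https://nikola-receipt-receiver.fly.dev/?health_check=True"  # assumed hostname

while True:
    started = time.monotonic()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    try:
        resp = requests.get(URL, timeout=30)
        print(f"{stamp} -> {resp.status_code} in {time.monotonic() - started:.1f}s")
    except requests.exceptions.RequestException as exc:
        # Timeouts and connection resets both land here.
        print(f"{stamp} -> FAILED after {time.monotonic() - started:.1f}s ({exc!r})")
    time.sleep(60)  # roughly the cadence of an external monitor
```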
One clue: it seems that at least some of the "timed out" calls are actually reaching my instances, but the caller apparently never gets a status code or closed connection back. My hint here is that I'm seeing multiple records for the same webhook ID, meaning the connecting service is retrying, presumably because it thought the request failed, while my instance happily recorded a success and "did the right thing" in those cases.
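To make that clue concrete, the duplicates show up in a handler roughly like this (a rough sketch only; Flask, the in-memory store, and the webhook_id field name are stand-ins for illustration, not necessarily my actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory tally of deliveries per webhook ID (a real setup would persist this).
seen_deliveries: dict[str, int] = {}

@app.route("/", methods=["GET", "POST"])
def receive():
    webhook_id = request.values.get("webhook_id")  # field name is illustrative
    if webhook_id:
        seen_deliveries[webhook_id] = seen_deliveries.get(webhook_id, 0) + 1
        if seen_deliveries[webhook_id] > 1:
            # The sender retried: it never saw our 200, even though we handled it.
            app.logger.warning(
                "delivery #%d for webhook %s",
                seen_deliveries[webhook_id],
                webhook_id,
            )
    return jsonify(ok=True), 200
```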
The service: nikola-receipt-receiver
What I've tried, with no apparent effect:
- checking the same service over HTTP instead of HTTPS (still saw failures)
- doing an "early return" in the service code, to rule out a long-running operation eating up the request time (timeouts continued; see the sketch after this list)
- increasing instance size
- increasing the number of instances
- changing regions, adding regions, removing regions
- adding RAM
- allocating an IP address
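For the "early return" item above, the idea was to answer the health check immediately, before doing any real work, so that handler runtime couldn't be what's eating the request. A minimal sketch of what I mean (Flask and the exact parameter handling are illustrative):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def receive():
    # Early return: answer the health check before touching any real work,
    # so handler runtime can't be the reason a probe times out.
    if request.args.get("health_check") == "True":
        return "ok", 200
    # ...normal receipt processing would go here...
    return "received", 200
```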
Attached: the report from updown.io. It looks like something changed in late February.
Thank you for your help.