Hey Fly friends. We have a bit of a mystery here and I could really benefit from your insights once again.
I have a low-intensity but highly important service on Fly.io that has been seeing sporadic timeouts, as measured by updown.io, by my own health checkers, and by actual user complaints. Over the past month I've tried a bunch of things to narrow down the issue, and I think it may be something in Fly.io's routing of requests, but my confidence isn't super high. I'm obviously fallible and my own system may be causing this, but I think it's time to ask for your input here.
A recent failure occurred at 11:02pm Pacific (about two hours ago) on a request for /?health_check=True.
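For context, my own checker is essentially doing something like this (a minimal sketch; the .fly.dev hostname, 30-second timeout, and one-minute interval here are my illustration, not an exact copy of the monitor config):

```python
import time

import requests  # third-party: pip install requests

URL = "https://nikola-receipt-receiver.fly.dev/?health_check=True"  # assumed hostname

while True:
    started = time.monotonic()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    try:
        resp = requests.get(URL, timeout=30)
        print(f"{stamp} -> {resp.status_code} in {time.monotonic() - started:.1f}s")
    except requests.exceptions.RequestException as exc:
        # Timeouts and connection resets both land here.
        print(f"{stamp} -> FAILED after {time.monotonic() - started:.1f}s ({exc!r})")
    time.sleep(60)  # roughly the cadence of an external monitor
```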
One clue: it seems that at least some of the "timed out" calls are actually reaching my instances, but the caller apparently never gets a status code or closed connection back. My hint here is that I'm seeing multiple records for the same webhook ID, meaning the connecting service is retrying, presumably because it thought the request failed, while my instance happily recorded a success and "did the right thing" in those cases.
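To make that clue concrete, the duplicates show up in a handler roughly like this (a rough sketch only; Flask, the in-memory store, and the webhook_id field name are stand-ins for illustration, not necessarily my actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory tally of deliveries per webhook ID (a real setup would persist this).
seen_deliveries: dict[str, int] = {}

@app.route("/", methods=["GET", "POST"])
def receive():
    webhook_id = request.values.get("webhook_id")  # field name is illustrative
    if webhook_id:
        seen_deliveries[webhook_id] = seen_deliveries.get(webhook_id, 0) + 1
        if seen_deliveries[webhook_id] > 1:
            # The sender retried: it never saw our 200, even though we handled it.
            app.logger.warning(
                "delivery #%d for webhook %s",
                seen_deliveries[webhook_id],
                webhook_id,
            )
    return jsonify(ok=True), 200
```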
The service: nikola-receipt-receiver
What I've tried, with no apparent effect:
- checking the same service over HTTP instead of HTTPS (still saw failures)
- doing an "early return" in the service code, to rule out a long-running operation eating up the request time (timeouts continued; see the sketch after this list)
- increasing instance size
- increasing the number of instances
- changing regions, adding regions, removing regions
- adding RAM
- allocating an IP address
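For the "early return" item above, the idea was to answer the health check immediately, before doing any real work, so that handler runtime couldn't be what's eating the request. A minimal sketch of what I mean (Flask and the exact parameter handling are illustrative):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def receive():
    # Early return: answer the health check before touching any real work,
    # so handler runtime can't be the reason a probe times out.
    if request.args.get("health_check") == "True":
        return "ok", 200
    # ...normal receipt processing would go here...
    return "received", 200
```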
Attached: the report from updown.io. It looks like something changed in late February.
Thank you for your help.