Elevated reports of Connection reset by peer (os error 104) / end of file before message length reached

I’ve started seeing elevated error rates in the last hour. We don’t have Machines in the jnb region, so I’d guess the errors are coming from one of Fly’s internal proxies:

[ERROR] jnb 3ce4 9080577c024128 reference-structuring--v2--production could not complete HTTP request to instance: error from user's HttpBody stream: error reading a body from connection: Connection reset by peer (os error 104)
[ERROR] lhr d88f 9080577c024128 reference-structuring--v2--production could not complete HTTP request to instance: error from user's HttpBody stream: error reading a body from connection: end of file before message length reached


Are there any routing issues going on similar to last time?

We’re now seeing these errors in other apps under the same org too, along with some new messages (we haven’t deployed anything today):

"could not complete HTTP request to instance: error from user's HttpBody stream: error reading a body from connection: Connection reset by peer (os error 104)"
"could not complete HTTP request to instance: connection error: Connection timed out (os error 110)

Also facing this issue! I’ve received over 3k “request aborted” errors on my Node.js server over the last 3 hours.
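If it helps anyone quantify these, here’s one way to count the aborts on the Node side (a minimal sketch, assuming Node 16+; the port and log message are placeholders, not from this thread):

```
import * as http from "node:http";

let abortedCount = 0;

const server = http.createServer((req, res) => {
  // 'aborted' fires when the client (or a proxy in front of the app)
  // drops the connection before the request finished arriving.
  req.on("aborted", () => {
    abortedCount += 1;
    console.warn(`request aborted (${abortedCount} so far): ${req.method} ${req.url}`);
  });
  res.end("ok");
});

server.listen(8080);
```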

The same thing happened last month, on the 11th/12th of January. Support’s explanation back then:

We had some routing trouble over the past few days, where many requests were mistakenly being sent to our Stockholm data center (even from the Americas)

Please look into it ASAP 🙏

EDIT: Machines are located in FRA

On my end, I’ve been getting very slow responses from Fly.io since this morning (region: LHR).

Even static assets (served by Fly rather than my backend) are very slow: it takes about 1 second to download lightweight (1.5 KB) SVGs. That contrasts sharply with the app’s usual speed; it’s normally very fast.

The app normally gets a score of 100 on Google’s PageSpeed Insights tool, but today performance is clearly degraded: the “Largest Contentful Paint” takes about 3 seconds (it’s usually less than 1).
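For what it’s worth, a crude way to spot-check that asset latency from a script (a sketch assuming Node 18+ with built-in fetch, run as an ES module; the URL is a placeholder):

```
// Hypothetical spot check: time a small static asset download end to end.
const t0 = performance.now();
const res = await fetch("https://my-app.fly.dev/images/logo.svg");
await res.arrayBuffer(); // ensure the body is fully downloaded
console.log(`${res.status} in ${(performance.now() - t0).toFixed(0)} ms`);
```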

@olivierphi

Can you share the output of `curl http://debug.fly.dev` and `mtr debug.fly.dev`, please?

Sure!
I could see nothing relevant in curl’s output, but mtr does seem to show network issues around London:
(sorry, I couldn’t find a way to apply code formatting so it’s a bit hard to read :pensive:)

Start: 2024-02-26T15:14:49+0000
HOST: xxxxx              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                   0.0%    10    6.1   4.3   3.0   6.1   1.1
  2.|-- 185.7.230.118              0.0%    10   12.5  13.8   9.7  20.4   3.6
  3.|-- ae0.an3.sgl-edi.fluency.n  0.0%    10    8.8   7.0   4.6  10.3   1.8
  4.|-- ae1.an1.sgl-edi.fluency.n  0.0%    10    7.0  15.1   5.9  59.4  17.9
  5.|-- ae7.cr1.sgl-edi.fluency.n  0.0%    10    4.2   6.3   4.2  11.4   2.6
  6.|-- ae0.cr2.sgl-edi.fluency.n  0.0%    10    4.6   5.5   4.2   9.2   1.6
  7.|-- ae2.cr2.for-gla.fluency.n  0.0%    10    4.7   5.6   4.1   7.4   1.1
  8.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
Start: 2024-02-26T15:15:05+0000
HOST: xxxxx             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                   0.0%    10    2.9   3.9   2.2   7.3   1.5
  2.|-- 185.7.230.118              0.0%    10   11.2  13.1   7.4  19.4   3.1
  3.|-- ae0.an3.sgl-edi.fluency.n  0.0%    10    8.3  19.1   5.7 120.1  35.5
  4.|-- ae1.an1.sgl-edi.fluency.n  0.0%    10    7.2   8.3   4.4  14.8   3.4
  5.|-- ae7.cr1.sgl-edi.fluency.n  0.0%    10    4.4   6.4   4.0  11.4   2.6
  6.|-- ae0.cr2.sgl-edi.fluency.n  0.0%    10    6.2   5.1   3.5   6.9   1.2
  7.|-- ae2.cr2.for-gla.fluency.n  0.0%    10    4.8   6.2   3.6  11.1   2.3
  8.|-- et-0-2-0.cr.for-gla.fluen  0.0%    10    8.3   5.7   3.8   8.3   1.6
  9.|-- ae0.cr2.kil-man.fluency.n  0.0%    10    8.9   9.3   8.2  10.8   0.9
 10.|-- ae1.cr1.kil-man.fluency.n  0.0%    10   12.8  10.5   8.0  14.8   2.1
 11.|-- ae2-112.cr1-man1.ip4.gtt.  0.0%    10   11.9   9.7   7.6  12.9   1.9
 12.|-- ae17.cr11-lon2.ip4.gtt.ne 60.0%    10   15.3  19.0  14.5  30.4   7.6   <======
 13.|-- ip4.gtt.net                0.0%    10   20.0  24.3  19.2  46.5   8.2
 14.|-- ae-7.r20.londen12.uk.bb.g  0.0%    10   18.7  21.2  18.3  26.0   2.9
 15.|-- ae-0.a02.londen12.uk.bb.g  0.0%    10   23.0  22.1  19.2  33.3   4.4
 16.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 17.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 18.|-- 43.232.40.185.ptr.anycast  0.0%    10   14.0  15.2  13.2  18.4   1.8
 19.|-- 37.16.16.7                 0.0%    10   12.8  13.5  12.6  14.6   0.7

Hi @olivierphi ,

To apply code formatting, wrap the text in triple backticks, each set on its own line: “```”

For curl, it’s typically useful to add the debug header; this will tell you (and us!) which edge handled your connection:

curl -I -H "flyio-debug: doit" https://debug.fly.dev

The output should be safe to share here, with one exception: the `ra` item in the JSON dict in the resulting `flyio-debug` header contains your IP address. Feel free to obfuscate that if needed.
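If you’d rather script the same check, this works from Node too (a sketch assuming Node 18+ with built-in fetch, run as an ES module; only the header names come from the curl command above):

```
// Fetch the debug endpoint and print which edge/worker handled the request.
const res = await fetch("https://debug.fly.dev", {
  headers: { "flyio-debug": "doit" },
});
console.log("fly-region:", res.headers.get("fly-region"));
// Redact the "ra" field before sharing -- it contains your IP address.
console.log(JSON.parse(res.headers.get("flyio-debug") ?? "{}"));
```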

  • Daniel

Here you go! :slightly_smiling_face:

HTTP/2 200 
fly-region: lhr
remote-addr: 172.16.132.106:33500
date: Mon, 26 Feb 2024 16:23:44 GMT
content-length: 552
content-type: text/plain; charset=utf-8
server: Fly/17d0263d (2024-02-15)
via: 2 fly.io
flyio-debug: {"n":"edge-nac-lhr1-20ac","nr":"lhr","ra":"xxx.xxx.xxx.xxx","rf":"Verbatim","sr":"lhr","sdc":"lon1","sid":"4d89646f914587","st":0,"nrtt":0,"bn":"worker-cf-lon1-5243"}
fly-request-id: 01HQK3NE3DVAKMBRDBFRDSNCSW-lhr

Hey, we’re also running into some connection issues. We’re proxying through Vercel, so unfortunately we can’t get a ton of debug information at the moment, but we’ve been getting ROUTER_EXTERNAL_TARGET_ERROR while trying to proxy to our Fly instance.

Strangely, I don’t see any errors in our Fly logs, so it might be something else, or I might be looking for errors in the wrong place.

My debug curl output, in case it’s helpful:

HTTP/2 200
fly-region: sjc
remote-addr: 172.16.0.2:53038
date: Mon, 26 Feb 2024 16:54:42 GMT
content-length: 618
content-type: text/plain; charset=utf-8
server: Fly/17d0263d (2024-02-15)
via: 2 fly.io
flyio-debug: {"n":"edge-nac-sjc1-6443","nr":"sjc","ra":"xxx","rf":"Verbatim","sr":"sjc","sdc":"sv15","sid":"0e2866e35be867","st":0,"nrtt":1,"bn":"worker-pkt-sv15-c52a"}
fly-request-id: 01HQK5E4Z2NB74ABS565T3P7R9-sjc

I don’t know if it’s something you did on your end or if it was something fixed somewhere in the network between me and the LHR datacenter, but… It’s fixed now! :balloon:

I would appreciate a post-mortem on this one. If my production app had been in use today, it would have been broken and unusable this morning, because all Stripe webhooks were failing.

I’m really worried about going live, having something like this happen, and then losing my first customers.
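One consolation on the Stripe side: Stripe retries failed webhook deliveries with backoff (for up to a few days in live mode), so an outage like this usually means delayed events rather than lost ones, provided your handler is idempotent. A minimal sketch, assuming Express and the official `stripe` package (the route, env var names, and handler are placeholders):

```
import express from "express";
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const app = express();

// Stripe signature verification needs the raw (unparsed) request body.
app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  (req, res) => {
    let event: Stripe.Event;
    try {
      event = stripe.webhooks.constructEvent(
        req.body,
        req.headers["stripe-signature"] as string,
        process.env.STRIPE_WEBHOOK_SECRET!,
      );
    } catch {
      res.status(400).send("bad signature");
      return;
    }
    // Dedupe on event.id so Stripe's automatic retries are safe to
    // process more than once; then hand off to your own logic.
    console.log("received event", event.id, event.type);
    res.sendStatus(200);
  },
);

app.listen(3000);
```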
