Error: Client network socket disconnected before secure TLS connection was established

Hi,

We’ve migrated our app from Heroku to fly.io. Since then, our only client has been getting network errors about a dozen times per day when making HTTP requests to our app hosted on fly.io.

Here is the error that happens on their side when they make a request toward our app:

    Error: Client network socket disconnected before secure TLS connection was established
        at connResetException (internal/errors.js:607:14)
        at TLSSocket.onConnectEnd (_tls_wrap.js:1544:19)
        at TLSSocket.emit (events.js:327:22)
        at TLSSocket.EventEmitter.emit (domain.js:467:12)
        at endReadableNT (internal/streams/readable.js:1327:12)
        at processTicksAndRejections (internal/process/task_queues.js:80:21) {
      code: 'ECONNRESET',
      path: null,
      host: 'api.hootify.io',
      port: 443,
      localAddress: undefined,
      config: {
        url: '/private/notifications',
        method: 'post',
        data: '{"id": [TRUNCATED]}',
        headers: {
          Accept: 'application/json, text/plain, */*',
          'Content-Type': 'application/json;charset=utf-8',
          Authorization: 'Bearer [REMOVED]',
          'User-Agent': 'axios/0.21.1',
          'Content-Length': 384
        },

Client Details
The client app is hosted on AWS (running multiple instances with Elastic Beanstalk) and is written in Node, using axios to make HTTP requests to ours. As said, most requests go through to our app on fly.io, but every now and then (roughly a dozen times per day) a request fails with the error above.
It also makes about 100–120 requests per minute to our app.
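
For context, the requests are plain axios POSTs against the endpoint shown in the error above. The sketch below is purely illustrative (the `postWithRetry` helper and the retry/backoff numbers are hypothetical, not our actual client code), but it shows the shape of the traffic and one way we could retry on ECONNRESET while debugging:

    // Illustrative only: axios POSTs to the fly.io app, with a hypothetical
    // retry-on-ECONNRESET wrapper (not our production client code).
    const axios = require('axios');

    const api = axios.create({
      baseURL: 'https://api.hootify.io',
      timeout: 10000, // fail fast instead of hanging on a dead socket
    });

    async function postWithRetry(path, body, attempts = 3) {
      for (let i = 1; i <= attempts; i++) {
        try {
          return await api.post(path, body);
        } catch (err) {
          const transient = err.code === 'ECONNRESET' || err.code === 'ETIMEDOUT';
          if (!transient || i === attempts) throw err;
          await new Promise((resolve) => setTimeout(resolve, 250 * i)); // simple backoff
        }
      }
    }

    // e.g. postWithRetry('/private/notifications', { id: '...' });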

Fly.io App Details
Our app runs two instances on fly.io in Europe (ams and fra). It is written in Node and uses Express as the server to answer HTTP requests.

Can someone please help with this? Are there any network hiccups on fly.io’s side? How can we debug what is going wrong?

Also note that we are using the log sink (NATS and Vector) provided by fly.io to ship logs from our apps to LogDNA, and that sink also reports interrupted-connection errors in our app logs:

    Feb 2 11:16:19 6b399e48 vector WARN sink{component_kind="sink" component_id=logdna component_type=logdna component_name=logdna}:request{request_id=36262}: vector::sinks::util::retries: Retrying after error. error=Failed to make HTTP(S) request: connection closed before message completed
    Feb 2 11:57:15 6b399e48 vector WARN sink{component_kind="sink" component_id=logdna component_type=logdna component_name=logdna}:request{request_id=36514}: vector::sinks::util::retries: Retrying after error. error=Failed to make HTTP(S) request: connection error: Connection reset by peer (os error 104)

Curiously, the number of errors has dropped over the last few days to a handful per day (instead of a dozen).

Did something change?

I’d still like to get rid of these errors completely.

FRA has had some capacity issues over the last week. That should be improving daily, which may explain the drop in errors. Would you be able to run exclusively in AMS for a while to see if that’s related?

The LogDNA errors probably come from LogDNA’s side (the ‘peer’). I’m not sure why they happen, but it might be worth contacting LogDNA support about it.

I’m not sure it is related to last week’s outage in FRA, as I’ve been having these errors for months.

I’ve moved services out of FRA for now (to AMS, with LHR and CDG as backups) and will monitor how it goes.

Is this going through IPv6?

We fixed an issue with a few misconfigured hosts, in FRA and AMS, related to IPv6 this morning.

I don’t know whether the request is routed through IPv6 or not; I could check.
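
To check, something like this rough sketch should work from the client box: it prints the A and AAAA records the client sees for our hostname and then makes one request pinned to IPv4 (the `/` path and the IPv4-only agent are just for the test):

    // Rough diagnostic sketch: does the hostname resolve to IPv6 from the
    // client, and does a request forced onto IPv4 still succeed?
    const dns = require('dns').promises;
    const https = require('https');
    const axios = require('axios');

    async function main() {
      console.log('A    records:', await dns.resolve4('api.hootify.io').catch(() => []));
      console.log('AAAA records:', await dns.resolve6('api.hootify.io').catch(() => []));

      // Pin the socket to IPv4 to rule IPv6 routing in or out.
      const v4Only = new https.Agent({ family: 4 });
      const res = await axios.get('https://api.hootify.io/', { httpsAgent: v4Only });
      console.log('IPv4-only request status:', res.status);
    }

    main().catch(console.error);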

But I noticed that the number of errors went from a dozen per day to only a couple, which is a great improvement.

Still curious about the last couple of errors I have per day though.

Hey there!

If you can curl https://debug.fly.dev from your client’s app (the one making requests to your app), we could determine which region they’re hitting. A traceroute could also do the same.

This could be a network issue between AWS and us where there’s packet loss causing handshakes to be slower than they should be and triggering a deadline on our end.

Perhaps this is a problem with how axios interacts with our proxy. I’m seeing a few errors related to a deadline we have for parsing a ClientHello message. The client has 2 seconds to start sending any data when handshaking TLS.
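
If you want to measure this from your client’s side, a rough sketch like the one below times the full handshake against your app. It’s only an approximation of our ClientHello deadline, but repeated samples anywhere near 2 seconds would point at slow handshakes (host and port are taken from the error report):

    // Approximate check: time the TLS handshake from the client to the app.
    // The 2s figure above is the proxy's ClientHello deadline; total handshake
    // time is only a stand-in for it, but consistently large numbers would be telling.
    const tls = require('tls');

    function timeHandshake(host = 'api.hootify.io', port = 443) {
      const started = process.hrtime.bigint();
      const socket = tls.connect({ host, port, servername: host }, () => {
        const ms = Number(process.hrtime.bigint() - started) / 1e6;
        console.log(`TLS handshake completed in ${ms.toFixed(1)} ms`);
        socket.end();
      });
      socket.on('error', (err) => console.error('handshake failed:', err.code));
    }

    timeHandshake();
    setInterval(timeHandshake, 60 * 1000); // sample once a minute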

Do you have more details about your client’s app? Does it only happen when trying to reach your app or does this happen for other sites?

If this is a node.js app running a lot of event loop tasks, the event loop can become “clogged” and may delay sending data through a connection.
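
Node’s built-in perf_hooks can measure that directly; a sketch like this (purely diagnostic, nothing fly.io-specific) would show whether the client’s event loop ever stalls long enough to matter:

    // Diagnostic sketch: watch for event loop stalls with perf_hooks.
    // Sustained delays of hundreds of milliseconds would support the
    // "clogged event loop" theory.
    const { monitorEventLoopDelay } = require('perf_hooks');

    const histogram = monitorEventLoopDelay({ resolution: 20 });
    histogram.enable();

    setInterval(() => {
      const toMs = (ns) => Math.round(ns / 1e6);
      console.log(
        `event loop delay ms: p50=${toMs(histogram.percentile(50))} ` +
        `p99=${toMs(histogram.percentile(99))} max=${toMs(histogram.max)}`
      );
      histogram.reset();
    }, 10000);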

It’s possible this is entirely our fault. Maybe this deadline we have is too severe or something else is going on :thinking:

For what it’s worth, we’re constantly monitoring multiple apps on our platform from all AWS regions and we get notified if anything goes wrong (including TLS handshakes, connection and request timeouts).

Thanks for the input. I tried to run curl from the app running in AWS, but there were some issues remoting into that instance. I’ll try to spend more time on it in a few days.

Regarding a possible axios or Node.js event loop issue: although plausible, I find it unlikely, as the app running in AWS makes many more axios requests to other services and apps, and those never hit any of these TLS connection errors.

I’ve removed the IPv6 address with flyctl ips release XXX. I went nearly 20 hours without any errors, but I just saw a new one.

    ❯ flyctl ips list
    TYPE  ADDRESS          REGION  CREATED AT
    v4    123.123.123.123  global  2021-11-08T...