Connection reset by peer, errno 104

We have an internal tool that we recently migrated from DigitalOcean to Fly.io. Among other things, it fetches assets from various websites.

Recently, I’ve noticed that it’s unable to fetch assets from a subset of websites and instead fails with either a Connection reset by peer error or a timeout, even though the websites are accessible locally as well as from our previous DO servers (which never had any issues).

So I went digging in:


1. Request with curl

I started by running fly ssh console and making a simple curl request. It failed with the following error:

curl -v https://www.ascentvictorypark.com/
*   Trying 198.190.14.13:443...
* Connected to www.ascentvictorypark.com (198.190.14.13) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=ascentvictorypark.com
*  start date: Mar 29 09:00:12 2022 GMT
*  expire date: Jun 27 09:00:11 2022 GMT
*  subjectAltName: host "www.ascentvictorypark.com" matched cert's "www.ascentvictorypark.com"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
> GET / HTTP/1.1
> Host: www.ascentvictorypark.com
> User-Agent: curl/7.79.1
> Accept: */*
>
* OpenSSL SSL_read: Connection reset by peer, errno 104
* Closing connection 0
curl: (56) OpenSSL SSL_read: Connection reset by peer, errno 104

2. Accessible by other Services (DigitalOcean)

I ran the same command locally and it succeeded, and the website is also accessible directly.

I wasn’t sure what was going on at this point and wanted to isolate the issue. I suspected it was limited to the networking at Fly.io and wanted to confirm that. I SSH’ed back into the old DigitalOcean instance for the app and made the same curl request, which succeeded without issues.


3. Accessible by other Apps

Next step: isolate the issue further. I have a few other apps deployed to Fly.io in various regions throughout the world. I SSH’ed into a few of them and ran the same command, and it succeeded on all of them as well.


4. Changing IPs & Regions

This led me to believe the issue was isolated to my current app instance only. Maybe the server/IP somehow landed on the blacklist/firewall of every single one of these websites at the same time. It could happen.

So changing the IPs and regions would solve the problem, right? Wrong.

I released old IPs and assigned new ones (both v4 and v6), changed the region multiple times, and restarted the app. But the curl command always returned the same error.


5. Deployed a new App

I then launched a brand new app on Fly.io in a completely different region (but with the same code + Dockerfile) and retried the curl command. It failed again with the same error.


So something very weird is going on here. Could this be an issue with my Dockerfile? Seems unlikely though.

Why are requests to some websites successful for some apps but not for others?

Interesting! Thank you for the detailed and organized write-up of what you’ve tried so far. It does seem to me like you’ve isolated the issue to the codebase/Dockerfile (presuming that these were used on your DO instances).

As you’re probably aware, there are a few idiosyncrasies to using Dockerfiles on our platform. I don’t have many specific ideas off the bat about what might be causing only a subset of websites to drop the connection only from this particular app.

If you’re comfortable sharing your Dockerfile or fly.toml, someone here might be able to offer more specific advice.

6. MCVE with Dockerfile

As suspected, the issue was in fact with my Dockerfile. It took me over 40 deploys on existing and new apps, lots of trial and error, and occasional head-scratching, but I was able to create an MCVE of the issue.

I deployed a simple Elixir script that serves web requests, with this minimal Fly config:

app = "urbanave-test"

kill_signal = "SIGTERM"
kill_timeout = 5

and this Dockerfile:

ARG BUILDER_IMAGE="hexpm/elixir:1.12.0-erlang-23.3.4.14-alpine-3.15.3"
FROM ${BUILDER_IMAGE} as builder

WORKDIR /app
COPY app.exs app.exs

RUN mix local.hex --force && mix local.rebar --force
RUN elixir app.exs

RUN apk add curl

CMD ["elixir", "app.exs", "server"]

So this is probably an issue with the Alpine image I’m using or the curl package that’s installed.

Note: Edited to further simplify and reduce the scope of MCVE

7. Maybe something with TLS v1.2?

While trying to reduce the issue to the MCVE above, I was also trying to find similarities between all the websites that were dropping the connection. I made manual curl requests to about 30 of the websites (both those that dropped the connection and those that returned the response as expected).

curl -v $WEBSITE_URL -o out.html

There was one line in the verbose output that was common to all the websites that returned the error:

# ... other verbose output
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
# ... other verbose output
* OpenSSL SSL_read: Connection reset by peer, errno 104
* Closing connection 0
curl: (56) OpenSSL SSL_read: Connection reset by peer, errno 104

Other websites using TLSv1.2 worked without issues, but every website that negotiated both TLSv1.2 and the ECDHE-RSA-AES128-GCM-SHA256 cipher returned the error.
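For anyone repeating this comparison across many sites, here is a minimal Python sketch that pulls the negotiated protocol and cipher out of saved `curl -v` transcripts and tallies them against the reset error. The function names and the `SAMPLE` text are illustrative assumptions; the only thing taken from the output above is the wording of curl's "SSL connection using" and "Connection reset by peer" lines.

```python
import re
from collections import Counter

# Matches curl's "* SSL connection using <protocol> / <cipher>" verbose line.
SSL_LINE = re.compile(r"^\* SSL connection using (\S+) / (\S+)", re.MULTILINE)
RESET_MARKER = "Connection reset by peer"


def summarize(verbose_output):
    """Return (protocol, cipher, was_reset) for one `curl -v` transcript."""
    match = SSL_LINE.search(verbose_output)
    protocol, cipher = match.groups() if match else (None, None)
    return protocol, cipher, RESET_MARKER in verbose_output


def tally(transcripts):
    """Count how often each (protocol, cipher, was_reset) combination occurs."""
    return Counter(summarize(text) for text in transcripts)


# Illustrative transcript, trimmed to the lines quoted in this post.
SAMPLE = """\
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* OpenSSL SSL_read: Connection reset by peer, errno 104
* Closing connection 0
curl: (56) OpenSSL SSL_read: Connection reset by peer, errno 104
"""
```

curl writes its verbose output to stderr, so each transcript can be captured with something like `curl -v $WEBSITE_URL -o out.html 2> site.log` and then read into `tally`. A shared protocol/cipher pair is only a correlation: it may simply mean the failing sites run on the same server stack.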

I’m not familiar with the workings of SSL/TLS/OpenSSL so I’m not sure if this is just my pattern-seeking monkey brain or if something is actually there.

8. Other apps no longer working

I’ve been trying various base images, including both Ubuntu and Alpine, but I’ve been getting the same error with all of them.

During this time I had to restart another one of my apps where the command was previously working. But when I ran the curl command again after the restart, it started giving the exact same error.

Now I’m completely stumped. I’ve no idea what’s going on. The command is still working on another one of my apps, and I’m going to avoid deploying/restarting it for now so I can maybe compare what else is different between the two.

So it does look like the issue is at least partially related to Fly’s networking, because if I build the Docker image locally and run the command, it works without issues.

I also deployed the same Dockerfile above to two different regions on Fly. The command works in the ord region but fails in the sin region. On the other hand, the command fails in both regions for a different app with a slightly different Dockerfile.

Changing your app’s IP assignments won’t help here because all outgoing connections use our servers’ IPs, not your app’s. That is, until we figure out a better way.

We’ve been looking at this from our end.

Testing

Yesterday, I ran the curl from each of our servers and it failed to get a response within 5 seconds on ~6 servers, one of which is hosting your app instance.

I ran it again this morning and it’s only failing on 1 host.
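A per-host check like the one described above could be sketched in Python as follows. This is an assumption-laden illustration, not the tooling that was actually used: `probe` and `classify` are hypothetical names, and the 5-second limit is the only detail taken from the test above.

```python
import socket
import urllib.error
import urllib.request


def classify(exc):
    """Map an exception from a failed fetch to a short label."""
    # urllib wraps some low-level errors in URLError; unwrap to the real cause.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(reason, ConnectionResetError):
        return "reset"  # the errno 104 (ECONNRESET) case curl reports on Linux
    return "other"


def probe(url, timeout=5.0):
    """Fetch url once and report 'ok (...)', 'http <code>', or a failure label."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"ok ({response.status})"
    except urllib.error.HTTPError as exc:
        # The server answered, so this is not the connection-level failure we want.
        return f"http {exc.code}"
    except OSError as exc:
        return classify(exc)
```

Running `probe("https://www.ascentvictorypark.com/")` on each host and comparing the labels separates hosts that see a reset or timeout from hosts that get a response.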

I also ran it through an external service with many locations and it seems to consistently fail from 1 location. I only thought of this today and wish I had tried it yesterday.

https://wheresitup.com/demo/results/629b8a13d4a4ec4dfb0f3461
^ here, it only fails in Scranton. I ran a test for a well-known search engine site and it worked fine from Scranton, so it’s likely not a problem with that location itself.

Our theory

We think this might be hitting a bad instance of a load balancer for www.ascentvictorypark.com and getting “stuck” to the bad instance because it’s probably hashing by source IP.
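To make the theory concrete, here is a small Python simulation of a load balancer that picks a backend by hashing the client's source IP. Everything in it is an assumption for illustration: the real balancer's algorithm is unknown, SHA-256 modulo the backend count is just a stand-in, and the backend names and IP addresses (taken from documentation ranges) are made up.

```python
import hashlib

# Hypothetical backend pool; the real site's topology is not known.
BACKENDS = ["lb-a", "lb-b", "lb-c", "lb-d"]


def pick_backend(source_ip, backends=BACKENDS):
    """Choose a backend deterministically from the client's source IP."""
    digest = hashlib.sha256(source_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def fetch(source_ip, unhealthy, backends=BACKENDS):
    """Simulate one request: it is reset if the chosen backend is unhealthy."""
    return "reset" if pick_backend(source_ip, backends) in unhealthy else "ok"


def retry(source_ip, unhealthy, attempts=5):
    """Retry from the same source IP and collect the result of every attempt."""
    return [fetch(source_ip, unhealthy) for _ in range(attempts)]
```

Under this scheme a given source IP always maps to the same backend, so `retry` returns the same result on every attempt: a host pinned to an unhealthy backend fails every time, while hosts with other source IPs keep succeeding. That would match the pattern of some servers failing consistently while others work.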

If you know about this site: Can you tell us more about where this other site is being hosted? Can you try and contact their provider and get them to look at their own load balancers?

For now, this should work for your app unless your instance gets scheduled on a specific server in ORD. This shouldn’t happen since you have a volume and therefore your instance will always be scheduled on a server where your volume is available, if you’re mounting it in your fly.toml.


Hey @jerome, thanks for looking into this issue!

I think your theory makes a lot of sense (and I’m feeling a little dumb for not considering that the issue might be on the other side).

To fetch assets we’re using a job processing library (Oban) with a simple retry and back-off strategy. Since my last reply in this thread, I had disabled Oban on our Fly instance so it hasn’t made any requests since.

I just retried and can confirm that the requests are going through, while the DO instance that was fetching them without issues before has suddenly started dropping connections!


Interestingly, all websites that were dropping the connection previously are working fine now. I didn’t realize this before, but all of them use the same SaaS provider, which gives more credibility to your theory. I will reach out to them and report the issue!


Solution

I’ve come up with a very simple solution for now. Since we can’t change regions of apps that have volumes mounted, I’ve split the app into two.

  1. First one that just serves the app to end-users with Oban disabled in the main dfw region with the volume.
  2. Second one as a worker service with only Oban enabled which I can just move to a different region whenever we encounter this issue again (until their provider fixes this on their end).

Thank you so much for helping out!


Also, thanks for linking to wheresitup.com! Very useful service!
