Client network socket disconnected before secure TLS connection was established

ajbouh · July 31, 2021, 12:13am

I’ve been deploying my app via the GraphQL API and I’m having mixed results. The first deployment of an application seems to work fine. Trying to update the app seems to brick it.

Right now (on sandbox-cld-adamb-batch-8) I’m seeing request errors like: “Client network socket disconnected before secure TLS connection was established”

I don’t see anything interesting in the logs, so it seems like my traffic isn’t making it to userland in the VM.

Poking around the API manually suggests that the application may not be placed properly. The only solution I’ve found so far is to delete the app and start over, but that’s … not such a good way to do deployments.

Any ideas about how to debug what’s going on?

ajbouh · August 3, 2021, 8:01pm

(@kurt This is the issue I emailed about last week)

jerome · August 4, 2021, 11:56am

Hmm, this may happen if your app is restarting frequently because it becomes unhealthy. We only handle handshakes for apps that have at least 1 instance up and running.

Is it possible it’s crashing frequently?

ajbouh · August 4, 2021, 3:34pm

There’s nothing obvious in the logs about the app restarting. What’s strange is that my app reliably ends up in this state when I try to update it via GraphQL.

kurt · August 4, 2021, 7:09pm

Just to clarify, you deployed sandbox-cld-adamb-batch-8 and it worked fine, then you deployed again and it broke?

I’m not seeing a second version of that app, and what’s running there seems to be responding fine.

“Client network socket disconnected before secure TLS connection was established”

Just guessing here, but this error could be a number of things.

When you make a new app we actually add an A record and an AAAA record to dnsimple; we’ve noticed those are occasionally slow to propagate.
If you deploy once, then deploy again with empty services, this is the type of error I’d expect.
There could be some lag between a deploy and our proxy seeing your instance as healthy. These are growing pains on our end; we’re working furiously to make this fast.

ajbouh · August 4, 2021, 7:10pm

If it’s just lag, then the lag is measured in days in some cases. In one experiment, I saw that there was an AAAA record, but no A record.
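For anyone wanting to check for this condition from a script, here is a minimal Python sketch. It is not part of Fly's tooling: `resolve_all` and `missing_families` are hypothetical helper names, and the address in the example is a documentation placeholder. Note that `socket.getaddrinfo` goes through the local system resolver, so it reports whatever that resolver has cached, which may differ from the authoritative records.

```python
import ipaddress
import socket


def resolve_all(hostname, port=443):
    """Return the set of IP address strings the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def missing_families(addresses):
    """Return which of 'A' (IPv4) and 'AAAA' (IPv6) have no address in `addresses`."""
    versions = {ipaddress.ip_address(a).version for a in addresses}
    missing = []
    if 4 not in versions:
        missing.append("A")
    if 6 not in versions:
        missing.append("AAAA")
    return missing


# An IPv6 address alone, as in the AAAA-without-A case (placeholder address).
print(missing_families(["2001:db8::1"]))  # ['A']
```

In practice something like `missing_families(resolve_all("your-app.fly.dev"))` could be run after a deploy, with the hostname replaced by the real one; an empty list would mean both families resolved.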

kurt · August 4, 2021, 7:14pm

Oh that’s definitely broken. Can you share a list of GraphQL mutations you’re using? Also it looks like the app I was looking at is a replacement for another one that failed, will you get one to a failing state and leave it so we can have a look?

ajbouh · August 4, 2021, 8:44pm

sandbox-cld-adamb-batch-10 is wedged now

ajbouh · August 4, 2021, 8:56pm

Actually, it seems that the app has been deleted, which I guess is expected due to a bug on my end. Trying to better reproduce the wedged state…

ajbouh · August 4, 2021, 9:05pm

Ok, so it’s now back but not serving any new TLS requests

kurt · August 4, 2021, 9:07pm

It’s working here, but I think I may know what’s wrong.

Will you run dig a sandbox-cld-adamb-batch-10.fly.dev and dig aaaa sandbox-cld-adamb-batch-10.fly.dev, then compare those IPs to fly ips list?

New apps get entirely new sets of IP addresses, so you could have problems if you have DNS lookups cached on a system, delete an app, and then create a new one with the same name.
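The comparison between what DNS returns and what the app is actually allocated can also be sketched in Python. This is only an illustration under assumptions: `stale_addresses` is a hypothetical helper, the resolved addresses would have to be collected separately (from `dig` output or `socket.getaddrinfo`), the allocated ones copied from `fly ips list`, and the addresses below are documentation placeholders rather than real values.

```python
import ipaddress


def stale_addresses(resolved, allocated):
    """Return resolved addresses that are not in the app's allocated IP list.

    Both arguments are iterables of IP address strings. Addresses are parsed
    first so that different spellings of the same IPv6 address compare equal.
    """
    allocated_set = {ipaddress.ip_address(a) for a in allocated}
    parsed = (ipaddress.ip_address(r) for r in resolved)
    return sorted(str(ip) for ip in parsed if ip not in allocated_set)


# Placeholder data: the resolver returns an address the app no longer owns.
print(stale_addresses(["203.0.113.10"], ["198.51.100.7", "2001:db8::1"]))
# ['203.0.113.10']
```

A non-empty result would suggest that some resolver along the way is still handing out addresses from a previous incarnation of the app.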

ajbouh · August 4, 2021, 9:11pm

sandbox-cld-adamb-batch-10.fly.dev. 3599 IN AAAA 2a09:8280:1::1:606
sandbox-cld-adamb-batch-10.fly.dev. 3135 IN A     213.188.211.206

TYPE ADDRESS            CREATED AT 
v4   213.188.211.206    17m20s ago 
v6   2a09:8280:1::1:606 17m20s ago

» curl sandbox-cld-adamb-batch-10.fly.dev
curl: (56) Recv failure: Connection reset by peer

kurt · August 4, 2021, 9:29pm

Will you try curl -v and see what IP it’s using? If it’s using one of that app’s IPs, will you also show me the output of curl https://debug.fly.dev?

ajbouh · August 4, 2021, 9:34pm

Interesting!

» curl -v sandbox-cld-adamb-batch-10.fly.dev
*   Trying 213.188.211.191...
* TCP_NODELAY set
* Connected to sandbox-cld-adamb-batch-10.fly.dev (213.188.211.191) port 80 (#0)
> GET / HTTP/1.1
> Host: sandbox-cld-adamb-batch-10.fly.dev
> User-Agent: curl/7.64.1
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

=== Headers ===
Host: debug.fly.dev
Fly-Region: sjc
Via: 2 fly.io
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36
Sec-Fetch-Dest: document
Sec-Gpc: 1
X-Forwarded-For: 99.152.113.146, 77.83.140.164
Fly-Request-Id: 01FC9HCB4N2HXEPAMDS9GG94N3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
X-Forwarded-Ssl: on
X-Forwarded-Port: 443
Fly-Forwarded-Proto: https
Fly-Dispatch-Start: t=1628112825493876;instance=cab64964
Dnt: 1
X-Request-Start: t=1628112825493439
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Fly-Client-Ip: 99.152.113.146
Fly-Forwarded-Port: 443
Sec-Ch-Ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"
Sec-Ch-Ua-Mobile: ?0
Sec-Fetch-Site: none
X-Forwarded-Proto: https
Fly-Forwarded-Ssl: on

=== ENV ===
FLY_ALLOC_ID=cab64964-e950-23d6-ca66-22231cf6787b
FLY_APP_NAME=debug
FLY_PUBLIC_IP=2604:1380:45e1:3001:0:cab6:4964:1
FLY_REGION=sjc
FLY_VM_MEMORY_MB=128
HOME=/root
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
TERM=linux
WS=this
is
a
test
cgroup_enable=memory

2021-08-04 21:33:45.495971215 +0000 UTC m=+4379029.236292125

kurt · August 4, 2021, 9:38pm

Well the good news is, that’s definitely the old IP. It should be using 213.188.211.206. So it seems like your launch mutation is working fine!

If you use unique app names, you probably won’t have this issue. Alternatively, you can figure out how to flush your DNS cache (which is some voodoo that varies by operating system).
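When flushing the cache is not practical, one way to confirm that the app itself is healthy is to skip DNS entirely and connect straight to the newly allocated IP while still presenting the app's hostname. This is not something suggested in the thread, just a diagnostic sketch in Python; `tls_handshake_pinned` is a hypothetical helper name, and the hostname and IP would come from your own app and `fly ips list`.

```python
import socket
import ssl


def tls_handshake_pinned(hostname, ip, port=443, timeout=5.0):
    """Open a TLS connection to `ip` while presenting `hostname` for SNI and
    certificate verification; return the negotiated TLS version string."""
    context = ssl.create_default_context()
    # Connect by IP so no DNS lookup (and no stale cache entry) is involved.
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        # server_hostname sets SNI and is the name the certificate is checked against.
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.version()
```

If a call such as `tls_handshake_pinned("sandbox-cld-adamb-batch-10.fly.dev", "213.188.211.206")` completes the handshake while a normal request to the hostname fails, that would point at a stale DNS entry rather than a problem with the deployment.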

ajbouh · August 4, 2021, 9:41pm

Good point. Unfortunately I don’t know that I control the DNS caches throughout my infrastructure (in AWS, etc).

I think there was a bug in my Terraform module that was overly aggressive about creating/deleting on every deploy. Now that the bug is fixed I’ll keep an eye on things.