SSL cert handshake failed abruptly at 5:01 EST

Hello,

My Fly.dev instance cert was working fine up until around 10 PM GMT / 5 PM EST.

I have no idea why it stopped working, and deleting the cert and re-adding it does nothing.

Will you run curl -v https://<app>.fly.dev -D - -o /dev/null and share the output here? There won’t be anything sensitive in it.

Thanks, I didn’t think of getting the curl logs.

Here’s the output; the HTTPS handshake is failing on the direct URL as well:

❯ curl -v https://blockade.fly.dev -D - -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2a09:8280:1::3:b4a9:443...
* Connected to blockade.fly.dev (2a09:8280:1::3:b4a9) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [230 bytes data]
* LibreSSL SSL_connect: Connection reset by peer in connection to blockade.fly.dev:443
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (35) LibreSSL SSL_connect: Connection reset by peer in connection to blockade.fly.dev:443

I’ve tried making a new sample Express app from scratch and I’m still getting the same issue.

I’m at least able to SSH into the instances, and I can see they’re listening on internal port 8080, which lines up with the fly.toml definition.
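For reference, a fly.toml services block that maps the public 80/443 edge to internal port 8080 looks roughly like this (the handlers and edge ports below are the usual defaults, not necessarily exactly what’s in my file):

  [[services]]
    internal_port = 8080   # what the app actually listens on
    protocol = "tcp"

    [[services.ports]]
      handlers = ["http"]
      port = 80

    [[services.ports]]
      handlers = ["tls", "http"]
      port = 443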

I am experiencing the same issue.

I have noticed it affects only IPv6 connectivity; things are OK over IPv4.
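If anyone else wants to confirm, curl can be forced onto a single address family:

  ❯ curl -4 -sv https://<app>.fly.dev -o /dev/null    # IPv4 only: handshake completes
  ❯ curl -6 -sv https://<app>.fly.dev -o /dev/null    # IPv6 only: connection reset by peer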

Same! All of my down apps are also on v6 addresses. That could be a clue.

I have used the hello world Node app from the docs as a new example, and even this fails with an ERR_CONNECTION_RESET error.
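For anyone trying to reproduce, the test app is nothing more than a hello-world server listening on 8080, along the lines of this sketch (not the exact docs code):

  // index.js: minimal Express hello-world, listening on the internal port from fly.toml
  const express = require("express");
  const app = express();
  const port = process.env.PORT || 8080;

  app.get("/", (req, res) => {
    res.send("Hello, World!");
  });

  app.listen(port, () => console.log(`listening on port ${port}`));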

I think there’s an outage going on here.

@kurt could we have this investigated?

I can confirm that it’s IPv6 that fails:

* Connected to xxxx.net (2a09:8280:1::3:b885) port 443 (#0)
* schannel: disabled automatic use of client certificate
* ALPN: offers http/1.1
* schannel: failed to receive handshake, SSL/TLS connection failed
* Closing connection 0
curl: (35) schannel: failed to receive handshake, SSL/TLS connection failed

While the same app over IPv4 works:

*   Trying 66.241.124.70:443...
* Connected to xxxx.fly.dev (66.241.124.70) port 443 (#0)
* schannel: disabled automatic use of client certificate
* ALPN: offers http/1.1
* ALPN: server accepted http/1.1
> GET / HTTP/1.1
> Host: chainspider.fly.dev
> User-Agent: curl/7.83.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
...

It’s been a rather frustrating afternoon trying to deploy a second app; I couldn’t figure out why some clients worked and others didn’t until I clued in that it’s an IPv6-specific issue.

@mike-ravkine at least you have some apps up. I have a production app down for hours now.

Unfortunately, I don’t see a way to launch a new app via the CLI with a v4 address:

Usage:
  flyctl launch [flags]

Flags:
      --auto-confirm                  Will automatically confirm changes when running non-interactively.
      --build-arg strings             Set of build time variables in the form of NAME=VALUE pairs. Can be specified multiple times.
      --build-only                    Build but do not deploy
      --build-secret strings          Set of build secrets of NAME=VALUE pairs. Can be specified multiple times. See https://docs.docker.com/develop/develop-images/build_enhancements/#new-docker-build-secret-information
      --build-target string           Set the target build stage to build if the Dockerfile has more than one stage
      --copy-config                   Use the configuration file if present without prompting
      --detach                        Return immediately instead of monitoring deployment progress
      --dockerfile string             Path to a Dockerfile. Defaults to the Dockerfile in the working directory.
      --dockerignore-from-gitignore   If a .dockerignore does not exist, create one from .gitignore files
  -e, --env strings                   Set of environment variables in the form of NAME=VALUE pairs. Can be specified multiple times.
      --generate-name                 Always generate a name for the app, without prompting
  -h, --help                          help for launch
      --ignorefile string             Path to a Docker ignore file. Defaults to the .dockerignore file in the working directory.
  -i, --image string                  The Docker image to deploy
      --image-label string            Image label to use when tagging and pushing to the fly registry. Defaults to "deployment-{timestamp}".
      --local-only                    Only perform builds locally using the local docker daemon
      --name string                   Name of the new app
      --nixpacks                      Deploy using nixpacks to build the image
      --no-cache                      Do not use the build cache when building the image
      --no-deploy                     Do not prompt for deployment
      --now                           Deploy now without confirmation
  -o, --org string                    The target Fly organization
      --path string                   Path to the app source root, where fly.toml file will be saved (default ".")
      --push                          Push image to registry after build is complete
  -r, --region string                 The target region (see 'flyctl platform regions')
      --remote-only                   Perform builds on a remote builder instance instead of using the local docker daemon
      --strategy string               The strategy for replacing running instances. Options are canary, rolling, bluegreen, or immediate. Default is canary, or rolling when max-per-region is set.

Global Flags:
  -t, --access-token string   Fly API Access Token
  -j, --json                  json output
      --verbose               verbose output

I don’t see a way to force an IPv4 address; I’ll keep looking.

I found a way to allocate an IPv4 address, but I’m not sure how to assign it to an app:

If you can control the DNS, drop the AAAA record. If your app is using fly.dev, I don’t think you can do anything to make clients prefer IPv4.
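You can check what records a hostname is currently serving with dig (yourdomain.example is a placeholder for your own domain):

  ❯ dig +short A yourdomain.example       # IPv4 (A) records
  ❯ dig +short AAAA yourdomain.example    # IPv6 (AAAA) records; dropping these removes the broken path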

Thanks! I’ll try that next.

PS - that command uses the current app directory as context, so it does assign the IPv4 address to the app. It is a workaround if the “naked” [app].fly.dev domain is an option for you.
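If you’re not in the app directory, I believe you can also target the app explicitly:

  ❯ fly ips allocate-v4 -a <app-name>
  ❯ fly ips list -a <app-name>     # check that the new v4 shows up alongside the v6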

Alright, for tonight I want to get back to my family. I’m not sure if this helps those using a certificate for their own domain, but if you can use [app].fly.dev directly, here’s a workaround:

  1. cd into your app directory
  2. fly ips allocate-v4
  3. https://[app].fly.dev opens and the SSL handshake succeeds!

Why IPv4 works and the default IPv6 does not, I’m not sure.

Thank you so much for the tip @mike-ravkine this would have driven me crazy all night.

This was likely due to a deploy at the time; the issue should now be resolved.

Thanks @jerome

Given that this was a serious outage that brought down our services and wasn’t noticed for over 12 hours by the Fly team, could we have some kind of redress for the harm done here?

Luckily we were able to spot the issue on our own, but the lack of timely recognition and response here is worrying.

This was likely due to a lot of unfortunate factors happening all at once:

  • It only affected apps with a shared IPv4 and a dedicated IPv6
  • It only affected users connecting over IPv6
  • Not many of us at Fly use an ISP that provides IPv6 connectivity
  • It happened at the beginning of a major holiday for many of us at Fly
  • The issue was in a week-old commit that hadn’t been deployed until then, making it non-obvious what had caused it. Still, it took too long before somebody could even look at it in a focused way.

None of it excuses the downtime here, but it didn’t help.

I was the one who wrote the faulty code and deployed the proxy at the time the outage started. We didn’t notice right away; we were fixing a different issue, which was resolved by the latest deployment.

This doesn’t usually happen. We have plans to add better monitoring for IPv6 so it has less chance of happening in the future.