Downtime on SSL?

I just got a notification from our uptime monitor that our prod app is down.

The initial error was about SSL connection: “SSL connect error”

This is what wget is saying:

Up again after 6 mins. 6 mins of downtime due to SSL errors :l

I do see that in our metrics. Taking a closer look.

Do you know at what time that was exactly? I assume it happened at the time you posted.

I think this happened because there’s only 1 instance of your web process and it was being deployed on a server that was struggling. I see we had to restart our scheduler on that host near the time this happened. You’ll probably want 2+ instances if you need high availability.

If our proxy can’t find any healthy instance of your app when it gets a new connection, it won’t know if it should try to handle a TLS handshake or not, so it tries to find a healthy instance a few times, in a loop, and then gives up if it can’t.
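To make that loop concrete, here is a minimal sketch of "look for a healthy instance a few times, then give up". It is not the actual proxy code: `pick_healthy_instance`, `list_instances`, the `healthy` field, the attempt count, and the delay are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Optional, Sequence


def pick_healthy_instance(
    list_instances: Callable[[], Sequence[dict]],  # hypothetical lookup of app instances
    attempts: int = 3,            # assumed retry count, not a real proxy setting
    delay_seconds: float = 0.1,   # assumed pause between passes
) -> Optional[dict]:
    """Return the first healthy instance, retrying a few times before giving up."""
    for attempt in range(attempts):
        for instance in list_instances():
            if instance.get("healthy"):
                return instance
        # No healthy instance on this pass; wait briefly before looking again.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    # Nothing healthy was found, so the caller has to fail the connection.
    return None
```

With a single instance that is mid-deploy, every pass comes back empty and the function returns `None`. With a second healthy instance in the list, the first pass already succeeds, which is the reason for running 2+ instances.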

We don’t create an incident when we restart the scheduler on a host. This happens frequently, and it’s generally a better outcome than leaving a struggling host untouched.

Unfortunately, that’s the reality until “apps v2” (based on Fly Machines) arrives, which will use our home-grown scheduler.

Still, this should happen less and less frequently as we’re improving other parts of our infrastructure every week.

Do you know at what time that was exactly? I assume it happened at the time you posted.

According to our uptime tracker (betteruptime), it started at 07:04pm CET and was marked as recovered at 07:10pm CET. I should add that betteruptime only checks once every 3 minutes, so it may have started a bit earlier than 07:04pm and resolved a bit earlier than 07:10pm CET.
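A rough sketch of how much the 3-minute check interval blurs that window. It assumes checks are evenly spaced and that the monitor flags the outage on the first failed check, which may not match how betteruptime actually confirms incidents. `outage_bounds` is a hypothetical helper, and the calendar date is a placeholder since only the times are known.

```python
from datetime import datetime, timedelta


def outage_bounds(first_failed: datetime, first_recovered: datetime,
                  interval: timedelta) -> tuple[timedelta, timedelta]:
    """Return (shortest, longest) possible outage given evenly spaced checks."""
    # Shortest: it broke right at the first failed check and was fixed
    # right after the last failing check (one interval before recovery).
    shortest = (first_recovered - interval) - first_failed
    # Longest: it broke right after the last passing check and was fixed
    # right before the first passing check.
    longest = first_recovered - (first_failed - interval)
    return max(shortest, timedelta(0)), longest


# Placeholder date; 19:04 and 19:10 are the reported times in CET.
down = datetime(2022, 1, 1, 19, 4)
up = datetime(2022, 1, 1, 19, 10)
shortest, longest = outage_bounds(down, up, timedelta(minutes=3))
print(shortest, longest)  # 0:03:00 0:09:00
```

So under those assumptions the real downtime was somewhere between roughly 3 and 9 minutes, not necessarily the 6 minutes reported.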

You’ll probably want 2+ instances if you need high availability.

Good to know.

We don’t create an incident when we restart the scheduler on a host. This happens frequently, and it’s generally a better outcome than leaving a struggling host untouched.

What do you mean by “create an incident”? And what is a “scheduler” in the context of Fly.io?

Thanks for your help, btw!

I meant a status.fly.io incident.

Our “scheduler” is Nomad. That’s what decides where to put app instances based on constraints. Unfortunately, it doesn’t work so well for us, which is why we’re switching to our own thing.
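As a toy illustration of what "deciding where to put app instances based on constraints" means in general, here is a sketch that filters hosts by two constraints and picks one. It is not how Nomad or Fly.io actually ranks hosts; `place_instance`, the host fields, and the "most free memory" tie-break are assumptions made for the example.

```python
from typing import Optional


def place_instance(hosts: list[dict], required_region: str,
                   memory_mb: int) -> Optional[str]:
    """Pick a host that satisfies the constraints, preferring the most free memory."""
    # Step 1: keep only hosts that satisfy every constraint.
    feasible = [
        h for h in hosts
        if h["region"] == required_region and h["free_memory_mb"] >= memory_mb
    ]
    if not feasible:
        return None  # Nowhere to place the instance right now.
    # Step 2: rank the feasible hosts (hypothetical rule: most free memory wins).
    return max(feasible, key=lambda h: h["free_memory_mb"])["name"]


hosts = [
    {"name": "host-a", "region": "ams", "free_memory_mb": 512},
    {"name": "host-b", "region": "ams", "free_memory_mb": 2048},
    {"name": "host-c", "region": "fra", "free_memory_mb": 4096},
]
print(place_instance(hosts, "ams", 1024))  # host-b
```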
