App suddenly getting 599s

One of my apps (nikola-sharder) is suddenly seeing major delays and frequent 599s, with no changes on my end. I’ve just tried re-deploying, but the issue remains. Any ideas what might be up here?

I’m going to try removing and re-adding regions. Thanks!

Update: it looks like the requests fail when they come from my DigitalOcean servers, but not from my local network.

We don’t have anything that would generate a 599. Are you running a proxy in front of your Fly app? We had some issues with our routing layer that might be the cause of this: https://status.flyio.net/

Hey, sorry, I forgot you don’t give 599s. My Tornado health checker is reporting 599s, but I think the real issue is timeouts: the 599s are synthetic, but the timeouts are real. updown.io reports intermittent TLS timeouts.
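
For anyone following along: Tornado’s HTTP client reports client-side failures, including timeouts, with a synthetic status code 599, so my checker logs a 599 even though no server ever sent one. A stdlib-only sketch of that convention (the `check` helper here is illustrative, not Tornado’s actual API):

```python
import urllib.error
import urllib.request

def check(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of `url`, or a synthetic 599 when no
    response arrives (mirroring Tornado's client-side error code)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # a real status sent by the server (4xx/5xx)
    except (TimeoutError, OSError):
        return 599        # timeout or connection failure: synthetic code
```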

No proxy in front of my Fly app. By the way, is there a way to explicitly route a request to each region so I can narrow down the issue?

You can now set a fly-prefer-region header to target a specific region.

Okay, so the problem is persisting after removing and re-adding regions, redeploying, and scaling up and down. Here’s a curl that often fails with a TLS timeout. I’m going to try checking region by region next. Any ideas? Thanks!

curl -gkvL -b '' -H 'User-Agent: updown.io daemon 2.6' -H 'Accept: */*' -H 'Connection: Close' -H 'Accept-Language: en' -m 30 --connect-timeout 10 https://nikola-sharder.nikolaapp.com/shard_me\?identifier\=david\%2B3@nikolaapp.com

Okay so I tried preferring regions and the TLS issue appears intermittently in all regions I tried. This is the curl I used.

curl -gkvL -b '' -H 'fly-prefer-region: sjc' -H 'User-Agent: updown.io daemon 2.6' -H 'Accept: */*' -H 'Connection: Close' -H 'Accept-Language: en' -m 30 --connect-timeout 10 https://nikola-sharder.nikolaapp.com/shard_me\?identifier\=david\%2B3@nikolaapp.com

One limitation of fly-prefer-region is that TLS termination still happens closest to you. Intermittent TLS errors are strange!

Your app seems like it’s not currently running. Did you shut it off?

Yes, I just suspended it to try the old “turn it off and turn it on again” trick. I’m happy to report that resuming it may have “fixed” the issue.

Though I’m not sure yet, and I’m still spooked, as this is a key service for my app.

Ah, yes, that makes sense. I was wondering how that would work with anycast.

by the way, I’m super grateful for you being responsive to this issue late at night. Really appreciate it.

I flushed the cached certificates for your app around the same time you restarted. Odds are, my flush fixed it. We might have had lingering cache issues for your cert from the router issues earlier today.

We’ll keep an eye on it.

Thank you Kurt!

This morning, I noticed there were still some issues with TLS handshakes in certain regions. I did a thing (restarted our proxy) that appears to have fixed it.

Thanks, Jerome. Unfortunately, it looks like I’m still seeing some issues even after that. Any thoughts?

We’ve been monitoring and are currently investigating the issue.

great! thank you.

You’re primarily connecting to this hostname with a Python client, right? Something is putting only this one hostname into a weird state.

I have health checks running from Python and via whatever updown.io uses. My actual clients hitting this are on iOS.

We’re still on this; it’s very strange, but I think we’re making progress! You should see far fewer check failures now (though you’ll still see them intermittently).

@david I believe this is finally fixed. It took a while to identify the cause.

There was a bug causing resource exhaustion when handling too many concurrent TLS handshakes that didn’t finish within 30s.

The fix has been out for just under two hours, and we haven’t seen TLS errors for your app during that time. Previously, they would usually happen within 30 minutes.

Do you have a lot of users who might take a long time to complete a TLS handshake, or who start one and then abort? That might be an iOS thing.

In any case, this should be good now.