App suddenly getting 599s

One of my apps (nikola-sharder) is suddenly seeing major delays and frequent 599s, with no changes on my end. I’ve just tried re-deploying, but the issue remains. Any ideas what might be up here?

I’m going to try removing and re-adding regions. Thanks!

Update: it looks like the requests fail when they come from my DigitalOcean servers, but not from my local network.

We don’t have anything that would generate a 599. Are you running a proxy in front of your Fly app? We had some issues with our routing layer that might be the cause of this: https://status.flyio.net/

Hey, sorry, I forgot you don’t give 599s. My Tornado health checker is reporting 599s, but I think the real issue is timeouts: the 599s are synthetic, but the timeouts are real. updown.io reports intermittent TLS timeouts.
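
For anyone following along: Tornado’s HTTP client reports client-side failures, including timeouts, with a synthetic status code 599, so my checker logs a 599 even though no server ever sent one. A stdlib-only sketch of that convention (the `check` helper here is illustrative, not Tornado’s actual API):

```python
import urllib.error
import urllib.request

def check(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of `url`, or a synthetic 599 when no
    response arrives (mirroring Tornado's client-side error code)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # a real status sent by the server (4xx/5xx)
    except (TimeoutError, OSError):
        return 599        # timeout or connection failure: synthetic code
```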

No proxy in front of my Fly app. By the way, is there a way to explicitly route a request to each region so I can narrow down the issue?

You can now set a fly-prefer-region header to target a specific region.

Okay, so the problem is persisting after removing and re-adding regions, redeploying, and scaling up and down. Here’s a curl that often fails with a TLS timeout. I’m going to try checking region by region next. Any ideas? Thanks!

curl -gkvL -b '' -H 'User-Agent: updown.io daemon 2.6' -H 'Accept: */*' -H 'Connection: Close' -H 'Accept-Language: en' -m 30 --connect-timeout 10 https://nikola-sharder.nikolaapp.com/shard_me\?identifier\=david\%2B3@nikolaapp.com

Okay so I tried preferring regions and the TLS issue appears intermittently in all regions I tried. This is the curl I used.

curl -gkvL -b '' -H 'fly-prefer-region: sjc' -H 'User-Agent: updown.io daemon 2.6' -H 'Accept: */*' -H 'Connection: Close' -H 'Accept-Language: en' -m 30 --connect-timeout 10 https://nikola-sharder.nikolaapp.com/shard_me\?identifier\=david\%2B3@nikolaapp.com

One limitation of fly-prefer-region is that TLS termination still happens closest to you. Intermittent TLS errors are strange!

Your app seems like it’s not currently running. Did you shut it off?

Yes, I just suspended it to try the old “turn it off and turn it on again” trick. I’m happy to report that resuming it may have “fixed” the issue.

Though I’m not sure yet, and I’m still spooked, as this is a key service for my app.

Ah, yes, that makes sense. I was wondering how that would work with anycast.

by the way, I’m super grateful for you being responsive to this issue late at night. Really appreciate it.

I flushed the cached certificates for your app around the same time you restarted. Odds are, my flush fixed it. We might have had lingering cache issues for your cert from the router issues earlier today.

We’ll keep an eye on it.

Thank you Kurt!

This morning, I noticed there were still some issues with TLS handshakes in certain regions. I did a thing (restarted our proxy) that appears to have fixed it.

Thanks, Jerome. Unfortunately, it looks like I’m still seeing some issues even after that. Any thoughts?

We’ve been monitoring and are currently investigating the issue.

great! thank you.

You’re primarily connecting to this hostname with a Python client, right? Something is putting only this one hostname into a weird state.

I have health checks running from Python and via whatever updown.io uses. My actual clients hitting this are on iOS.

We’re still on this; it’s very strange, but I think we’re making progress! You should see far fewer check failures now (though you’ll still see them intermittently).

@david I believe this is finally fixed. It took a while to identify the cause.

There was a bug causing resource exhaustion when handling too many concurrent TLS handshakes that didn’t finish within 30s.

The fix has been out for just under two hours, and we haven’t seen TLS errors for your app during that time. Previously, they would usually happen within 30 minutes.

Do you have a lot of users who might take a long time to complete a TLS handshake, or who start one and then abort? That might be an iOS thing.

In any case, this should be good now.