Elevated error rates

hello Fly friends,

I’m seeing highly elevated error rates across most of (all of?) my fly deployments. Started a few hours ago. For example, at this URL I’m getting some 599s. https://nikola-sharder.nikolaapp.com/shard_me?identifier=david%2B3@nikolaapp.com

The above is a tornado deployment. I’m also seeing it for other services, including some that have a simple nginx setup. Is there something going on? Thank you

David

Update: when I tried to update the above service, I got the following:
==> Pushing Image

The push refers to repository [registry.fly.io/nikola-sharder]
Error Get https://registry.fly.io/v2/: net/http: TLS handshake timeout

Oh, that’s odd.

@david could you provide a traceroute and traceoute6 (ipv6) to registry.fly.io?

599s? We’re only serving 502 and 503 status codes. I don’t see any 599 for your app within the last 6 hours.

Sure! So here’s some other interesting datapoints. Updown.io is showing no problems but when I try to ping servers from my local machine and from other machines out in the wild (mostly digital ocean) I often get timeouts.

The push refers to repository [registry.fly.io/nikola-sharder]

% traceroute registry.fly.io
traceroute to registry.fly.io (77.83.143.220), 64 hops max, 52 byte packets
1 10.0.0.1 (10.0.0.1) 1.491 ms 1.215 ms 1.006 ms
2 192.168.99.1 (192.168.99.1) 1.260 ms 1.256 ms 1.167 ms
3 148-64-111-65.public.monkeybrains.net (148.64.111.65) 3.077 ms 3.061 ms 3.303 ms
4 172.17.19.170 (172.17.19.170) 3.296 ms 3.334 ms 3.187 ms
5 172.17.18.50 (172.17.18.50) 2.303 ms 1.749 ms 1.683 ms
6 172.17.22.244 (172.17.22.244) 1.659 ms 2.490 ms 1.553 ms
7 208.52.0.73 (208.52.0.73) 1.908 ms 2.646 ms 1.926 ms
8 192.175.30.252 (192.175.30.252) 2.405 ms 2.695 ms 2.323 ms
9 192.175.29.226 (192.175.29.226) 3.132 ms 3.334 ms 3.185 ms
10 be13.cr2-55smarket.bb.as11404.net (192.175.30.220) 5.641 ms 5.044 ms 4.646 ms
11 be11.cr3-11greatoaks.bb.as11404.net (192.175.30.38) 5.131 ms 5.423 ms 5.087 ms
12 cr1-9greatoaks-be3.bb.as11404.net (192.175.30.214) 5.058 ms 4.987 ms 5.120 ms
13 * * *
14 * * *
15 * * *
16 * *

Don’t have traceroute6 installed. Will look at installing it.

Thanks!

You might be able to do traceroute -6 instead.

---- This is from Digital Ocean
$ traceroute -6 registry.fly.io

traceroute to registry.fly.io (2a09:8280:1:f28:246e:d6a:949:dbbf), 30 hops max, 80 byte packets

connect: Network is unreachable

$ traceroute registry.fly.io
traceroute to registry.fly.io (77.83.143.220), 30 hops max, 60 byte packets
1 * * *
2 10.88.2.65 (10.88.2.65) 0.686 ms 0.652 ms 10.88.2.47 (10.88.2.47) 0.628 ms
3 138.197.248.100 (138.197.248.100) 0.777 ms 0.768 ms 138.197.248.96 (138.197.248.96) 0.712 ms
4 138.197.246.9 (138.197.246.9) 1.623 ms 1.616 ms 138.197.246.5 (138.197.246.5) 1.797 ms
5 * * *
6 * * *


Looks like traceroute -6 doesn’t work on my mac traceroute

Hmm, I can’t quite tell which region that’s hitting. Can you provide the results of curl -I http://registry.fly.io -H "flyio-debug: doit" from wherever it’s failing?

Thanks so much btw. Here you go from my local machine where requests don’t always fail but sometimes take a while. Also it might be better to look up my “proxy-sea” service instead because that’s just nginx, so there are fewer confounding variables, like my tornado instance. While I don’t believer it do be the case, nikola-sharder could have some tornado bug causing a stall.

% curl -I http://registry.fly.io -H “flyio-debug: doit”

HTTP/1.1 307 Temporary Redirect

server: Fly/dd3da43 (2020-11-13)

content-type: text/html; charset=utf-8

location: https://fly.io

date: Sat, 14 Nov 2020 20:22:13 GMT

via: 1.1 vegur, 1.1 fly.io

content-length: 0

flyio-debug: {“bn”:“worker-pkt-ny5-429d”,“n”:“edge-nac-sjc1-e241”,“nr”:“sjc”,“nrtt”:70,“ra”:“148.64.111.68”,“sdc”:“ny5”,“sid”:“e4d80d2c”,“sr”:“ewr”,“st”:0,“tid”:“09542b10-9846-43ef-8a3d-891ebf43846d”}

For the sake of transparency, just wanted to report I’m also seeing some failures from digital ocean to digital ocean. It’s pretty hard for me to explain all these happening at once.

And this is from digital ocean. $ curl -I http://registry.fly.io -H “flyio-debug: doit”

HTTP/1.1 307 Temporary Redirect

server: Fly/dd3da43 (2020-11-13)

content-type: text/html; charset=utf-8

location: https://fly.io

date: Sat, 14 Nov 2020 20:28:54 GMT

via: 1.1 vegur, 1.1 fly.io

content-length: 0

flyio-debug: {“bn”:“worker-pkt-ny5-429d”,“n”:“edge-nac-sjc1-e241”,“nr”:“sjc”,“nrtt”:70,“ra”:“161.35.239.211”,“sdc”:“ny5”,“sid”:“e4d80d2c”,“sr”:“ewr”,“st”:0,“tid”:“c16a8277-97df-4bc1-89d9-4b5142a3960a”}

@david Thanks! Looks like these server were under high load. We’ve relieved the pressure a bit. We’ll be working on a more permanent fix.

Should be better for now.

Okay, thank you. And to clarify, was there an issue with my servers where I should consider adding more instances or something or is it about some intermediate fly servers that are under your control? Thanks!

This one’s on us! We need to add a bit more capacity in California to serve all the traffic ti gets.

1 Like

Thank you @jerome!

How it started:

How it’s going:
CEF42B54-FB5E-4CF1-9F09-944E0CCC9243-1570-00017DF43DEA204C

2 Likes