Elevated error rates

Hello Fly friends,

I’m seeing highly elevated error rates across most of (all of?) my fly deployments. Started a few hours ago. For example, at this URL I’m getting some 599s. https://nikola-sharder.nikolaapp.com/shard_me?identifier=david%2B3@nikolaapp.com

The above is a tornado deployment. I’m also seeing it for other services, including some that have a simple nginx setup. Is there something going on? Thank you

David

Update: when I tried to update the above service, I got the following:
==> Pushing Image

The push refers to repository [registry.fly.io/nikola-sharder]
Error Get https://registry.fly.io/v2/: net/http: TLS handshake timeout

Oh, that’s odd.

@david could you provide a traceroute and traceroute6 (IPv6) to registry.fly.io?

599s? We’re only serving 502 and 503 status codes. I don’t see any 599 for your app within the last 6 hours.

Sure! Here are some other interesting data points. Updown.io is showing no problems, but when I ping servers from my local machine and from other machines out in the wild (mostly DigitalOcean), I often get timeouts.

% traceroute registry.fly.io
traceroute to registry.fly.io (77.83.143.220), 64 hops max, 52 byte packets
1 10.0.0.1 (10.0.0.1) 1.491 ms 1.215 ms 1.006 ms
2 192.168.99.1 (192.168.99.1) 1.260 ms 1.256 ms 1.167 ms
3 148-64-111-65.public.monkeybrains.net (148.64.111.65) 3.077 ms 3.061 ms 3.303 ms
4 172.17.19.170 (172.17.19.170) 3.296 ms 3.334 ms 3.187 ms
5 172.17.18.50 (172.17.18.50) 2.303 ms 1.749 ms 1.683 ms
6 172.17.22.244 (172.17.22.244) 1.659 ms 2.490 ms 1.553 ms
7 208.52.0.73 (208.52.0.73) 1.908 ms 2.646 ms 1.926 ms
8 192.175.30.252 (192.175.30.252) 2.405 ms 2.695 ms 2.323 ms
9 192.175.29.226 (192.175.29.226) 3.132 ms 3.334 ms 3.185 ms
10 be13.cr2-55smarket.bb.as11404.net (192.175.30.220) 5.641 ms 5.044 ms 4.646 ms
11 be11.cr3-11greatoaks.bb.as11404.net (192.175.30.38) 5.131 ms 5.423 ms 5.087 ms
12 cr1-9greatoaks-be3.bb.as11404.net (192.175.30.214) 5.058 ms 4.987 ms 5.120 ms
13 * * *
14 * * *
15 * * *
16 * *

Don’t have traceroute6 installed. Will look at installing it.

Thanks!

You might be able to do traceroute -6 instead.

---- This is from Digital Ocean
$ traceroute -6 registry.fly.io

traceroute to registry.fly.io (2a09:8280:1:f28:246e:d6a:949:dbbf), 30 hops max, 80 byte packets

connect: Network is unreachable

$ traceroute registry.fly.io
traceroute to registry.fly.io (77.83.143.220), 30 hops max, 60 byte packets
1 * * *
2 10.88.2.65 (10.88.2.65) 0.686 ms 0.652 ms 10.88.2.47 (10.88.2.47) 0.628 ms
3 138.197.248.100 (138.197.248.100) 0.777 ms 0.768 ms 138.197.248.96 (138.197.248.96) 0.712 ms
4 138.197.246.9 (138.197.246.9) 1.623 ms 1.616 ms 138.197.246.5 (138.197.246.5) 1.797 ms
5 * * *
6 * * *


Looks like traceroute -6 doesn’t work on my mac traceroute

Hmm, I can’t quite tell which region that’s hitting. Can you provide the results of curl -I http://registry.fly.io -H "flyio-debug: doit" from wherever it’s failing?

Thanks so much btw. Here you go, from my local machine, where requests don’t always fail but sometimes take a while. Also, it might be better to look up my “proxy-sea” service instead, because that’s just nginx, so there are fewer confounding variables than with my tornado instance. While I don’t believe it to be the case, nikola-sharder could have some tornado bug causing a stall.

% curl -I http://registry.fly.io -H "flyio-debug: doit"
HTTP/1.1 307 Temporary Redirect
server: Fly/dd3da43 (2020-11-13)
content-type: text/html; charset=utf-8
location: https://fly.io
date: Sat, 14 Nov 2020 20:22:13 GMT
via: 1.1 vegur, 1.1 fly.io
content-length: 0
flyio-debug: {"bn":"worker-pkt-ny5-429d","n":"edge-nac-sjc1-e241","nr":"sjc","nrtt":70,"ra":"148.64.111.68","sdc":"ny5","sid":"e4d80d2c","sr":"ewr","st":0,"tid":"09542b10-9846-43ef-8a3d-891ebf43846d"}

For the sake of transparency, I just wanted to report that I’m also seeing some failures from DigitalOcean to DigitalOcean. It’s pretty hard for me to explain all of these happening at once.

And this is from DigitalOcean:

$ curl -I http://registry.fly.io -H "flyio-debug: doit"
HTTP/1.1 307 Temporary Redirect
server: Fly/dd3da43 (2020-11-13)
content-type: text/html; charset=utf-8
location: https://fly.io
date: Sat, 14 Nov 2020 20:28:54 GMT
via: 1.1 vegur, 1.1 fly.io
content-length: 0
flyio-debug: {"bn":"worker-pkt-ny5-429d","n":"edge-nac-sjc1-e241","nr":"sjc","nrtt":70,"ra":"161.35.239.211","sdc":"ny5","sid":"e4d80d2c","sr":"ewr","st":0,"tid":"c16a8277-97df-4bc1-89d9-4b5142a3960a"}
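For anyone comparing several of these responses, here is a minimal Python sketch that parses a flyio-debug header value and pulls out a few keys. The key names are copied from the output in this thread; their meanings are not documented here, so the sketch only extracts them and does not interpret them. The function names are hypothetical.

```python
import json

# Forum software often converts straight quotes to curly ones, which
# breaks JSON parsing, so map them back before decoding.
_CURLY_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def parse_flyio_debug(value):
    """Decode a flyio-debug header value (a JSON object) into a dict.

    Assumption: no field value itself contains a curly quote.
    """
    return json.loads(value.translate(_CURLY_QUOTES))


def summarize(debug, keys=("n", "nr", "sr", "sdc", "nrtt")):
    """Return only the selected keys; keys that are absent map to None."""
    return {key: debug.get(key) for key in keys}
```

Running `summarize(parse_flyio_debug(header_value))` on the values from two machines makes it easy to see whether both requests report the same `n`, `nr` and `sr` fields.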

@david Thanks! Looks like these servers were under high load. We’ve relieved the pressure a bit. We’ll be working on a more permanent fix.

Should be better for now.

Okay, thank you. And to clarify, was there an issue with my servers where I should consider adding more instances or something or is it about some intermediate fly servers that are under your control? Thanks!

This one’s on us! We need to add a bit more capacity in California to serve all the traffic it gets.

Thank you @jerome!

How it started:

How it’s going:
[image]

Hi guys,

I am seeing similar errors. Here are the traceroutes …

$ traceroute -4 api.fly.io

traceroute to api.fly.io (77.83.143.220), 30 hops max, 60 byte packets
1 _gateway (192.168.49.1) 0.489 ms 0.672 ms 0.476 ms
2 192.168.177.1 (192.168.177.1) 1.231 ms 3.588 ms 3.574 ms
3 62.156.244.3 (62.156.244.3) 19.829 ms 16.389 ms 20.731 ms
4 62.156.245.198 (62.156.245.198) 22.283 ms 23.975 ms 20.684 ms
5 pd900c672.dip0.t-ipconnect.de (217.0.198.114) 31.601 ms f-ed12-i.F.DE.NET.DTAG.DE (217.5.67.166) 31.037 ms pd900c61e.dip0.t-ipconnect.de (217.0.198.30) 27.837 ms
6 62.157.249.186 (62.157.249.186) 32.883 ms 35.700 ms 36.899 ms
7 ae-2.r21.frnkge13.de.bb.gin.ntt.net (129.250.6.41) 33.200 ms ae-2.r20.frnkge13.de.bb.gin.ntt.net (129.250.6.13) 22.179 ms 24.196 ms
8 ae-2.a00.frnkge13.de.bb.gin.ntt.net (129.250.4.81) 25.932 ms ae-0.a00.frnkge13.de.bb.gin.ntt.net (129.250.2.25) 25.442 ms 25.418 ms
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *

$ traceroute -6 api.fly.io

traceroute to api.fly.io (2a09:8280:1:f28:246e:d6a:949:dbbf), 30 hops max, 80 byte packets
1 2003:a:1344:24fc:: (2003:a:1344:24fc::) 0.409 ms 0.663 ms 0.447 ms
2 2003:a:1344:2400:e228:6dff:fe6b:db2a (2003:a:1344:2400:e228:6dff:fe6b:db2a) 1.959 ms 3.004 ms 2.808 ms
3 2003:0:1406:6419::1 (2003:0:1406:6419::1) 18.567 ms 18.555 ms 19.113 ms
4 2003:0:1406:2410::2 (2003:0:1406:2410::2) 19.816 ms 20.031 ms 20.743 ms
5 e0-51.switch2.fra2.he.net (2001:470:0:5f6::1) 27.225 ms 26.358 ms 27.413 ms
6 e0-34.core2.ams2.he.net (2001:470:0:4b7::2) 34.424 ms 27.266 ms *
7 100ge2-1.core1.ams1.he.net (2001:470:0:489::1) 41.138 ms 41.125 ms 41.112 ms
8 amsix.as36236.net (2001:7f8:1::a503:6236:1) 35.961 ms 32.552 ms 41.109 ms
9 2607:f740:d:10::4 (2607:f740:d:10::4) 46.596 ms 46.584 ms 36.058 ms
10 2607:f740:d:16::2 (2607:f740:d:16::2) 35.256 ms 35.479 ms 35.467 ms
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 …

Are there any problems with the API servers?

What kind of errors are you seeing? We haven’t noticed an increase in error rates today.

Also, when you visit https://debug.fly.dev, what value do you see for the Fly-Region header?

Since I forgot to write the exact error message, here it is:

$ flyctl auth login --email EMAIL
Error Post "https://api.fly.io/api/v1/sessions": net/http: TLS handshake timeout

This error also appeared when I tried to install flyctl on a different machine (cURL from the install script could not connect to fly.io).
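To check whether the slowness is in connection setup rather than in flyctl itself, a rough Python sketch like the one below can time a single TCP connect plus TLS handshake. It uses only the standard library; the function name and the default timeout are assumptions, and the measured time includes the DNS lookup and TCP connect, not just the handshake.

```python
import socket
import ssl
import time


def time_tls_handshake(host, port=443, timeout=10.0):
    """Return (ok, seconds, error) for one TCP connect plus TLS handshake."""
    context = ssl.create_default_context()
    start = time.monotonic()
    try:
        # create_connection resolves the name and opens the TCP connection.
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the TLS handshake before it returns.
            with context.wrap_socket(raw, server_hostname=host):
                pass
    except OSError as exc:
        # Timeouts, refused connections and TLS errors are all OSError subclasses.
        return False, time.monotonic() - start, exc
    return True, time.monotonic() - start, None
```

Calling `time_tls_handshake("api.fly.io")` a handful of times in a row should show whether failures are intermittent, which is what the "worked after the third try" behaviour suggests.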

Here is the output from https://debug.fly.dev (I used cURL, and it only worked on the third try).

=== Headers ===
Host: debug.fly.dev
Accept: */*
Fly-Client-Ip: 2003:a:1344:24fc:215:5dff:feb1:c80e
X-Forwarded-For: 2003:a:1344:24fc:215:5dff:feb1:c80e, 2a09:8280:1:763f:8bdd:34d1:c624:78cd
X-Forwarded-Ssl: on
Fly-Region: ams
Via: 2 fly.io
X-Request-Start: t=1626812776885238
Fly-Forwarded-Proto: https
Fly-Forwarded-Ssl: on
X-Forwarded-Port: 443
Fly-Dispatch-Start: t=1626812776885474;instance=21fd33ce
User-Agent: curl/7.77.0
Fly-Forwarded-Port: 443
X-Forwarded-Proto: https
Fly-Request-Id: 01FB2SJ0DNY97YXY933T0Y0WXH

=== ENV ===
FLY_ALLOC_ID=21fd33ce-ad2a-63d5-fbc8-21471ec191df
FLY_APP_NAME=debug
FLY_PUBLIC_IP=2607:f740:d:27:0:21fd:33ce:1
FLY_REGION=ams
FLY_VM_MEMORY_MB=128
HOME=/root
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
TERM=linux
WS=this
is
a
test
cgroup_enable=memory

2021-07-20 20:26:16.886478836 +0000 UTC m=+38419.142037694
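For anyone repeating this check from a script, here is a minimal Python sketch that retries the request a few times and then reads the Fly-Region value out of a dump in the format shown above. Both function names are hypothetical, and the parser assumes the response keeps the "=== Headers ===" layout seen in this post.

```python
import time
import urllib.request


def fetch_with_retries(url, attempts=3, timeout=10.0, delay=1.0):
    """Fetch a URL as text, retrying on network errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay)
    raise last_error


def parse_debug_headers(body):
    """Return the 'Name: value' pairs from the '=== Headers ===' section."""
    headers = {}
    in_headers = False
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith("==="):
            # A new section marker starts; only the Headers section is parsed.
            in_headers = stripped == "=== Headers ==="
            continue
        if in_headers and ": " in stripped:
            name, value = stripped.split(": ", 1)
            headers[name.lower()] = value
    return headers
```

With these, `parse_debug_headers(fetch_with_retries("https://debug.fly.dev")).get("fly-region")` would return the region string, such as `ams` in the output above.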

Oh wow, now the flyctl login worked …

I see a few TLS handshake EOF errors to the API through Amsterdam. This looks like a network issue between you and AMS. If it has cleared up, we probably can’t figure out why, but if it happens again, post here and we can do some more digging.