slow requests during deploys

Hello! I’ve been having fun playing with fly.io – thanks for building it! I noticed something a bit strange during deploys – right after a deploy, sometimes requests will take ~10 seconds to complete for a period of ~30 seconds until things go back to normal.

Here’s the output of curl -v https://deploy-test.fly.dev | ts for one of the slow requests. It looks like the TLS handshake finishes right away, but then it takes ~10 seconds from when the TLS handshake finishes to when the body of the response gets sent.

Feb 03 19:35:29 * TCP_NODELAY set
Feb 03 19:35:29 * Connected to deploy-test.fly.dev (213.188.195.217) port 443 (#0)
Feb 03 19:35:29 * ALPN, offering h2
Feb 03 19:35:29 * ALPN, offering http/1.1
Feb 03 19:35:29 * successfully set certificate verify locations:
Feb 03 19:35:29 *   CAfile: /etc/ssl/certs/ca-certificates.crt
Feb 03 19:35:29   CApath: /etc/ssl/certs
Feb 03 19:35:29 } [5 bytes data]
Feb 03 19:35:29 * TLSv1.3 (OUT), TLS handshake, Client hello (1):
Feb 03 19:35:29 } [512 bytes data]
Feb 03 19:35:29 * TLSv1.3 (IN), TLS handshake, Server hello (2):
Feb 03 19:35:29 { [122 bytes data]
Feb 03 19:35:29 * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
Feb 03 19:35:29 { [19 bytes data]
Feb 03 19:35:29 * TLSv1.3 (IN), TLS handshake, Certificate (11):
Feb 03 19:35:29 { [2252 bytes data]
Feb 03 19:35:29 * TLSv1.3 (IN), TLS handshake, CERT verify (15):
Feb 03 19:35:29 { [79 bytes data]
Feb 03 19:35:29 * TLSv1.3 (IN), TLS handshake, Finished (20):
Feb 03 19:35:29 { [52 bytes data]
Feb 03 19:35:29 * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
Feb 03 19:35:29 } [1 bytes data]
Feb 03 19:35:29 * TLSv1.3 (OUT), TLS handshake, Finished (20):
Feb 03 19:35:29 } [52 bytes data]
Feb 03 19:35:29 * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
Feb 03 19:35:29 * ALPN, server accepted to use h2
Feb 03 19:35:29 * Server certificate:
Feb 03 19:35:29 *  subject: CN=*.fly.dev
Feb 03 19:35:29 *  start date: Dec 13 23:51:29 2020 GMT
Feb 03 19:35:29 *  expire date: Mar 13 23:51:29 2021 GMT
Feb 03 19:35:29 *  issuer: C=US; O=Let's Encrypt; CN=R3
Feb 03 19:35:29 *  SSL certificate verify ok.
Feb 03 19:35:29 * Using HTTP2, server supports multi-use
Feb 03 19:35:29 * Connection state changed (HTTP/2 confirmed)
Feb 03 19:35:29 * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
Feb 03 19:35:29 } [5 bytes data]
Feb 03 19:35:29 * Using Stream ID: 1 (easy handle 0x56137c1eedf0)
Feb 03 19:35:29 } [5 bytes data]
Feb 03 19:35:29 > GET / HTTP/2
Feb 03 19:35:29 > Host: deploy-test.fly.dev
Feb 03 19:35:29 > user-agent: curl/7.68.0
Feb 03 19:35:29 > accept: */*
Feb 03 19:35:29 > 
Feb 03 19:35:29 { [5 bytes data]
Feb 03 19:35:29 * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
Feb 03 19:35:29 { [81 bytes data]
Feb 03 19:35:29 * Connection state changed (MAX_CONCURRENT_STREAMS == 4294967295)!
Feb 03 19:35:29 } [5 bytes data]
Feb 03 19:35:41 < HTTP/2 200 
Feb 03 19:35:41 < server: Fly/7226d36 (2021-01-25)
Feb 03 19:35:41 < date: Thu, 04 Feb 2021 00:35:41 GMT
Feb 03 19:35:41 < content-type: text/html
Feb 03 19:35:41 < content-length: 5
Feb 03 19:35:41 < last-modified: Thu, 04 Feb 2021 00:21:48 GMT
Feb 03 19:35:41 < via: 2 fly.io
Feb 03 19:35:41 < 
Feb 03 19:35:41 
Feb 03 19:35:41 
Feb 03 19:35:41 hola

I also stared at the timestamps in the server logs a bunch and I think the rough timeline is:

  • 19:35:29 - make request with curl
  • 19:35:41 - VM receives request
  • 19:35:41 - client receives response

Here’s a gist with all the configuration for my test app. My best guess was that I hadn’t configured it to healthcheck frequently enough, but I changed my healthcheck interval to 1s and got the same results.

2 Likes

perhaps there’s an issue, because I my deploy is stuck at Monitoring Deployment

image

This is likely our slow service discovery (also related to 502s). When we get a connection to your app, it’ll try the VMs our edge node knows about. During a deploy, it “knows about” the soon-to-be-gone VMs for longer than it should, so I think what’s happening is that it tries the request, the request fails, and then it retries on the newly created VM.

@jerome is working on this, it should be better.

@austincollinpena your hung deploy is a different problem, the host it’s assigned to is taking forever. Seattle is popular today. :slight_smile:

1 Like

Thanks for the update!

1 Like

Thanks so much for the super quick reply!

1 Like

@austincollinpena we have an alert for hosts when they get wedged like that now, so we should know before you do if something like that happens again.

@julia someday our service discovery will be faster than our community reply time.

1 Like

Thanks for sharing that. Scaling problems seem like good ones to have :slight_smile:

1 Like

@julia this should now be in a much better state. We’ve rolled out speedups for replicating our state. I’ve measured a definite improvement (services seem reachable within 1s of them becoming healthy).

thank you! my end-to-end time is looking really good now – from when I start the VM to when it’s possible to SSH into it seems to be about 12 seconds now, which is about what I was hoping for!

This is still happening for me

How many instances do you have running?

2

This is likely because our internal services have been slow to propagate. We’re working on the problem! If you run 4+ instances you probably won’t have this issue (just because it takes long enough to pull old ones down), but it would be silly to spend money for that I think.