Slow DNS lookup to custom domain

Hi!

I have a custom DNS set up at one.com. I have added CNAME and AAAA records for the custom domain, but sometimes the lookup is very slow. It can take 10-20 seconds for the domain to be found.

Do I need to do some more configuration?

The CNAME points to the Fly _acme-challenge URL and the AAAA record is configured as the docs suggest.

It is a remix.run app that uses the remix.run serve adapter.

Regards
Mathias Nilsson

After checking, it’s not the DNS lookup that takes time, it’s SSL and the initial connection.
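For anyone who wants to reproduce that split outside the browser's network tab, here is a minimal sketch that times the DNS lookup, the TCP connect and the TLS handshake separately. The function names `time_connection_phases` and `slowest_phase` are made up for this example, and it only uses the first address that the resolver returns.

```python
import socket
import ssl
import time


def time_connection_phases(host, port=443, timeout=30.0):
    """Return seconds spent in DNS lookup, TCP connect and TLS handshake."""
    t0 = time.perf_counter()
    # Resolve first so DNS time is not hidden inside connect().
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
        t2 = time.perf_counter()
        context = ssl.create_default_context()
        # wrap_socket performs the handshake before returning;
        # server_hostname sets SNI so the right certificate is served.
        with context.wrap_socket(sock, server_hostname=host):
            t3 = time.perf_counter()
    finally:
        sock.close()
    return {"dns": t1 - t0, "tcp": t2 - t1, "tls": t3 - t2}


def slowest_phase(timings):
    """Name of the phase that took the longest."""
    return max(timings, key=timings.get)
```

Calling `time_connection_phases("your-custom-domain.example")` after the app has been idle for a while, and again right afterwards, should show whether the delay sits in `dns`, `tcp` or `tls`.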

Strange, it should be milliseconds, rather than seconds.

Is this the very first time you connect to the app after a deploy (so within, say, 60 seconds of deploying)? Or does it persist and is still an issue once several minutes have passed?

I ask because there is a known issue (which is being fixed) where, after a deploy, it can take a minute or so for the new VM/instance to be found by the proxy, and that can result in large delays in requests being responded to.

If so, that can be mitigated by having a minimum of three VMs and using a rolling deploy strategy to give them time to be recognised.

Thanks for the fast response.

If I deploy a new Docker image it takes a long time. But this happens after I have hit refresh a few times and let the app rest for a while. The first request after a deploy takes up to 1 minute, as you explained.

This time it took 5 seconds, but it generally takes longer.

Ah …

It depends what you mean by letting it rest “a while”. Under or over 60 seconds?

What currently happens (if you only have one VM/instance) is that when you deploy, it can take about a minute before the app is fully ready to go (propagated, picked up by the proxy, routing, etc.). That applies after each deploy, as each deploy results in a new VM/instance.

But after a minute (or so) your app should then be ready. So delays of 5s+ beyond then are not normal and point to an issue.

So the question is are you getting 5 second response times still, like an hour after deploying?

If so, one thing to check regarding your general routing to Fly is to visit https://debug.fly.dev/ in your browser. What do you see for the Fly-Region header in the response it shows? It should be one geographically close to you. For example, I’m in the UK and I get a response back from ‘lhr’, so it’s super fast. That is the speed you should be getting. If you are e.g. in Australia and get a response from e.g. Europe, that would point to an anycast/routing issue, which would indeed cause an initial delay. Unlikely, but I’ve had that with other providers in the past. Worth a check.
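The same check can be scripted, roughly as in the sketch below. It assumes the page body contains a line of the form `Fly-Region: xxx`, which is how the header is quoted in this thread; the helper names `fetch_debug_page` and `find_fly_region` are made up for the example.

```python
import re
import urllib.request


def fetch_debug_page(url="https://debug.fly.dev/"):
    """Download the debug page and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def find_fly_region(text):
    """Return the region code after 'Fly-Region:' in text, or None if absent."""
    match = re.search(r"Fly-Region:\s*(\w+)", text, re.IGNORECASE)
    return match.group(1).lower() if match else None
```

`find_fly_region(fetch_debug_page())` should then give a code such as `lhr` or `fra`, to compare with where you actually are.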

I am experiencing the same problem on my project on Fly, where I have a domain on Namecheap. I just benchmarked my site and “Initial connection” and “SSL” (in the network tab) took 7 seconds. On the second benchmark (once SSL had been established) the whole site loaded in 400 ms. Hopefully I can help provide more info to solve this.

Edit: I forgot to mention that the last deploy to this project was 2 days ago, and I am getting this issue consistently, with no clear sign that it only happens just after deploys.

I tried debug.fly.dev and got region “fra” while in Sweden.

So the question is are you getting 5 second response times still, like an hour after deploying?

I’m getting the slow response an hour, 15 hours, etc. after deploying. I will check the region again, but the fly.io instance is in Frankfurt and I’m in the Gothenburg/Stockholm area, so it should be fast.

The dev url works fast all the time. https://subdomain.fly.dev/ always has a fast response (except the initial one after deploy)

The debug gives me Fly-Region: fra and I have deployed to that region

It doesn’t sound like a routing issue then for either of you, since (as far as I know) Fly doesn’t have an edge in Sweden, so a request being handled by a European region (France/Germany) would be correct.

And it can’t be the slow propagation issue on deploy, as you both say it’s well after you deployed, and that problem is (for now) only apparent right after a deploy.

Hmm.

Seems like this may be one for Fly, I’m afraid. They have access to their proxy innards and could perhaps trace a request, the SSL handshake, etc. If you can provide a Fly request ID (you can get that from the headers) for a request that takes ages, like 5 seconds, they may be able to find it in their logs, see what happened during that request, and see if anything stands out.

Here is a request ID from when I requested my site, deployed in FRA, from Sweden. It took 7 seconds:

01FVZF5TM3CYT3QWFXQXX71DSF-fra
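If it helps anyone else collect these, below is a rough sketch that times a single fresh request and picks the ID out of the response headers. It assumes the header is named `fly-request-id` (check your own response headers if it is not there) and that IDs have the `<id>-<region>` shape seen in this thread. The function names are made up for the example.

```python
import time
import urllib.request


def timed_request(url, header="fly-request-id"):
    """Return (seconds, request id) for one fresh connection to url."""
    start = time.perf_counter()
    # urlopen opens a new connection each call, so DNS, TCP and TLS are included.
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()
        request_id = response.headers.get(header)  # lookup is case-insensitive
    return time.perf_counter() - start, request_id


def split_request_id(request_id):
    """Split '<id>-<region>' into (id, region); region is '' if there is no suffix."""
    ident, sep, region = request_id.rpartition("-")
    if not sep:
        return request_id, ""
    return ident, region
```

Running `timed_request` once after a long idle period and once right after gives a slow/fast pair of IDs to post.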

New certificates for apps we haven’t “seen” before can be slow several times per region. Each of our edges keeps a local cache; once the cert is cached it should be very fast.

How many requests did you run that got 7s response times?

I just made another try and got 7.5 seconds. However, a request made right after is very fast, and then after a while of not requesting, the first request becomes slow again.

01FVZHEBEEDFQX653YSF9ADMT8-fra

We’re looking at this. Our Vault cluster is slow to serve uncached certs, but after about 10 connections per region you shouldn’t ever notice that. This is good information, thank you.

Awesome, thank you! Just let me know if I can provide any more information.

This should be improving now; we discovered a few sources of slowness. Our Vault store is under heavy load and we’re working to scale it.

We also issue both RSA and ECDSA certificates for hostnames. Many ECDSA certificates were behind, and we found that apps with only RSA certs would check Vault for ECDSA certificates before handshaking. This is now fixed: once a cert is cached and fast, we don’t wait on any Vault lookups before the handshake happens.
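To make the before/after concrete, here is a toy model of that behaviour. It is purely illustrative and not Fly's actual proxy code: `CertCache`, `vault_lookup` and the "prefer ECDSA, fall back to RSA" order are all assumptions made for the sketch.

```python
class CertCache:
    """Toy per-edge certificate cache in front of a slow store (hypothetical)."""

    def __init__(self, vault_lookup):
        self.vault_lookup = vault_lookup  # slow call: (hostname, kind) -> cert or None
        self.cached = {}
        self.vault_calls = 0

    def _fetch(self, hostname, kind):
        # Every call here stands for a slow round trip to the store.
        self.vault_calls += 1
        cert = self.vault_lookup(hostname, kind)
        if cert is not None:
            self.cached[(hostname, kind)] = cert
        return cert

    def pick_before_fix(self, hostname):
        # Old behaviour: a missing ECDSA cert is looked up on every handshake,
        # even when an RSA cert is already cached.
        for kind in ("ecdsa", "rsa"):
            cert = self.cached.get((hostname, kind))
            if cert is None:
                cert = self._fetch(hostname, kind)
            if cert is not None:
                return cert
        return None

    def pick_after_fix(self, hostname):
        # New behaviour: any cached cert is used without touching the store.
        for kind in ("ecdsa", "rsa"):
            cert = self.cached.get((hostname, kind))
            if cert is not None:
                return cert
        # Nothing cached yet, so the slow store cannot be avoided.
        for kind in ("ecdsa", "rsa"):
            cert = self._fetch(hostname, kind)
            if cert is not None:
                return cert
        return None
```

With an RSA-only hostname, the old path pays one store lookup on every handshake, while the new path pays only on the first one.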

Thank you so much! This is now solved for me :smiley:

OK, that’s great to hear. Sorry about that! And thank you for the debugging help.

Seems to be working just fine for me now as well. Thank you for the quick response and fix!

visit https://debug.fly.dev/ in your browser

I’m in Seoul but getting Seattle. Would this be the same for all users in Korea?

It might be the same for every client of the same ISP as you in Korea.

Can you provide a traceroute? traceroute debug.fly.dev

This will help us troubleshoot routing issues.

OK, thanks, you’ll see my traceroute below.

I have a Phoenix LiveView app with region NRT and secondary regions HKG and SIN. Typically, latency between Korea and Japan should be around 50-60 ms, but I’m getting websocket (LiveView) responses of something around… a second, I think? Thanks for your help!

➜  ~ traceroute debug.fly.dev
traceroute to debug.fly.dev (77.83.140.164), 64 hops max, 52 byte packets
 1  192.168.45.1 (192.168.45.1)  10.621 ms  3.368 ms  3.130 ms
 2  218.51.63.1 (218.51.63.1)  4.543 ms  4.245 ms  4.259 ms
 3  100.79.40.137 (100.79.40.137)  4.817 ms  4.752 ms  4.575 ms
 4  10.45.253.190 (10.45.253.190)  4.658 ms
    10.45.254.240 (10.45.254.240)  4.700 ms
    10.45.254.20 (10.45.254.20)  4.612 ms
 5  10.222.35.86 (10.222.35.86)  4.809 ms
    10.222.25.54 (10.222.25.54)  4.910 ms
    10.222.25.60 (10.222.25.60)  5.183 ms
 6  1.255.76.109 (1.255.76.109)  5.344 ms  5.270 ms
    211.176.50.151 (211.176.50.151)  5.411 ms
 7  58.229.4.183 (58.229.4.183)  122.121 ms  122.465 ms
    58.229.4.181 (58.229.4.181)  125.489 ms
 8  * * *
 9  * * *
10  * 9.97.225.104.ptr.anycast.net (104.225.97.9)  130.904 ms *
11  * 9.97.225.104.ptr.anycast.net (104.225.97.9)  129.675 ms *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
31  * * *
32  * * *
33  * * *
34  * * *
35  * * *
36  * * *
37  * * *
38  * * *
39  * * *
40  * * *
41  * * *
42  * * *
43  * * *
44  * * *
45  * * *
46  * * *
47  * * *
48  * * *
49  * * *
50  * * *
51  * * *
52  * * *
53  * * *
54  * * *
55  * * *
56  * * *
57  * * *
58  * * *
59  * * *
60  * * *
61  * * *
62  * * *
63  * * *
64  * * *
➜  ~
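For reading output like this, a small sketch that pulls the round-trip times out of each hop and reports where the biggest latency jump happens may help. It assumes the usual `traceroute` text layout shown above (a hop number at the start of the line, times followed by ` ms`, continuation lines without a hop number); `SAMPLE` is an excerpt of the output above and the function names are made up for the example.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s")     # a line that starts a new hop
RTT = re.compile(r"(\d+(?:\.\d+)?) ms")   # one round-trip time

SAMPLE = """\
 6  1.255.76.109 (1.255.76.109)  5.344 ms  5.270 ms
    211.176.50.151 (211.176.50.151)  5.411 ms
 7  58.229.4.183 (58.229.4.183)  122.121 ms  122.465 ms
    58.229.4.181 (58.229.4.181)  125.489 ms
 8  * * *
"""


def parse_hops(output):
    """Map hop number -> list of RTTs in ms (empty list if the hop never answered)."""
    hops = {}
    current = None
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if match:
            current = int(match.group(1))
            hops[current] = []
        if current is not None:
            hops[current].extend(float(value) for value in RTT.findall(line))
    return hops


def biggest_jump(hops):
    """Return (hop, increase in ms) for the largest rise in best RTT, or None."""
    previous = None
    worst = None
    for hop in sorted(hops):
        if not hops[hop]:
            continue  # skip hops that only printed '*'
        best = min(hops[hop])
        if previous is not None:
            jump = best - previous
            if worst is None or jump > worst[1]:
                worst = (hop, jump)
        previous = best
    return worst
```

On the excerpt, the jump lands at hop 7, where the best time goes from about 5 ms to about 122 ms, which suggests the traffic leaves the region at that point.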