healthcheck roundtrips spiked to >2000 ms at least a week ago

hi there,
i’m seeing something that sounds very similar to the thread “Abnormally slow SSL handshake resulting in slow server responses across Fly apps?”

i have the same simple remix app deployed to two different apps in ORD. up until recently response times for both have been similar and snappy for me (near LAX).

last week i noticed one of the apps flapping because of failed health checks. i had to increase the grace period because the health check roundtrips are now consistently > 2000ms, even after a machine restart.

https://court-dibs-59c4.fly.dev

2024-09-22T02:27:07.838 app[7815100b57e7e8] ord [info] GET /healthcheck 200 - - 2120.595 ms
2024-09-22T02:27:29.925 app[7815100b57e7e8] ord [info] HEAD / 200 - - 2081.199 ms
2024-09-22T02:27:29.961 app[7815100b57e7e8] ord [info] GET /healthcheck 200 - - 2109.783 ms
2024-09-22T02:27:52.047 app[7815100b57e7e8] ord [info] HEAD / 200 - - 2079.468 ms

meanwhile, the other app with identical specs has roundtrips under 700ms

https://court-dibs-59c4-staging.fly.dev

2024-09-22T02:31:36.317 app[28714d9a052d58] ord [info] GET /healthcheck 200 - - 678.646 ms
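
To compare the two apps beyond eyeballing individual lines, the request durations can be pulled out of the log text and averaged per route. Below is a minimal sketch in TypeScript, assuming the log format shown above (method, path, status, then a duration in ms at the end of the line). `parseLogLine` and `meanMsByRoute` are hypothetical helper names, not part of flyctl or Remix.

```typescript
// Matches request log lines such as:
//   "... [info] GET /healthcheck 200 - - 2120.595 ms"
// Assumption: the duration is the last number on the line, followed by "ms".
const LOG_LINE =
  /\b(GET|HEAD|POST|PUT|PATCH|DELETE)\s+(\S+)\s+(\d{3})\s.*?([\d.]+)\s*ms\s*$/;

interface RequestTiming {
  method: string;
  path: string;
  status: number;
  ms: number;
}

// Returns the parsed timing, or null when the line is not a request log line.
function parseLogLine(line: string): RequestTiming | null {
  const m = LOG_LINE.exec(line);
  if (m === null) return null;
  const [, method = "", path = "", status = "", ms = ""] = m;
  return { method, path, status: Number(status), ms: Number(ms) };
}

// Groups timings by "METHOD path" and returns the mean duration for each group.
function meanMsByRoute(lines: string[]): Record<string, number> {
  const sums: Record<string, { sum: number; count: number }> = {};
  for (const line of lines) {
    const t = parseLogLine(line);
    if (t === null) continue;
    const key = `${t.method} ${t.path}`;
    const entry = sums[key] ?? { sum: 0, count: 0 };
    entry.sum += t.ms;
    entry.count += 1;
    sums[key] = entry;
  }
  const means: Record<string, number> = {};
  for (const key of Object.keys(sums)) {
    const entry = sums[key];
    if (entry) means[key] = entry.sum / entry.count;
  }
  return means;
}
```

Feeding it the saved log lines from each app gives one average per route, which makes it easier to see whether `/healthcheck` and `/` slow down together.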

i’m not seeing anything suspicious when running traceroute for IPv4 and comparing the two

➜  court-dibs git:(main) traceroute court-dibs-59c4.fly.dev
traceroute to court-dibs-59c4.fly.dev (66.241.125.138), 64 hops max, 40 byte packets
 1  192.168.68.1 (192.168.68.1)  4.157 ms  4.674 ms  5.275 ms
 2  10.82.126.1 (10.82.126.1)  11.037 ms  13.680 ms  17.757 ms
 3  100.120.105.164 (100.120.105.164)  18.235 ms  13.215 ms  10.375 ms
 4  100.120.104.16 (100.120.104.16)  11.565 ms  17.691 ms  11.392 ms
 5  langbprj01-ae1.rd.la.cox.net (68.1.1.13)  18.159 ms  14.269 ms *
 6  be-200-pe11.600wseventh.ca.ibone.comcast.net (50.248.118.9)  18.459 ms  14.844 ms  14.733 ms
 7  be-3412-pe12.600wseventh.ca.ibone.comcast.net (96.110.33.78)  15.813 ms  16.017 ms  14.591 ms
 8  75.149.231.130 (75.149.231.130)  15.510 ms  17.167 ms  63.721 ms
 9  * * *

➜  court-dibs git:(main) ✗ traceroute court-dibs-59c4-staging.fly.dev
traceroute to court-dibs-59c4-staging.fly.dev (66.241.124.196), 64 hops max, 40 byte packets
 1  192.168.68.1 (192.168.68.1)  7.097 ms  6.587 ms  4.613 ms
 2  10.82.126.1 (10.82.126.1)  16.893 ms  11.916 ms  11.645 ms
 3  100.120.105.164 (100.120.105.164)  17.799 ms  12.520 ms  11.485 ms
 4  100.120.104.16 (100.120.104.16)  12.829 ms  13.588 ms  23.572 ms
 5  * langbprj01-ae1.rd.la.cox.net (68.1.1.13)  18.073 ms *
 6  be-200-pe11.600wseventh.ca.ibone.comcast.net (50.248.118.9)  17.243 ms  14.263 ms  14.472 ms
 7  be-3312-pe12.600wseventh.ca.ibone.comcast.net (96.110.33.74)  14.606 ms  16.768 ms  24.680 ms
 8  75.149.231.130 (75.149.231.130)  15.884 ms  18.072 ms  16.282 ms
 9  * * *

i’d love to be able to provide an IPv6 traceroute but i suspect my own ISP doesn’t allow me to make direct IPv6 connections.

➜  court-dibs git:(main) ✗ traceroute6 2a09:8280:1::39:d0d:0
connect: No route to host

Hi… I see a similar lag when trying from a machine in ord, 🐢, so I don’t think it’s your ISP:

$ echo "$FLY_REGION"
ord

$ time curl -i 'https://court-dibs-59c4.fly.dev/' > /dev/null
real    0m2.485s
user    0m0.020s
sys     0m0.000s

$ time curl -i 'https://court-dibs-59c4.fly.dev/healthcheck' > /dev/null
real    0m2.607s
user    0m0.019s
sys     0m0.000s

In general, the health check should be a request directly from the Fly.io infrastructure to your 7815100b57e7e8 machine, so your local California network isn’t a likely suspect, anyway.

Are you using shared CPUs on Fly? Those can have very uneven response times.

thanks for the reply @mayailurus.

“Are you using shared CPUs on Fly?”

i just upgraded the court-dibs-59c4 machine to ‘performance-1x’ and the roundtrips displayed in the live logs for the application are just as slow.

2024-09-22T14:05:39.965 app[7815100b57e7e8] ord [info] HEAD / 200 - - 2452.327 ms
2024-09-22T14:05:40.015 app[7815100b57e7e8] ord [info] GET /healthcheck 200 - - 2497.243 ms

i wasn’t clear in my initial post, but I am providing information directly from the logs because I agree with you that it rules out my own ISP.

Hm… What is that particular /healthcheck route doing in detail? E.g., is it talking to a database at all? The 700ms on the other machine is also rather slow…
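
One way to answer that question is to time each step of the handler and log the breakdown. Here is a minimal sketch in TypeScript; `StepTimer` and the step labels are hypothetical names for illustration, not part of Remix or Fly.io, and the clock is injectable only so the timer can be checked with fake timestamps.

```typescript
// Records how long each step of a request handler takes.
class StepTimer {
  private readonly now: () => number;
  private last: number;
  private readonly steps: { label: string; ms: number }[] = [];

  // `now` defaults to the wall clock; pass a fake clock when testing.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.last = now();
  }

  // Record the time elapsed since the previous mark (or since construction).
  mark(label: string): void {
    const t = this.now();
    this.steps.push({ label, ms: t - this.last });
    this.last = t;
  }

  // One-line summary with the slowest step first.
  report(): string {
    return this.steps
      .slice()
      .sort((a, b) => b.ms - a.ms)
      .map((s) => `${s.label}=${s.ms}ms`)
      .join(" ");
  }
}

// Hypothetical usage inside a route handler (the calls below are placeholders):
//   const timer = new StepTimer();
//   const rows = await loadRows();        // e.g. a database query
//   timer.mark("db");
//   const view = buildView(rows);         // e.g. per-row computation for SSR
//   timer.mark("compute");
//   console.log(timer.report());
```

If one label dominates the report, that step is the place to look first, whether it is a query, a per-row computation, or an outbound request.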

You might try looking at RAM and swap consumption, as well, which is another old troubleshooting standby.

(And the forum’s JS experts may have additional suggestions…)

turns out i was doing some overly expensive SSR computations on the homepage of my site. the discrepancy between the two apps was explained entirely by the staging site’s db query returning far fewer rows.

thanks so much for helping me narrow down the cause of the issue. i was way off base initially.
