Strange P99 metrics

Not sure if this is an artifact of some kind or a real issue. This is the P99 graph by region.

I noticed ORD was at 15 seconds P99, couldn’t find anything wrong with it, restarted it, and it looks like it recovered. However, AMS and MAA have now started to act funny.

ORD had a P99 of 5s, then it went to 15s for a while.
AMS has a P99 of 30s.
MAA is all the way up at 120s.

This is not really making any sense to me. The instances are fine, there have been no deploys, and traffic is nothing beyond the regular patterns. The server-side logs do not indicate any performance issues whatsoever, and neither the soft limit nor the hard limit has been hit in the past 2 days.

Any ideas?


The oddness continues. Not fixed after bouncing the machines in the regions.


After some more debugging, it looks like these are not real response times; they appear to be happening on websocket connections.

Not sure if it’s all of them or just some, but from my brief testing it didn’t appear to happen on pages without websockets.

Hi,

I think I have seen the same issue for one of my apps in the Frankfurt region. It was running fine for about 2 weeks, but just before the weekend I started seeing critical healthchecks and very high edge response times. Restarting the app fixes it temporarily.

In my case, it applies to edge response time. App response times are good:

Edge:

App:

I have not yet figured out why this happens.

Are you using websockets at all?

You might want to try this query in the Prometheus explorer to narrow things down (replace YOUR_APP_NAME).

histogram_quantiles("p", 0.99, sum(increase(fly_app_http_response_time_seconds_bucket{app="YOUR_APP_NAME"})) by (le, region))
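If that still looks odd, a variant of the same query can help tell whether a handful of long-lived connections (e.g. websockets) are skewing the tail. This is my own untested sketch, assuming the bucket metric exposes the usual le="+Inf" bucket; it just shows the raw request count per region over the selected window:

sum(increase(fly_app_http_response_time_seconds_bucket{app="YOUR_APP_NAME", le="+Inf"})) by (region)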

I’m only seeing this behaviour in Europe and India. It just seems rather odd to have two groupings: 5 seconds and 30 seconds.

What I would expect is something more evenly distributed, without these groupings (but even then, 5 seconds is far too long).

I am not using websockets. I will try the query you suggested, thanks!

The confusion continues. It’s back to generally maxing out at 5 seconds again.

Is it possible your apps are reaching their hard limits frequently? This would create a lot of retries from our proxy and show up as slow responses from the proxy, but fast responses from the app.

Nope, not even close to the soft-limit.

I think something changed in the Grafana data, as I can’t recreate the graph with P99 at 120 seconds and 30 seconds anymore. If I hadn’t taken the screenshot, I would have thought I had gone crazy.

Looking closely at the P99 data without those artifacts, the P99 is between 3 and 5 seconds. This could potentially be due to reads/writes hitting the primary from edge locations, but that’s pretty long for a P99. That’s the only theory I have at the moment, though.

We did notice yesterday that a few of our metrics servers were not in sync with the others. That can create weird graphs depending on which metrics server you hit via our API.

Yah, that would also explain why I sometimes see different graphs when I hit refresh.

Well, I’m going to flip over to LiteFS shortly, and if this is due to DB reads on the primary, the P99 will drop quite a bit. Guess we’ll find out. 🙂

Past 24 hours, with artifacts at 60, 30, and 15 seconds depending on the region. I did not see these at all until now, and I have been checking the metrics a few times a day.

The previous screenshot (taken in Strange P99 metrics - #7 by tj1) does not have these artifacts at all.

Past 6 hours (no artifacts). Likely DB-related, as the 5-second mark only occurs in sjc, sin, and dfw. The server logs show requests that take 1s on a single page, so 5s seems a stretch, but plausible.
