Strange P99 metrics

Not sure if this is an artifact of some kind or a real issue. This is the P99 graph by region.

I noticed ORD was at 15 seconds P99, couldn’t find anything wrong with it, restarted it, and it looks like it recovered. However, AMS and MAA have now started to act funny.

ORD had a P99 of 5s, then it went to 15s for a while.
AMS has a P99 of 30s.
MAA is all the way up at 120s.

This is not really making any sense to me. The instances are fine, there have been no deploys, and traffic is nothing beyond the regular patterns. The server-side logs do not indicate any performance issues whatsoever, and neither the soft limit nor the hard limit has been hit in the past 2 days.

Any ideas?


The oddness continues. Not fixed after bouncing the machines in the regions.


After some more debugging, it looks like these are not real response times; they appear to be happening on websocket connections.

Not sure if it’s all of them or just some, but from my brief testing it didn’t appear to happen on pages without websockets.

Hi,

I think I have seen the same issue for one of my apps in the Frankfurt region. It was running fine for about 2 weeks, but just before the weekend I started seeing critical healthchecks and very high edge response times. Restarting the app fixes it temporarily.

In my case, it applies to edge response time. App response times are good:

Edge:

App:

I have not yet figured out why this happens.

Are you using websockets at all?

You might want to try this query in the Prometheus explorer to narrow things down (replace YOUR_APP_NAME).

histogram_quantiles("p", 0.99, sum(increase(fly_app_http_response_time_seconds_bucket{app="YOUR_APP_NAME"})) by (le, region))
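If that still looks odd, a variant of the same query can help tell whether a handful of long-lived connections (e.g. websockets) are skewing the tail. This is my own untested sketch, assuming the bucket metric exposes the usual le="+Inf" bucket; it just shows the raw request count per region over the selected window:

sum(increase(fly_app_http_response_time_seconds_bucket{app="YOUR_APP_NAME", le="+Inf"})) by (region)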

I’m only seeing this behaviour in Europe and India. It just seems rather odd to have two groupings: 5 seconds and 30 seconds.

What I would expect is something more evenly distributed, without these groupings (but even then, 5 seconds is far too long).

I am not using websockets. I will try the query you suggested, thanks!

The confusion continues. It’s back to generally maxing out at 5 seconds again.

Is it possible your apps are reaching their hard limits frequently? This would create a lot of retries from our proxy and show up as slow responses from the proxy, but fast responses from the app.

Nope, not even close to the soft-limit.

I think something changed in the Grafana data, as I can’t recreate the graph with P99 at 120 seconds and 30 seconds anymore. If I hadn’t taken the screenshot, I would have thought I had gone crazy.

Looking closely at the P99 data without those artifacts, the P99 is between 3 and 5 seconds. This could potentially be due to reads/writes hitting the primary from edge locations, but that’s pretty long for a P99. That’s the only theory I have at the moment, though.

We did notice yesterday that a few of our metrics servers were not in sync with the others. That can create weird graphs depending on which metrics server you hit via our API.

Yah, that would also explain why I sometimes see different graphs when I hit refresh.

Well, I’m going to flip over to LiteFS shortly, and if this is due to DB reads on the primary, the P99 will drop quite a bit. Guess we’ll find out. 🙂

Past 24 hours, with artifacts at 60, 30, and 15 seconds depending on the region. I did not see these at all until now, and I have been checking the metrics a few times a day.

The previous screenshot (taken in Strange P99 metrics - #7 by tj1) does not have these artifacts at all.

Past 6 hours (no artifacts). Likely DB-related, as the 5-second mark only occurs in sjc, sin, and dfw. The server logs show requests that take 1s on a single page, so 5s seems a stretch, but plausible.
