I noticed ORD was at a P99 of 15 seconds and couldn’t find anything wrong with it. I restarted it and it looks like it recovered. However, AMS and MAA have now started to act funny.
ORD has a P99 of 5s, then it went to 15s for a while.
AMS has a P99 of 30s.
MAA is all the way up at 120s.
This is not really making any sense to me. The instances are fine. There have been no deploys. Traffic is following its regular patterns. The server-side logs do not indicate any performance issues whatsoever. Neither the soft limit nor the hard limit has been hit in the past 2 days.
I think I have seen the same issue for one of my apps in the Frankfurt region. It was running fine for about 2 weeks, but just before the weekend I started seeing critical healthchecks and very high edge response times. Restarting the app fixes it temporarily.
In my case, it applies to edge response time. App response times are good:
Is it possible your apps are reaching their hard limits frequently? This would create a lot of retries from our proxy and show up as slow responses from the proxy, but fast responses from the app.
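To make that theory concrete, here is a rough Python sketch of how retries at the proxy could inflate the edge response time while the app's own timing stays flat. The function name, the fixed retry delay, and the attempt counts are made-up values for illustration; this is not a description of how the proxy actually schedules retries.

```python
def edge_response_time(app_time_s, rejected_attempts, retry_delay_s):
    """Time seen at the edge when the proxy retries before an instance accepts.

    Hypothetical model: each rejected attempt costs one fixed retry delay,
    then the accepted request spends app_time_s inside the app.
    """
    if rejected_attempts < 0:
        raise ValueError("rejected_attempts must be >= 0")
    return rejected_attempts * retry_delay_s + app_time_s


# The app only ever measures its own 0.2 s of work (assumed value).
app_time_s = 0.2
for rejected in (0, 5, 50):
    edge = edge_response_time(app_time_s, rejected, retry_delay_s=0.1)
    print(f"rejected={rejected:>2}  app={app_time_s:.1f}s  edge={edge:.1f}s")
```

In this model the app-side number never moves, which would match a graph where app response times look good and edge response times do not.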
I think something changed in the Grafana data, as I can’t recreate the graph with P99 at 120 seconds and 30 seconds anymore. If I hadn’t taken the screenshot, I would have thought I had gone crazy.
Looking closely at the P99 data without those artifacts, the P99 is between 3s and 5s. This could be due to reads/writes on the primary from edge locations, but that’s pretty long for a P99 time. That’s the only theory I have atm though.
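For a sense of how few slow requests it takes to move a P99, here is a small Python sketch using the nearest-rank definition. The durations are synthetic, not taken from the real metrics, and the dashboard may compute its P99 differently (for example from histogram buckets).

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids float rounding surprises.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


# Synthetic request durations in seconds (assumed values).
mostly_fast = [0.2] * 995 + [5.0] * 5      # 0.5% of requests are slow
slightly_worse = [0.2] * 985 + [5.0] * 15  # 1.5% of requests are slow

print(percentile_nearest_rank(mostly_fast, 99))
print(percentile_nearest_rank(slightly_worse, 99))
```

Once slightly more than 1% of requests are slow, the P99 reports the slow value, even though nearly all requests are still fast.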
We did notice yesterday that a few of our metrics servers were not in sync with the others. That can create weird graphs depending on which metric server you’d hit via our API.
Past 24 hours, with artifacts at 60, 30, and 15 seconds depending on region. I did not see these at all until now, and I have been watching the metrics a few times a day.
Past 6 hours (no artifacts). Likely db related as the 5 second mark only occurs on sjc, sin, and dfw. From server logs there are requests that take 1s on a single page, so 5s seems a stretch, but plausible.
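A back-of-the-envelope check on the db theory, as a Python sketch. The 1s page time comes from the server logs mentioned above; the 200 ms round-trip time to the primary and the assumption that queries run one after another are guesses for illustration only.

```python
def estimated_page_ms(local_ms, round_trips, rtt_ms):
    """Rough page time if every query is one sequential round trip to the primary."""
    return local_ms + round_trips * rtt_ms


def round_trips_to_reach(target_ms, local_ms, rtt_ms):
    """Smallest number of sequential round trips that reaches target_ms."""
    extra_ms = max(0, target_ms - local_ms)
    return -(-extra_ms // rtt_ms)  # integer ceiling division


local_ms = 1000  # page time seen in the server logs
rtt_ms = 200     # assumed round trip from a distant region to the primary
needed = round_trips_to_reach(5000, local_ms, rtt_ms)
print(f"round trips needed to reach 5s: {needed}")
print(f"estimated page time: {estimated_page_ms(local_ms, needed, rtt_ms)} ms")
```

Under these assumptions, a 1s page would need on the order of 20 sequential cross-region queries to reach 5s, which fits "a stretch, but plausible".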