I noticed ORD was at a 15-second P99, couldn't find anything wrong with it, restarted it, and it looks like it recovered. Now, however, AMS and MAA have started to act funny.
- ORD: P99 of 5s, though it was at 15s for a while.
- AMS: P99 of 30s.
- MAA: P99 all the way up at 120s.
This is not really making sense to me. The instances are fine, there have been no deploys, and traffic patterns are normal. The server-side logs don't indicate any performance issues whatsoever, and neither the soft limit nor the hard limit has been hit in the past 2 days.
I think I've seen the same issue with one of my apps in the Frankfurt region. It ran fine for about 2 weeks, but just before the weekend I started seeing critical health checks and very high edge response times. Restarting the app fixes it temporarily.
In my case, it applies to edge response time, while app response times are good.
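One quick way to see that gap is to time a request from the client (which includes the edge and proxy hop) and compare it against whatever duration the app itself reports. A minimal sketch; the URL is a placeholder and the `X-App-Response-Time` header is hypothetical, so substitute whatever your app actually emits:

```python
import time
import requests

URL = "https://my-app.fly.dev/"  # placeholder app URL

# Client-observed time includes DNS, TLS, the Fly edge, and the proxy
# hop. If the app also reports its own processing time (header name
# here is hypothetical), the gap between the two numbers is time
# spent outside the app.
start = time.monotonic()
resp = requests.get(URL, timeout=30)
total = time.monotonic() - start

app_time = resp.headers.get("X-App-Response-Time", "n/a")
print(f"client-observed: {total:.3f}s, app-reported: {app_time}")
```

If the client-observed time is seconds while the app-reported time stays in the tens of milliseconds, the latency is being added somewhere between the edge and the app, which matches what I'm seeing.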
I think something changed in the Grafana data, as I can't recreate the graph with the 120-second and 30-second P99s anymore. If I hadn't taken the screenshot, I would have thought I'd gone crazy.
Looking closely at the P99 data without those artifacts, the P99 is between 3 and 5s. This could potentially be due to reads/writes hitting the primary from edge locations, but that's pretty long even for a P99. That's the only theory I have at the moment, though.
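To gut-check that theory: a single edge-to-primary round trip is maybe on the order of 100-300ms, so a 3-5s page only really adds up if a request issues many sequential queries against the remote primary (e.g. an N+1 pattern). A toy model with made-up numbers; the RTTs and query count below are assumptions, not measurements:

```python
# Toy model: N sequential queries per page, each paying a full
# cross-region round trip to the primary. RTTs are illustrative.
RTT = {"sjc": 0.15, "sin": 0.25, "dfw": 0.10}  # seconds, made up
QUERIES_PER_PAGE = 20  # e.g. an N+1 query pattern

for region, rtt in RTT.items():
    page_time = QUERIES_PER_PAGE * rtt
    print(f"{region}: {QUERIES_PER_PAGE} queries x {rtt:.2f}s RTT "
          f"= {page_time:.1f}s page time")
```

With numbers like these, 20 sequential queries from sin lands right at 5s, so the theory is at least arithmetically plausible even if the per-query RTT seems small.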
Past 6 hours (no artifacts). Likely DB-related, since the 5-second mark only shows up in sjc, sin, and dfw. Server logs show requests taking 1s on a single page, so 5s seems like a stretch, but it's plausible.
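For anyone who wants to reproduce that per-region P99 outside the stock dashboard, here's a sketch against Fly's Prometheus-compatible API. The org slug (`personal`), the histogram name (`fly_edge_http_response_time_seconds_bucket`), and the `region` label are all from memory and may be wrong, so verify them against your own Grafana datasource first:

```python
import os
import requests

# Fly's hosted Prometheus endpoint; org slug is an assumption.
PROM_URL = "https://api.fly.io/prometheus/personal/api/v1/query"
TOKEN = os.environ["FLY_API_TOKEN"]

# P99 edge response time per region over the last 6 hours. The
# histogram metric name is an assumption -- check your dashboard's
# query for the real one.
query = (
    'histogram_quantile(0.99, sum by (le, region) ('
    'rate(fly_edge_http_response_time_seconds_bucket'
    '{app="my-app"}[6h])))'
)

resp = requests.get(
    PROM_URL,
    params={"query": query},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    region = series["metric"].get("region", "?")
    print(f'{region}: {float(series["value"][1]):.2f}s')
```

That should make it easy to confirm whether the 5s tail really is confined to sjc, sin, and dfw or just happens to be most visible there.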