I’m not quite sure I understand how to read the metrics in the Dashboard and Grafana, as I can’t get them to make sense. In the Dashboard, sometimes I see HTTP response times measured in seconds. When I click the “Grafana” button to open the metrics in Grafana, the response times are much lower (usually < 500 milliseconds). Is there a bug in the Dashboard that inflates the response times, or are these different queries? This usually happens when monitoring reports that one or more of the app instances are in a critical state.
I just had a case of the Dashboard showing 3 seconds for the 0.5 quantile, < 500 ms for 0.95 (!), and over 5 seconds for 0.99. That looks wrong, and it is also completely different from the Grafana stats for the same timespan.
I also don’t fully get why my apps go critical so often. I assume Fly reports them as critical when the health check is not responding, but I can see from my application logs that the health check endpoint is responding as usual (and it responds in less than 1 millisecond, just a simple 200 OK). I just had two of three nodes go critical. In the logs, I could see health checks being performed. When I restarted the apps, they became healthy again. There are no errors in the app logs. Memory usage is far below the maximum for the instances, which run on dedicated-CPU VMs.
It looks to me like the issues are outside the application, but if anybody has experienced the same type of behavior from their apps and managed to fix it, I’d be happy to learn about it.
Hey! If you select Fly Edge (top right in Grafana) > HTTP, do those numbers look closer to what you’re seeing on the Dashboard?
We have separate metrics for the app lifecycle and for the edge request/response lifecycle. Edge response time is basically app response time plus the networking/proxying needed to respond from our edge nodes.
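To make the difference between the two views concrete, here is a rough sketch in Python. All numbers are made up for illustration and are not real Fly metrics; `app_times`, `proxy_overhead`, and the `quantile` helper are hypothetical names, and the helper uses a plain nearest-rank calculation rather than whatever the Dashboard or Grafana queries actually do.

```python
import math


def quantile(samples, q):
    """Nearest-rank quantile: the smallest sample with at least a fraction q
    of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request app response times in seconds (illustrative only).
app_times = [0.05, 0.08, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

# Hypothetical time spent outside the app (networking + proxying) for the
# same requests. A few requests are assumed to be slow outside the app.
proxy_overhead = [0.01, 0.01, 0.02, 0.02, 2.90, 3.00, 0.02, 0.01, 4.80, 5.50]

# Edge response time = app response time + networking/proxying.
edge_times = [a + p for a, p in zip(app_times, proxy_overhead)]

for q in (0.5, 0.95, 0.99):
    print(f"q={q}: app={quantile(app_times, q):.2f}s "
          f"edge={quantile(edge_times, q):.2f}s")
```

In this sketch the app-level quantiles stay under 500 ms while the edge-level ones reach several seconds, which is the kind of gap you would expect if the extra time is spent outside the app. Within a single set of requests, quantiles can only grow as the quantile increases, so a 0.95 value below the 0.5 value would suggest the numbers come from different series or time windows.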
Clicking “Show Queries” on the Fly.io dashboard metrics will show the query we run to get those. You should also be able to inspect the JSON for the Grafana dashboard panels if you’re interested in comparing them more closely.
I checked the Fly Edge metrics, and these match what I see on the Dashboard with the long response times. Does this mean that the slowdown is happening in the network, before/after the request hits the app?