Currently I’m seeing values in the metrics for HTTP STATUS CODES of 1.7 for HTTP 200 responses. There are no details on what this means. 1.7k? 1.7 times the average (with no details on what the average is)?
Could the metrics for the response codes be tweaked to show the total number for a period? 15min blocks of time should be adequate. It’s more useful to see 1.4k 200 responses and 132 500 responses over 15mins than an arbitrary number.
Maybe it could even be split per resp. code with percentiles, in the same way the response time is done?
Alternatively, some help text to understand what the values represent. I’m assuming response time is in seconds, and all the other charts are understandable, but the values for response codes are… cryptic.
We haven’t worked much on the metrics page lately, but we’ve made all its metrics (and more) available to everyone via a prometheus-compatible query interface. You can read more about it here: Feature preview: Custom metrics
To answer your specific questions:
Status codes are a rate over 15 seconds. So that’s the number of requests per second that returned the status code calculated over a window of 15s. You can sort of view this by clicking “View query” at the bottom right of the chart. This is also the exact query you can use in your own Grafana installation.
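To make that concrete, here is a rough sketch of the arithmetic in Python. It assumes the rate stays constant over the whole window, which real traffic won't, and `requests_in_window` is just a made-up helper name for illustration. Under that assumption, a chart value of 1.7 for HTTP 200 would work out to roughly 1,530 responses over a 15 minute block.

```python
def requests_in_window(rate_per_second, window_seconds):
    """Approximate number of responses implied by a steady per-second rate."""
    return rate_per_second * window_seconds


# A chart value of 1.7 for HTTP 200 means about 1.7 responses per second.
per_rate_window = requests_in_window(1.7, 15)       # one 15 s rate window
per_15_minutes = requests_in_window(1.7, 15 * 60)   # a 15 min block, if the rate held

print(round(per_rate_window, 1), round(per_15_minutes))
```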
Response times are indeed in seconds.
Creating a good metrics interface is a lot of work, which is part of the reason we preferred offering raw access so you can query the metrics with other tools like Grafana.
sum by (status)(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"})
This is assuming you’re using the provided dashboard with $app, $region and $host variables. If not you can use a static value in the query like: app="your-app-name".
The rate() in the dashboard query provides the per-second calculation; if you remove it, as in the query above, you just get the count. The sum is still necessary because we have series for all regions and hosts.
This gives the count over time, but not the count per interval. E.g. if you have 1 req/second and the interval is 30 seconds, you get 30, 60, 90, and so on. What would the query be to get the count per interval instead?
Sorry if that’s a noob question, I am a complete noob with Prometheus and PromQL.
I don’t think this is correct, and it’s also not helpful. I would like to see the count of each status code.
Here’s my code for 404. I see the 404s in the logs, and the client receives the 404 status code. But the metrics show them as successful 200s?
Those are rates, so a single 404 request might not show up with a significant digit here. The fact that it shows 0 is probably because it recorded something but our UI trims too many fractional digits, rounding to 0. We should probably change that.
Metrics help give an idea of “trends”. If you need precision, then you probably want to look into “tracing” as a concept.
You could also use increase instead of rate to see “how many http status code responses since the last interval”. If you don’t want to deal with Grafana, then this is not currently feasible.
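For anyone wondering what "count per interval" means in terms of the raw counter, here is a small Python sketch. It is only an illustration of the idea, not how Prometheus implements increase (which also extrapolates to the edges of the range), and `per_interval_counts` is a hypothetical helper name. It takes cumulative samples like the 30, 60, 90 from the earlier post and returns the difference between consecutive samples, treating a drop as a counter reset.

```python
def per_interval_counts(samples):
    """Turn cumulative counter samples into per-interval increases.

    A drop between two samples is treated as a counter reset (for example
    after a restart), so the new value itself is taken as the increase.
    """
    increases = []
    for prev, curr in zip(samples, samples[1:]):
        increases.append(curr - prev if curr >= prev else curr)
    return increases


# 1 req/second sampled every 30 seconds gives a growing cumulative count.
cumulative = [30, 60, 90, 120]
print(per_interval_counts(cumulative))  # [30, 30, 30]
```

In PromQL the closest equivalent is wrapping the counter selector in increase with a range matching your interval, inside the same sum by (status).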
I’ve been wondering if there’s a practical, useful way to display basic metrics in the terminal via flyctl.