Metrics: `HTTP STATUS CODES` values are confusing

Currently I’m seeing values in the metrics for HTTP STATUS CODES of 1.7 for HTTP 200 responses. There are no details on what this means. 1.7k? 1.7 times the average (with no details on what the average is)?

Could the metrics for the response codes be tweaked to show the total number for a period? 15-minute blocks of time should be adequate. It’s more useful to see 1.4k HTTP 200 responses and 132 HTTP 500 responses over 15 minutes than an arbitrary number.

Maybe it could even be split per resp. code with percentiles, in the same way the response time is done?

Alternatively, some help text explaining what the values represent would do. I’m assuming response time is in seconds, and all the other charts are understandable, but the values for response codes are… cryptic.

We haven’t worked much on the metrics page lately, but we’ve made all its metrics (and more) available to everyone via a Prometheus-compatible query interface. You can read more about it here: Feature preview: Custom metrics

To answer your specific questions:

  • Status codes are a rate over 15 seconds. So that’s the number of requests per second that returned the status code calculated over a window of 15s. You can sort of view this by clicking “View query” at the bottom right of the chart. This is also the exact query you can use in your own Grafana installation.
  • Response times are indeed in seconds.
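As a rough sanity check on what a value like 1.7 means, a per-second rate can be turned into an approximate request count by multiplying it by the window length. A minimal Python sketch, assuming the rate stays constant over the window (the helper name `approx_requests` is made up for illustration):

```python
def approx_requests(rate_per_second: float, window_seconds: float) -> float:
    """Approximate request count for a constant per-second rate over a window."""
    return rate_per_second * window_seconds


# 1.7 is the value from the chart; 15 s is the rate window, 15 * 60 s is 15 minutes.
in_rate_window = approx_requests(1.7, 15)
in_15_minutes = approx_requests(1.7, 15 * 60)

print(f"~{in_rate_window:.1f} requests per 15 s, ~{in_15_minutes:.0f} per 15 min")
```

So a steady 1.7 on the chart would be on the order of 1.5k HTTP 200 responses per 15 minutes.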

Creating a good metrics interface is a lot of work, which is part of the reason we preferred offering raw access so you can query the metrics with other tools like Grafana.


@jerome can you please share an example PromQL query to get the HTTP status codes as a count instead of a rate per second, still bucketed by code?

Sure, something like:

sum by (status)(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"})

This is assuming you’re using the provided dashboard with $app, $region and $host variables. If not you can use a static value in the query like: app="your-app-name".

The rate() in the chart’s query is what provides the per-second calculation; if you remove it, as in the query above, you just get the raw count. The sum is still necessary because we have series for all regions and hosts.

Thank you, @jerome !

This gives the count over time, but not the count per interval. E.g. if you have 1 req/second and the interval is 30 seconds, you are getting 30, 60, 90..... What should be the query to get the count per interval instead?

Sorry if that’s a noob question, I am a complete noob with Prometheus and PromQL :smiley:

You can use increase() instead of rate().

Like:

sum by (status)(increase(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"}[30s]))

This will give you the increase of responses in a 30s window.
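To make the difference between the raw counter, the per-interval increase and the per-second rate concrete, here is a small Python sketch using the 1 req/second, 30-second-interval example from the question. The sample values are made up, and this is only the basic arithmetic: the real increase() also extrapolates to the window boundaries and handles counter resets, so its results can be non-integer.

```python
# Cumulative counter sampled every 30 s at a steady 1 request/second.
interval_seconds = 30
counter_samples = [0, 30, 60, 90, 120]

# Per-interval increase: difference between consecutive samples.
increases = [b - a for a, b in zip(counter_samples, counter_samples[1:])]

# Per-second rate: the increase divided by the interval length.
rates = [inc / interval_seconds for inc in increases]

print("counter:  ", counter_samples)
print("increase: ", increases)
print("rate/sec: ", rates)
```

The counter keeps growing (30, 60, 90, ...), while the increase stays flat at the number of requests in each interval.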
