Metrics: `HTTP STATUS CODES` values are confusing

OldhamMade · May 25, 2021, 10:02am

Currently I’m seeing values in the metrics for HTTP STATUS CODES of 1.7 for HTTP 200 responses. There are no details on what this means. 1.7k? 1.7 times the average (with no details on what the average is)?

Could the metrics for the response codes be tweaked to show the total number for a period? 15min blocks of time should be adequate. It’s more useful to see 1.4k 200 responses and 132 500 responses over 15mins than an arbitrary number.

Maybe it could even be split per resp. code with percentiles, in the same way the response time is done?

Alternatively, add some help text explaining what the values represent. I’m assuming response time is in seconds, and all the other charts are understandable, but the values for response codes are… cryptic.

jerome · May 25, 2021, 12:22pm

We haven’t worked much on the metrics page lately, but we’ve made all its metrics (and more) available to everyone via a Prometheus-compatible query interface. You can read more about it here: Feature preview: Custom metrics

To answer your specific questions:

- Status codes are a rate over 15 seconds. So that’s the number of requests per second that returned the status code, calculated over a window of 15s. You can sort of view this by clicking “View query” at the bottom right of the chart. This is also the exact query you can use in your own Grafana installation.
- Response times are indeed in seconds.
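To make the "rate over 15 seconds" idea concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not the actual implementation: the `CounterSample` type and both function names are made up for illustration, and Prometheus's real rate() also extrapolates to the window edges and handles counter resets.

```typescript
// Simplified model of a per-second rate over a window.
// Assumption: this ignores the extrapolation and counter-reset handling
// that Prometheus's real rate() performs.
interface CounterSample {
  timestampSeconds: number; // when the counter was scraped
  value: number; // cumulative response count at that time
}

// Per-second rate between the first and last sample in a window.
function perSecondRate(first: CounterSample, last: CounterSample): number {
  const elapsed = last.timestampSeconds - first.timestampSeconds;
  if (elapsed <= 0) {
    throw new Error("samples must be in increasing time order");
  }
  return (last.value - first.value) / elapsed;
}

// Convert a displayed rate back into an approximate count for a period.
function approximateCount(ratePerSecond: number, periodSeconds: number): number {
  return Math.round(ratePerSecond * periodSeconds);
}

// Hypothetical samples: the counter grew by 30 over a 15-second window.
const exampleRate = perSecondRate(
  { timestampSeconds: 0, value: 1000 },
  { timestampSeconds: 15, value: 1030 }
); // 2 responses per second

// A displayed value of 1.7 sustained for 15 minutes (900 s) is roughly 1530 responses.
const fifteenMinuteCount = approximateCount(1.7, 900);
```

Read this way, a value of 1.7 on the chart means about 1.7 responses per second, assuming the rate stays steady over the period you multiply by.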

Creating a good metrics interface is a lot of work, which is part of the reason we preferred offering raw access so you can query the data with other tools like Grafana.

enstyled · September 14, 2021, 12:57pm

@jerome can you please share an example PromQL query to get the HTTP status codes as a count instead of a rate/sec, still bucketed by code?

jerome · September 14, 2021, 1:01pm

Sure, something like:

sum by (status)(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"})

This is assuming you’re using the provided dashboard with $app, $region and $host variables. If not you can use a static value in the query like: app="your-app-name".

The rate() in the dashboard’s query provides the per-second calculation; if you remove it, as above, you just get the count. The sum is still necessary because we have series for all regions and hosts.

enstyled · September 14, 2021, 1:21pm

Thank you, @jerome !

This gives the count over time, but not the count per interval. E.g. if you have 1 req/second and the interval is 30 seconds, you are getting 30, 60, 90..... What should be the query to get the count per interval instead?

Sorry if that’s a noob question, I am a complete noob with Prometheus and PromQL.

jerome · September 14, 2021, 1:39pm

You can use increase() instead of rate().

Like:

sum by (status)(increase(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"}[30s]))

This will give you the increase of responses in a 30s window.
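As a rough illustration of the difference between the raw counter and a per-window increase, here is a small sketch using the 1 req/second example from above. The function name is hypothetical, and it assumes evenly spaced readings and a counter that never resets, which the real increase() does not require.

```typescript
// Turn cumulative counter readings into per-interval increases.
// Assumption: readings are evenly spaced and the counter never resets.
function perIntervalIncrease(cumulative: number[]): number[] {
  const increases: number[] = [];
  let previous: number | undefined;
  for (const value of cumulative) {
    if (previous !== undefined) {
      increases.push(value - previous);
    }
    previous = value;
  }
  return increases;
}

// 1 req/s read every 30 seconds: the raw counter keeps growing...
const rawCounter = [30, 60, 90, 120];
// ...while the per-interval view stays flat: [30, 30, 30]
const perInterval = perIntervalIncrease(rawCounter);
```

The first reading has no earlier value to compare against, so the result has one fewer entry than the input.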

ht1 · February 27, 2022, 1:11am

Would be nice to add more metrics on dashboard. Don’t want to deal with Grafana & Prometheus

Currently what I would like to have:

- Switching between rate / count of HTTP status group by
- List of the urls (especially not found ones - I know I can do that on my own via logging but would be nice to see on dashboard as aggregated)
- Maybe response times from seconds to milliseconds (some frameworks serve requests in nanoseconds these days, e.g. Elixir)

ht1 · February 28, 2022, 12:08pm

I don’t think this is correct, and it’s also not helpful. I would like to see the count of each status code.

Here’s my code for 404s. I see the 404s in the logs, and the client receives the 404 status code, but the metrics show them as successful 200s?

  app.use(function (req: Request, res: Response) {
    log.warn(
      { method: req.method, url: req.url, status: 404, body: req.body },
      "404 - NOT FOUND"
    );
    return res.status(404).send({ message: "The Route NOT FOUND!" });
  });

jerome · February 28, 2022, 12:45pm

Those are rates, so a single 404 request might not show up with a significant digit here. The fact that it shows 0 is probably because it recorded something but our UI trims too many fractional digits, rounding to 0. We should probably change that.
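To see why a lone 404 can vanish, here is a small sketch of the numbers involved. The 15-second window comes from earlier in this thread; the number of fractional digits the chart actually keeps is not stated here, so the digit counts below are assumptions and `formatRate` is a made-up helper.

```typescript
// One response inside a 15-second window, expressed as a per-second rate.
const windowSeconds = 15;
const singleResponseRate = 1 / windowSeconds; // about 0.0667 responses per second

// Assumption: the chart formats values with a fixed number of fractional
// digits. The real digit count used by the UI is not given in this thread.
function formatRate(ratePerSecond: number, fractionDigits: number): string {
  return ratePerSecond.toFixed(fractionDigits);
}

const coarse = formatRate(singleResponseRate, 0); // "0"
const finer = formatRate(singleResponseRate, 3); // "0.067"
```

So a single 404 is recorded, but with few enough fractional digits it is displayed as 0.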

Metrics help give an idea of “trends”. If you need precision, then you probably want to look into “tracing” as a concept.

You could also use increase instead of rate to see “how many http status code responses since the last interval”. If you don’t want to deal with Grafana, then this is not currently feasible.

I’ve been wondering if there’s a practical, useful, way to display basic metrics in the terminal, via flyctl

johan · June 18, 2022, 12:53am

@jerome For inspiration: GitHub - slok/grafterm: Metrics dashboards on terminal (a grafana inspired terminal version)

and a few more, for the rusticans among us:
