Metrics: `HTTP STATUS CODES` values are confusing

Currently I’m seeing values in the metrics for HTTP STATUS CODES of 1.7 for HTTP 200 responses. There are no details on what this means. 1.7k? 1.7 times the average (with no details on what the average is)?

Could the metrics for the response codes be tweaked to show the total number for a period? 15min blocks of time should be adequate. It’s more useful to see 1.4k 200 responses and 132 500 responses over 15mins than an arbitrary number.

Maybe it could even be split per resp. code with percentiles, in the same way the response time is done?

Alternatively, some help text explaining what the values represent would do. I’m assuming response time is in seconds, and all the other charts are understandable, but the values for response codes are… cryptic.

We haven’t worked much on the metrics page lately, but we’ve made all its metrics (and more) available to everyone via a prometheus-compatible query interface. You can read more about it here: Feature preview: Custom metrics

To answer your specific questions:

  • Status codes are a rate over 15 seconds. So that’s the number of requests per second that returned the status code calculated over a window of 15s. You can sort of view this by clicking “View query” at the bottom right of the chart. This is also the exact query you can use in your own Grafana installation.
  • Response times are indeed in seconds.
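To make the first bullet concrete, here is a minimal sketch of how a displayed per-second rate maps to an approximate count over a longer period. The helper name `approxCount` is hypothetical, and the sketch assumes the rate stays constant for the whole period, which real traffic rarely does.

```typescript
// Hypothetical helper: approximate number of responses from a per-second rate,
// assuming the rate is constant for the whole period.
function approxCount(ratePerSecond: number, periodSeconds: number): number {
  return Math.round(ratePerSecond * periodSeconds);
}

// A displayed value of 1.7 means roughly 1.7 responses per second.
const per15min = approxCount(1.7, 15 * 60);
console.log(per15min); // 1530
```

So a chart value of 1.7 for HTTP 200 corresponds to roughly 1.5k responses per 15 minutes at that traffic level.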

Creating a good metrics interface is a lot of work, which is part of the reason we preferred offering raw access so you can query the metrics with other tools like Grafana.

@jerome can you please share an example Promql query to get the HTTP status codes in count instead of rate/sec, still bucketed by code?

Sure, something like:

sum by (status)(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"})

This is assuming you’re using the provided dashboard with $app, $region and $host variables. If not you can use a static value in the query like: app="your-app-name".

The rate() in the dashboard query provides the per-second calculation; if you remove it, you just get the count. The sum is still necessary because we have series for all regions and hosts.

Thank you, @jerome !

This gives the count over time, but not the count per interval. E.g. if you have 1 req/second and the interval is 30 seconds, you are getting 30, 60, 90..... What should be the query to get the count per interval instead?

Sorry if that’s a noob question, I am a complete noob with Prometheus and PromQL :smiley:

You can use increase() instead of rate().

Like:

sum by (status)(increase(fly_edge_http_responses_count{app=~"^$app$",region=~"^$region$",host=~"^$host$"}[30s]))

This will give you the increase of responses in a 30s window.
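To illustrate the difference between the raw counter, increase() and rate(), here is a simplified sketch over a handful of made-up samples. The `Sample` type and both functions are illustrative only; real Prometheus also handles counter resets and extrapolates to the window boundaries, which this sketch ignores.

```typescript
// Cumulative counter samples: timestamp in seconds and total responses so far.
type Sample = { t: number; value: number };

// Simplified increase(): last value minus first value inside the window.
function increase(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  return samples[samples.length - 1].value - samples[0].value;
}

// Simplified rate(): the increase divided by the elapsed seconds.
function rate(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase(samples) / elapsed : 0;
}

// Made-up data: 1 request per second, sampled every 15 seconds.
// The raw counter keeps growing (100, 115, 130, ...).
const samples: Sample[] = [
  { t: 0, value: 100 },
  { t: 15, value: 115 },
  { t: 30, value: 130 },
];

console.log(increase(samples)); // 30 responses in the 30s window
console.log(rate(samples));     // 1 response per second
```

The raw counter answers "how many since the series started", increase() answers "how many in this window", and rate() answers "how many per second in this window".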

Would be nice to add more metrics on dashboard. Don’t want to deal with Grafana & Prometheus :slight_smile:

Currently what I would like to have:

  • Switching between rate / count of HTTP statuses, grouped by code
    • List of the URLs (especially not-found ones; I know I can do that on my own via logging, but it would be nice to see them aggregated on the dashboard)
  • Maybe response times in milliseconds instead of seconds (some frameworks serve requests in nanoseconds these days, e.g. Elixir)

I don’t think this is correct, and it’s also not helpful; I would like to see the count of each status code.

Here’s my code for 404. I see the 404s in the logs, and the client receives the 404 status code. But the metrics show them as successful 200s? :thinking:

  app.use(function (req: Request, res: Response) {
    log.warn(
      { method: req.method, url: req.url, status: 404, body: req.body },
      "404 - NOT FOUND"
    );
    return res.status(404).send({ message: "The Route NOT FOUND!" });
  });

Those are rates, so a single 404 request might not show up with a significant digit here. The fact that it shows 0 is probably because it recorded something but our UI trims too many fractional digits, rounding to 0. We should probably change that.
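As a rough illustration of why a single 404 can vanish from the chart, the sketch below turns one response in a 15-second window into a per-second rate and formats it with different numbers of fractional digits. The formatting calls are an assumption for illustration; the dashboard's actual rounding is not specified in this thread.

```typescript
// One 404 response inside a 15-second window, expressed as a per-second rate.
const windowSeconds = 15;
const ratePerSecond = 1 / windowSeconds; // about 0.0667

// Hypothetical display formatting, not the dashboard's real code.
const trimmed = ratePerSecond.toFixed(0);  // "0"
const detailed = ratePerSecond.toFixed(3); // "0.067"
console.log(trimmed, detailed);
```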

Metrics help give an idea of “trends”. If you need precision, then you probably want to look into “tracing” as a concept.

You could also use increase instead of rate to see “how many http status code responses since the last interval”. If you don’t want to deal with Grafana, then this is not currently feasible.

I’ve been wondering if there’s a practical, useful, way to display basic metrics in the terminal, via flyctl :thinking:

@jerome For inspiration: GitHub - slok/grafterm: Metrics dashboards on terminal (a grafana inspired terminal version)

and a few more, for the rusticans among us: