Feature preview: Custom metrics

Thanks @steveberryman! I found rebuy-de/exporter-merger on GitHub, which merges Prometheus metrics from multiple sources, and it seems to work great. I've been able to merge both the Apache and PHP-FPM exporters and output their metrics together.

Vector would probably be the better choice, but the merger binary is a bit smaller, and I'm trying to keep the size of the VM down.
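For anyone else trying this, here's a sketch of the merger config from memory of the exporter-merger README (verify the exact keys against the repo; the ports are the common defaults for the Apache and PHP-FPM exporters, so adjust for your setup):

```yaml
# exporter-merger config (sketch): scrape both exporters and
# re-expose the merged result on a single port.
exporters:
  - url: http://127.0.0.1:9117/metrics   # apache_exporter default port
  - url: http://127.0.0.1:9253/metrics   # php-fpm_exporter default port
```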


Resurrecting this thread…

Any thoughts on using ChaosSearch as the back-end for logs/metrics?

@jerome Picking up my work on Grafana and Prometheus where I left off last year.

Quick question: I'm running 2 apps with volumes attached but don't seem to be able to find fly_volume_size_bytes as documented here: Metrics on Fly. Any ideas?

This seems related to a bug that’s been happening for a while. Sounds like you might be the only user of this metric!

I’m working on a fix, it might take a little bit.

Thanks Jerome, no pressure. I’m building up a dashboard and wanted to add a gauge to expose volume info. Happy to contribute it back.

Since I have your attention two more related questions:

  1. You are exposing a proxy_id label with values green and blue. What does this mean?

  2. What does fly_app_concurrency represent?

proxy_id is there to make sure the counters reset when we deploy; otherwise it leads to some weird metrics. It's possible to have 2 proxies running concurrently while we gracefully shut down connections during a deploy.

fly_app_concurrency is the current number of connections (or requests, depending on your concurrency config) established to your app instance.
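For example (a sketch, with a hypothetical app name), to graph total concurrency without splitting the series when both blue and green proxies are briefly live during a deploy, you can aggregate away the proxy_id label:

```promql
sum without (proxy_id) (fly_app_concurrency{app="my-app"})
```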


Thanks!

Last question (I hope) related to metrics: does the fly-cache-status Cache Hits metric require the HTTP handler, or does it work regardless? I have a custom cache status header I can change, which would allow me to easily expose a cache hit ratio in Grafana.

It does require the HTTP handler; we don't parse HTTP response headers when the handler isn't set up.

I’ve done some work on the volume metrics. Today I’m going to do some more and try to resolve this situation!

@johan volume metrics should now be back.


Thanks @jerome. Works great.

Edit: I have two apps, a single-instance staging app and a production app with 2 instances. I'm getting fly_volume_used_pct data for the single-instance app, but not for the multi-instance app. The query is basic:

Works:

max by(region) (fly_volume_used_pct{app=~"jt-web-staging", region=~".*", host=~".*"})

Fails:

max by(region) (fly_volume_used_pct{app=~"jt-web-production", region=~".*", host=~".*"})
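For reference, a quick presence check I can run to see which app/host combinations are reporting the series at all (sketch):

```promql
count by (app, host) (fly_volume_used_pct)
```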

Any ideas?

@jerome Not sure if you saw the edit to my reply. Just a quick nudge in case.

Hi, I am having an issue where my custom metrics are not appearing in managed Grafana or Prometheus.

/metrics on port 9091, as defined in fly.toml, is definitely getting hit every 15s, but no new metrics have appeared in the metrics browser, nor when I query the Prometheus API at https://api.fly.io/prometheus/ORG_NAME/api/v1/label/__name__/values.
This is an example line from the output at the app's metrics endpoint: my_metric_name{process="myAppName"} 525447

The app in question is a multi-process app, where the metrics endpoint is only exposed for one process out of two.

Would appreciate any advice you can offer for debugging this issue. Many thanks.

Is your metrics server bound to 0.0.0.0? This is required for our hosts to be able to reach it.
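For anyone hitting this, here's a minimal sketch of a metrics endpoint bound to all interfaces, using only the Python standard library (the metric line and port 9091 are just examples matching the post above; a real exporter would use a Prometheus client library):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves a static Prometheus exposition at /metrics."""

    def do_GET(self):
        if self.path == "/metrics":
            # Example metric line in the Prometheus text format.
            body = b'my_metric_name{process="myAppName"} 525447\n'
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def make_server(port=9091):
    # Bind to 0.0.0.0 (all interfaces), not 127.0.0.1/localhost,
    # so a scraper outside the VM's loopback can reach the endpoint.
    return HTTPServer(("0.0.0.0", port), MetricsHandler)
```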


Thanks for the reply! The server was bound correctly; there was just an issue with my custom exporter implementation, which I have now resolved. It would be cool if scraping errors could be exposed somehow. If anyone else hits this, promtool, which ships with Prometheus, is really useful for diagnosing these sorts of issues, e.g. curl -s http://0.0.0.0:9091/metrics | promtool check metrics


hihi :wave: not sure if this is the right place to spam but here goes: I wanted to try Fly.io metrics out. I think I’ve got a service correctly deployed + barfing metrics and my fly.toml is configured correctly.

My problem is that this link doesn’t work: Sign In · Fly. I created this org less than an hour ago.

I can query the Prometheus metrics for this org:

home-cluster/headscale [main●] » TOKEN=$(flyctl auth token)

home-cluster/headscale [main●] » ORG_SLUG=samcday-headscale
home-cluster/headscale [main●] » curl https://api.fly.io/prometheus/$ORG_SLUG/api/v1/query \
  --data-urlencode 'query=sum(increase(fly_edge_http_responses_count)) by (app, status)' \
  -H "Authorization: Bearer $TOKEN"

{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{"app":"samcday-headscale","status":"200"},"value":[1665233758,"2"]}]}}

Yes, this is something we’re working on!

There’s a known issue where new/updated orgs aren’t synchronized with the hosted Grafana instance (the sync only happens on a session refresh), so if you’re already signed in it can take a couple of hours to update. We’re working on a fix, but in the meantime you can manually refresh your session by going to fly-metrics.net/logout, which will force an update.


All of fly-metrics.net seems to be down, including when linked directly from the fancy new button on the dashboard:

❯ curl -vvv 'https://fly-metrics.net/'
*   Trying 168.220.81.101:443...
* Connected to fly-metrics.net (168.220.81.101) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (OUT), TLS handshake, Client hello (1):
* error:02FFF036:system library:func(4095):Connection reset by peer
* Closing connection 0
curl: (35) error:02FFF036:system library:func(4095):Connection reset by peer
❯ curl -vvv 'http://fly-metrics.net/'
*   Trying 168.220.81.101:80...
* Connected to fly-metrics.net (168.220.81.101) port 80 (#0)
> GET / HTTP/1.1
> Host: fly-metrics.net
> User-Agent: curl/7.84.0
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

The managed Grafana instance is currently running on a host in ewr that had a failure this morning: Fly.io Status - Host failure in EWR

It looks recovered at this point; Fly Metrics’ own metrics show an impact between 2022-10-28T17:14:00Z and 2022-10-28T19:58:00Z.

We plan to eventually make this service more reliable with a replicated database cluster but haven’t gotten to that point yet, which is why it was affected by this single-host issue. Sorry for the inconvenience!

Currently, it seems the Prometheus collector does not support OpenMetrics, and more specifically its info type. When using this type (defined as Info in the official prometheus_client Rust crate), no metrics appear on the hosted Grafana instance (the scrape fails). When any reference to the Info type is removed from the returned metrics, scraping succeeds.

It would be nice for

  1. Fly to somehow convey scrape errors to the user, either through fly logs or on the dashboard.

  2. and/or support for OpenMetrics (specifically the info type) to be added to the metrics scraper.

As for 2: from my testing, the latest version of VictoriaMetrics supports OpenMetrics and the info type, provided the content type is returned as application/openmetrics-text.
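For reference, here's a sketch of what an OpenMetrics exposition containing an info metric looks like on the wire (the metric family name and labels are hypothetical). The classic Prometheus text format has no info type and no # EOF terminator, which is one way a classic-format-only scraper can choke on this output:

```python
# OpenMetrics content type an exporter should advertise for this format.
OPENMETRICS_CONTENT_TYPE = "application/openmetrics-text; version=1.0.0; charset=utf-8"

# In OpenMetrics, a metric family `build` of type `info` exposes samples
# suffixed `_info` with a constant value of 1; the labels carry the payload.
OPENMETRICS_BODY = (
    "# TYPE build info\n"
    "# HELP build Build information.\n"
    'build_info{version="1.2.3",commit="abc123"} 1\n'
    "# EOF\n"  # required terminator in OpenMetrics, absent in the classic format
)
```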