Feature preview: Custom metrics

It adds (further) complexity, but probably the best option is to run a third process to merge the exporters' metrics. Vector, for example, could ingest metrics from both exporters via a prometheus_scrape source and then expose the merged metrics through a prometheus_exporter sink. You'd then use that URL/port in the metrics block.
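
Something like this Vector config would do it (a rough sketch; the local exporter ports and the merged 9091 port are assumptions):

[sources.apache]
type = "prometheus_scrape"
endpoints = ["http://127.0.0.1:9117/metrics"]

[sources.phpfpm]
type = "prometheus_scrape"
endpoints = ["http://127.0.0.1:9253/metrics"]

[sinks.merged]
type = "prometheus_exporter"
inputs = ["apache", "phpfpm"]
address = "0.0.0.0:9091"

The metrics block in fly.toml would then point at the sink's address:

[metrics]
  port = 9091
  path = "/metrics"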

Thanks @steveberryman! I found GitHub - rebuy-de/exporter-merger: Merges Prometheus metrics from multiple sources, and that seems to work great. I've been able to merge both the Apache and PHP-FPM exporters and output their metrics.

Vector would probably be the better choice, but the merger binary is a bit smaller, and I'm trying to keep the size of the VM down.

Resurrecting this thread…

Any thoughts on using ChaosSearch as the back-end for logs/metrics?

@jerome Picking up my work on Grafana and Prometheus where I left off last year.

Quick question: I'm running 2 apps with volumes attached but don't seem to be able to find fly_volume_size_bytes as documented here: Metrics on Fly. Any ideas?

This seems related to a bug that’s been happening for a while. Sounds like you might be the only user of this metric!

I’m working on a fix, it might take a little bit.

Thanks Jerome, no pressure. I’m building up a dashboard and wanted to add a gauge to expose volume info. Happy to contribute it back.
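
The kind of gauge I have in mind would be driven by something like this (just a sketch; the app name is a placeholder):

max by (host) (fly_volume_size_bytes{app="my-app"})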

Since I have your attention, two more related questions:

  1. You are exposing a proxy_id label with values green and blue. What does this mean?

  2. What does fly_app_concurrency represent?

proxy_id - this is there to make sure the counters reset when we deploy; otherwise it leads to some weird metrics. It's possible we can have 2 proxies running concurrently while we gracefully shut down connections during a deploy.

fly_app_concurrency - this is the current number of connections (or requests, depending on your concurrency config) established to your app instance
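
As an illustration (assuming the counter you're querying carries the proxy_id label, and with the app name as a placeholder), you'd normally aggregate proxy_id away, and you can compare fly_app_concurrency against your configured limits:

sum without (proxy_id) (increase(fly_edge_http_responses_count[1h]))

max by (app, region) (fly_app_concurrency{app="my-app"})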

Thanks!

Last question (I hope) related to metrics: does fly-cache-status (Cache Hits in Metrics) require the HTTP handler, or does it work regardless? I have a custom cache status header I can change, which would allow me to easily expose a cache hit ratio in Grafana.

It does require the HTTP handler; we don't parse HTTP response headers when the handler isn't set up.
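
Concretely, that means the service port handling the traffic needs the http handler in fly.toml, roughly like this (port numbers are placeholders):

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    handlers = ["http"]
    port = 80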

I’ve done some work on the volume metrics. Today I’m going to do some more and try to resolve this situation!

@johan volume metrics should now be back.

Thanks @jerome. Works great.

Edit: I have two apps, one staging with a single instance and one production with 2 instances. I'm getting fly_volume_used_pct data for the single-instance app, but not for the multi-instance app. The query is basic:

Works:

max by(region) (fly_volume_used_pct{app=~"jt-web-staging", region=~".*", host=~".*"})

Fails:

max by(region) (fly_volume_used_pct{app=~"jt-web-production", region=~".*", host=~".*"})

Any ideas?

@jerome Not sure if you saw the edit to my reply. Just a quick nudge just in case.

Hi, I am having an issue where my custom metrics are not appearing in managed Grafana or Prometheus.

/metrics on port 9091 as defined in fly.toml is definitely getting hit every 15s, but no new metrics have appeared in the metrics browser, nor when I query the Prometheus API at https://api.fly.io/prometheus/ORG_NAME/api/v1/label/__name__/values.
This is an example line from the output at the app's metrics endpoint: my_metric_name{process="myAppName"} 525447

The app in question is a multi-process app, where the metrics endpoint is only exposed for one process out of two.

Would appreciate any advice you can offer for debugging this issue. Many thanks.

Is your metrics server bound to 0.0.0.0? This is required to access it from our hosts.
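
For example, with the Python prometheus_client (just a sketch; the port and metric name are placeholders), the server has to listen on 0.0.0.0 rather than 127.0.0.1:

import time

from prometheus_client import Counter, start_http_server

# Placeholder metric for illustration only
requests_total = Counter("my_requests_total", "Total requests handled")

if __name__ == "__main__":
    # addr="0.0.0.0" makes /metrics reachable from outside the VM's loopback interface
    start_http_server(9091, addr="0.0.0.0")
    while True:
        time.sleep(60)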

Thanks for the reply! The server was bound correctly; there was just an issue with my custom exporter implementation, which I have now resolved. It would be cool if scraping errors could be exposed somehow. If anyone else has this issue, promtool, which ships with Prometheus, is really useful for diagnosing these sorts of issues. E.g. curl -s http://0.0.0.0:9091/metrics | promtool check metrics

hihi :wave: not sure if this is the right place to spam but here goes: I wanted to try Fly.io metrics out. I think I’ve got a service correctly deployed + barfing metrics and my fly.toml is configured correctly.

My problem is that this link doesn't work: Sign In · Fly. I created this org less than an hour ago.

I can query the Prometheus metrics for this org:

home-cluster/headscale [main●] » TOKEN=$(flyctl auth token)

home-cluster/headscale [main●] » ORG_SLUG=samcday-headscale
home-cluster/headscale [main●] » curl https://api.fly.io/prometheus/$ORG_SLUG/api/v1/query \
  --data-urlencode 'query=sum(increase(fly_edge_http_responses_count)) by (app, status)' \
  -H "Authorization: Bearer $TOKEN"

{"status":"success","isPartial":false,"data":{"resultType":"vector","result":[{"metric":{"app":"samcday-headscale","status":"200"},"value":[1665233758,"2"]}]}}%    

Yes, this is something we’re working on!

There's a known issue where new/updated orgs aren't synchronized with the hosted Grafana instance (the sync only happens on a session refresh), so if you're already signed in it can take a couple of hours to get updated. We're working on a fix, but in the meantime you can manually refresh your session by going to fly-metrics.net/logout, which will force an update.

All of fly-metrics.net seems to be down, including when linked directly from the fancy new button on the dashboard:

❯ curl -vvv 'https://fly-metrics.net/'
*   Trying 168.220.81.101:443...
* Connected to fly-metrics.net (168.220.81.101) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (OUT), TLS handshake, Client hello (1):
* error:02FFF036:system library:func(4095):Connection reset by peer
* Closing connection 0
curl: (35) error:02FFF036:system library:func(4095):Connection reset by peer
❯ curl -vvv 'http://fly-metrics.net/'
*   Trying 168.220.81.101:80...
* Connected to fly-metrics.net (168.220.81.101) port 80 (#0)
> GET / HTTP/1.1
> Host: fly-metrics.net
> User-Agent: curl/7.84.0
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

The managed Grafana instance is currently running on a host in ewr that had a failure this morning: Fly.io Status - Host failure in EWR

It looks recovered at this point; Fly Metrics' own metrics are showing an impact between 2022-10-28T17:14:00Z and 2022-10-28T19:58:00Z.

We plan to eventually make this service more reliable with a replicated database cluster but haven’t gotten to that point yet, which is why it was affected by this single-host issue. Sorry for the inconvenience!