Similar to the log shipper, are there any plans to provide a metrics shipper?
I need to store metrics in my own Prometheus/VictoriaMetrics (or something similar) so that I can keep them for a longer time and also do alerting based on them.
I can cobble together some metrics for my app, but there are a ton of metrics related to the edge proxy that I can’t get access to from within the VMs.
Can I scrape your Prometheus or use your NATS infrastructure to get these metrics out of your platform?
HALP!
If you already have a working Prometheus/VictoriaMetrics cluster, you can totally slurp data from Fly.io Prometheus into it, using Prometheus’s “Federation” concept and facilities.
You’d probably have to set up “cross-service federation” to pull selected Fly.io metrics into your Prometheus cluster.
For Fly.io Prometheus, the endpoints are described here; you’ll notice the /federate endpoint is supported directly, so for the federation configuration you’d have to specify (I haven’t really tried it!):
targets: ['api.fly.io'] # may need to tweak the port etc
metrics_path: '/prometheus/YOUR_ORG/federate'
I think you’d want to set honor_timestamps: false.
And (crucially) you need to pass an authorization header in the Prometheus config for the scrape endpoint.
Here’s a potentially workable example (with the caveat: this was written for OpenTelemetry Collector, which is based heavily on Prometheus):
scrape_configs:
  - job_name: fly-collector # maybe not needed in prometheus
    scrape_interval: 3s # Go as high as you can so you don't kill the api!
    metrics_path: /prometheus/YOUR_ORG/api/v1/query
    honor_labels: true
    params:
      "query":
        - '{app=~"APP_YOU_WANT"}' # Only if you want to filter by app
    scheme: https
    static_configs:
      - targets: ["api.fly.io"]
    authorization:
      credentials: ${env:FLY_METRICS_TOKEN}
Let me know if this points you in the right direction.
Oh, you are okay with us scraping your API? That’s cool.
Would something like this work with 3000 apps with about 5000 machines backing them? I don’t know what sort of cardinality problems can come up. Do you test Prometheus federation at this kind of scale?
I checked with the team and it seems like it’s okay to scrape at that scale as long as you plan and time your query carefully to avoid unnecessary extra work.
A single scrape of /federate (without any app= label filter, see example below) should get all of your metrics in a single request. Setting the scrape interval as high as your needs can accommodate (5 minutes? 1 minute?) would also make sense. What would not make sense is scraping more frequently than every 20 seconds or so, as our own scraping interval is around that.
A more refined scrape-config example which I tested with Prometheus:
- job_name: 'flyprometheus'
  scrape_interval: 30s # Go as high as you can so you don't kill the api!
  metrics_path: /prometheus/personal/federate
  honor_labels: true
  params:
    "match[]":
      - '{__name__=~".+"}' # Matches everything, get all apps in one request
  scheme: https
  static_configs:
    - targets: ["api.fly.io"]
  authorization:
    credentials: YEAH_MY_TOKEN_HERE
One thing I found is that environment variables are NOT supported in prometheus.yml, so you’ll have to either hard-code the token in the config file or use credentials_file instead and put the token in a separate file. (This is orthogonal to Fly.io specifics. There’s a lot of frustrating discussion about why Prometheus does not support environment variable substitution in config files; the upshot is that it’s that way by design and is not going to change, so you need one of the alternatives I mentioned.)
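For the credentials_file route, a minimal sketch might look like the following. It is the same job as the tested example, with only the authorization block changed. The path /etc/prometheus/fly_token is a made-up placeholder, and I'm assuming the file holds nothing but the token and is readable by the Prometheus process; I have not run this exact variant against Fly.io.

```yaml
scrape_configs:
  - job_name: 'flyprometheus'
    scrape_interval: 30s # Go as high as you can so you don't kill the api!
    metrics_path: /prometheus/personal/federate # "personal" is the org slug; use your own
    honor_labels: true
    params:
      "match[]":
        - '{__name__=~".+"}' # Matches everything, get all apps in one request
    scheme: https
    static_configs:
      - targets: ["api.fly.io"]
    authorization:
      # Hypothetical path: Prometheus reads the token from this file
      # instead of having it inline in prometheus.yml.
      credentials_file: /etc/prometheus/fly_token
```

Keeping the token in its own file also means prometheus.yml can be committed or shared without leaking the credential, as long as the token file has tight permissions.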