I got Prometheus federation to work by scraping api.fly.io/prometheus/<< org >>/federate as the endpoint and using my org token for the authorization credentials.
This worked well for the last 2-3 days but started to fail about 2-3h ago; it now returns a "400 Bad Request".
I saw other topics suggesting that federation wasn’t supported in the past but figured I’d try anyway and was positively surprised that it worked.
Did I just get lucky, or did I run into a test feature that was reverted?
Being able to alert on fly_instances_up from my existing setup and pulling all the metrics into my local Prometheus server were both very convenient.
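For anyone else trying this, here is roughly the scrape config I used. The job name, interval, and the match[] selector are just illustrative, << org >> is a placeholder for your org slug, and I'm assuming the org token is passed as a Bearer token in the Authorization header:

```yaml
# Sketch of a Prometheus scrape job for Fly.io's /federate endpoint.
# Job name, interval, and match[] selector are illustrative choices;
# << org >> stands in for the org slug and is not a literal value.
scrape_configs:
  - job_name: "fly-federate"
    scrape_interval: 30s
    scheme: https
    metrics_path: "/prometheus/<< org >>/federate"
    params:
      "match[]":
        - '{__name__=~"fly_.*|pg_.*"}'   # select fly_* and pg_* series
    authorization:
      type: Bearer
      credentials: "<org token>"          # assumption: org token as Bearer auth
    honor_labels: true                    # keep the federated series' labels
    static_configs:
      - targets: ["api.fly.io"]
```

The match[] parameter is required by federation to select which series to pull; widen or narrow the regex to taste.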
I'm not sure of the specifics of what you're trying to do with Prometheus, but another customer previously wrote a Prometheus exporter for Fly which you might find useful.
Thanks Rahmat. I'm trying to pull all metrics out of the Fly.io Prometheus instance, including the pg_* metrics, the fly_* metrics (a lot more than just whether an instance is up: CPU and so on), plus all the custom app metrics that Fly scrapes for me.
The exporter only solves a tiny part of that; the majority of the interesting metrics I can't get that way.
Hi @oliver1,
The /federate endpoint should work! For a general reference, we currently expose all of the Prometheus querying API endpoints supported by VictoriaMetrics, which includes /federate.
The issue you saw was caused by some metrics-cluster changes yesterday (adding/removing storage nodes) that unintentionally caused this to stop working. I’ve fixed the issue so this endpoint should be working again now.
Hi @oliver1, thanks for reporting the issue, the federate endpoint was indeed broken and should now be working again, sorry for the inconvenience.
More detail: one of the servers in the metrics cluster rebooted and failed to rejoin the cluster cleanly. Metrics are replicated to multiple servers, so normal queries were unaffected by a single server going offline; the federate endpoint implementation, however, is particularly fragile (it prioritizes consistency over availability) and won't return its data when even a single storage node is offline. Bringing the server back online fixed the issue, and we'll be looking into ways to make this setup more reliable moving forward.