Prometheus Federation - worked for 2-3 days, now it's failing

I got Prometheus federation to work by using api.fly.io/prometheus/<< org >>/federate as the endpoint to scrape and using my org token for authorization credentials.
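
For anyone reproducing this, here is a minimal Python sketch of the request such a scrape sends. It rests on a few assumptions not stated above: the URL is served over https, the org token is sent as a Bearer credential (the default scheme for Prometheus authorization credentials), and the `match[]` selector is only an illustration (Prometheus federation requires at least one `match[]` parameter). `build_federate_request`, `my-org` and `ORG_TOKEN` are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_federate_request(org, token, matchers):
    """Build (but do not send) a GET request for the /federate endpoint."""
    if not matchers:
        # Federation needs at least one match[] selector.
        raise ValueError("at least one match[] selector is required")
    # Each selector is passed as a repeated, URL-encoded match[] parameter.
    query = urlencode([("match[]", m) for m in matchers])
    url = f"https://api.fly.io/prometheus/{org}/federate?{query}"
    # Assumption: the org token is accepted as a Bearer credential.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical org slug, token and selector, for illustration only.
req = build_federate_request("my-org", "ORG_TOKEN", ['{__name__=~"fly_.+"}'])
```

Passing `req` to `urllib.request.urlopen` would perform the scrape; a "400 Bad Request" would surface there as an `HTTPError`.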

This worked well for the last 2-3 days but started to fail about 2-3 hours ago; it now returns a "400 Bad Request".

I saw other topics suggesting that federation wasn’t supported in the past but figured I’d try anyway and was positively surprised that it worked.
Did I just get lucky, or did I run into a test that has since been reverted?

Being able to alert on fly_instances_up from my existing setup and getting all the metrics from my local prom server were both very convenient.
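
To make the fly_instances_up check concrete, here is a rough Python sketch that scans federated output (Prometheus text exposition format) for samples of that metric. The sample data, the `app` label, and the reading of a value of 0 as "down" are all assumptions for illustration, and the parser is deliberately simplified rather than a full implementation of the format.

```python
def parse_samples(text, metric):
    """Return (labels_text, value) pairs for every sample of `metric`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        if name != metric:
            continue
        # The value comes first; an optional timestamp may follow it.
        samples.append((labels, float(tail.split()[0])))
    return samples


def instances_down(text):
    """Label sets whose fly_instances_up value is 0 (assumed to mean down)."""
    return [labels for labels, value in parse_samples(text, "fly_instances_up")
            if value == 0]


# Made-up sample output; real label names and values may differ.
SAMPLE = """# TYPE fly_instances_up untyped
fly_instances_up{app="web"} 1 1700000000000
fly_instances_up{app="api"} 0 1700000000000
some_other_metric{app="web"} 0 1700000000000
"""

print(instances_down(SAMPLE))
```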

Hi @oliver1

I am not sure of the specifics of what you’re trying to do with Prometheus, but another customer previously wrote a Prometheus exporter for Fly which you might find useful.

Thanks Rahmat. I’m trying to pull all metrics out of the fly.io Prometheus instance, including the pg_* metrics, the fly_* metrics (which cover a lot more than whether an instance is up, such as CPU), and all the custom app metrics that fly scrapes for me.
The exporter only solves a tiny part of that; I can’t get the majority of the interesting metrics that way.

Hi @oliver1,
The /federate endpoint should work! For a general reference, we currently expose all of the Prometheus querying API endpoints supported by VictoriaMetrics, which includes /federate.
The issue you saw was caused by some metrics-cluster changes yesterday (adding/removing storage nodes) that unintentionally caused this to stop working. I’ve fixed the issue so this endpoint should be working again now.

Can confirm, it’s working again, thanks for fixing this!

@wjordan - is this broken again? Seeing errors since around 8am UTC

hey @oliver1 are you still seeing these error messages?

We had an Anycast UDP outage yesterday that may have been causing the errors you were seeing, but that has been resolved.

Nope, still down as of right now.

Hi @oliver1, thanks for reporting the issue. The federate endpoint was indeed broken and should now be working again. Sorry for the inconvenience.

More detail: one of the servers in the metrics cluster rebooted and failed to rejoin the cluster cleanly. Metrics are replicated to multiple servers, so normal queries were unaffected by a single server going offline. The federate endpoint implementation, however, is particularly fragile (it prioritizes consistency over availability) and won’t return its data when even a single storage node is offline. Bringing the server back online fixed the issue, and we’ll be looking into ways to make this setup more reliable moving forward.
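
As a toy illustration of that difference (not the actual VictoriaMetrics implementation), the two behaviours can be modelled in a few lines of Python. The function names, the three-node cluster and the replication factor of 2 are assumptions made up for the example; the description above only says that metrics are replicated to multiple servers.

```python
def normal_query_ok(nodes_up, replication_factor):
    """Replicated reads succeed while fewer than `replication_factor` nodes are offline."""
    offline = sum(1 for up in nodes_up if not up)
    return offline < replication_factor


def federate_ok(nodes_up):
    """Consistency-first behaviour: refuse to answer unless every node is online."""
    return all(nodes_up)


# Hypothetical three-node cluster with one node offline after a reboot.
cluster = [True, True, False]
print(normal_query_ok(cluster, replication_factor=2))  # True
print(federate_ok(cluster))                            # False
```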

Can confirm, works again. Thx @wjordan !

@wjordan - is this broken again? I’m seeing errors starting last night.

Thanks for the heads up! Another server rebooted unexpectedly, should be up and running again now. Sorry about the interruption!

No worries, looks fine now, thanks!