Prometheus Federation - worked for 2-3 days, now it's failing

oliver1 · July 21, 2022, 1:36am

I got Prometheus federation to work by using api.fly.io/prometheus/<< org >>/federate as the endpoint to scrape and using my org token for authorization credentials.
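
For readers trying to reproduce this setup, here is a minimal sketch of the request a scraper would send. It only builds the request and does not send it. The org slug `my-org`, the token value, and the helper name `build_federate_request` are placeholders, and sending the org token as a `Bearer` credential is an assumption, not something confirmed in this thread. The `match[]` parameter is the standard Prometheus federation selector.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_federate_request(org_slug, token, matchers):
    """Build (but do not send) a GET request for the /federate endpoint."""
    # Federation selects series through one match[] parameter per selector.
    query = urlencode([("match[]", m) for m in matchers])
    url = f"https://api.fly.io/prometheus/{org_slug}/federate?{query}"
    # Assumption: the org token is passed as a Bearer credential.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical values for illustration only.
req = build_federate_request(
    "my-org",
    "example-token",
    ['{__name__=~"fly_.*"}', '{__name__=~"pg_.*"}'],
)
print(req.full_url)
```

In a real Prometheus server the same pieces would go into a scrape job (metrics path, `match[]` params, and authorization credentials) instead of a hand-built request.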

This worked well for the last 2-3 days but started to fail about 2-3 hours ago; it now returns a "400 Bad Request".

I saw other topics suggesting that federation wasn’t supported in the past but figured I’d try anyway and was positively surprised that it worked.
Did I just get lucky, or did I run into a test that was reverted?

Being able to alert on fly_instances_up from my existing setup and getting all the metrics from my local prom server were both very convenient.

rahmatjunaid · July 21, 2022, 2:48pm

Hi @oliver1

I’m not sure of the specifics of what you’re trying to do with Prometheus, but another customer previously wrote a Prometheus exporter for Fly which you might find useful.

oliver1 · July 21, 2022, 3:42pm

Thanks Rahmat. I’m trying to pull all metrics out of the fly.io Prometheus instance, including the pg_* metrics, the fly_* metrics (which cover a lot more than whether an instance is up, e.g. CPU), plus all custom app metrics that Fly scrapes for me.
The exporter only solves a tiny part of that; I can’t get most of the interesting metrics that way.

wjordan · July 21, 2022, 4:48pm

Hi @oliver1,
The /federate endpoint should work! For a general reference, we currently expose all of the Prometheus querying API endpoints supported by VictoriaMetrics, which includes /federate.
The issue you saw was caused by some metrics-cluster changes yesterday (adding/removing storage nodes) that unintentionally caused this to stop working. I’ve fixed the issue so this endpoint should be working again now.

oliver1 · July 22, 2022, 12:29am

Can confirm, it’s working again, thanks for fixing this!

oliver1 · August 11, 2022, 12:33am

@wjordan - is this broken again? Seeing errors since around 8am UTC

rahmatjunaid · August 11, 2022, 12:23pm

hey @oliver1 are you still seeing these error messages?

We had an Anycast UDP outage yesterday that may have been causing the errors you were seeing, but that has been resolved.

oliver1 · August 11, 2022, 3:18pm

Nope, still down as of right now.

wjordan · August 11, 2022, 5:04pm

Hi @oliver1, thanks for reporting the issue. The federate endpoint was indeed broken and should now be working again. Sorry for the inconvenience.

More detail: one of the servers in the metrics cluster rebooted and failed to rejoin the cluster cleanly. Metrics are replicated to multiple servers, so normal queries were unaffected by a single server going offline. The federate endpoint implementation, however, is particularly fragile (it prioritizes consistency over availability) and won’t return its data when even a single storage node is offline. Bringing the server back online fixed the issue, and we’ll be looking into ways to make this setup more reliable moving forward.
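
To make the difference concrete, here is a toy model of the behavior described above. It is not the actual VictoriaMetrics or Fly implementation; the class, the function names, and the replication factor of 2 are all made up for illustration. It only contrasts a query path that tolerates offline nodes thanks to replication with a federate path that refuses to answer unless every storage node is online.

```python
from dataclasses import dataclass


@dataclass
class StorageNode:
    name: str
    online: bool


def normal_query_ok(nodes, replication_factor):
    """Replicated reads: complete while fewer than R nodes are offline."""
    offline = sum(1 for n in nodes if not n.online)
    # Each sample is stored on `replication_factor` distinct nodes, so at
    # least one copy survives as long as fewer than that many are down.
    return offline < replication_factor


def federate_ok(nodes):
    """Consistency-first: answer only if every storage node is online."""
    return all(n.online for n in nodes)


# Hypothetical three-node cluster where one node failed to rejoin.
cluster = [
    StorageNode("storage-1", True),
    StorageNode("storage-2", False),
    StorageNode("storage-3", True),
]
print(normal_query_ok(cluster, replication_factor=2))  # True
print(federate_ok(cluster))  # False
```

Under this model a single rebooted node leaves dashboards and ad hoc queries working while federation scrapes fail, which matches the symptoms reported in this thread.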

oliver1 · August 12, 2022, 3:23am

Can confirm, works again. Thx @wjordan !

oliver1 · October 12, 2022, 6:54pm

@wjordan - is this broken again? I’ve been seeing errors since last night.

wjordan · October 12, 2022, 7:24pm

Thanks for the heads up! Another server rebooted unexpectedly, should be up and running again now. Sorry about the interruption!

oliver1 · October 13, 2022, 1:38am

No worries, looks fine now, thanks!
