Grafana Cloud Fly dashboard/Prometheus issue?

Will you paste the curl output? What response are you actually getting?

curl 'https://api.fly.io/prometheus/avantgarde-finance/api/v1/status/buildinfo' --header 'Authorization: Bearer <ACCESS TOKEN>'
remoteAddr: "10.123.11.76:53346", X-Forwarded-For: "84.115.209.22, 77.83.143.220, 213.188.208.17, 205.234.149.66"; requestURI: /select/36393/prometheus/api/v1/status/buildinfo; unsupported path requested: "/select/36393/prometheus/api/v1/status/buildinfo"

Status code is 400

Try this one? That buildinfo path is not supported in VictoriaMetrics (what we use under the covers). It’s normal for that to error. This should return something though:

curl 'https://api.fly.io/prometheus/fly/api/v1/label/region/values' --header "Authorization: Bearer $(fly auth token)"

That works, but it doesn’t seem to be the path Grafana is trying to call. I’m fairly confident that it worked a few days or weeks ago.

That’s the path to the first error. Grafana is showing a 401, though, which is what you get when you’re not authenticated. Can you try re-entering your auth token in the Grafana data source?

Just did. The metrics and my dashboards work, but the query builder and Explore don’t (same error). So the auth token works for scraping the metrics but not for the other endpoints?

By the way, the endpoint it tries (and fails) to call when loading the list of available metrics in the Explorer / Inspector is:

https://avantgarde.grafana.net/api/datasources/32/resources/api/v1/label/__name__/values?start=1655222080&end=1655225680

And that returns a 401 despite the rest of the metrics working via the same datasource.

See if this works for you over curl?

curl 'https://api.fly.io/prometheus/avantgarde-finance/api/v1/label/__name__/values' --header "Authorization: Bearer $(fly auth token)" -D -

Not sure if related, but I stopped getting metrics from AMS and SJC at 9:20am PT today although the instances are still healthy and serving traffic. My project’s other region (HKG) is still getting metrics in Grafana. I have a separate project in AMS (same organization) that still has metrics.

Edit: fly restart fixed it, but a bit worrisome that metrics just stopped from some instances.

Having now reverse proxied the Prometheus requests from Grafana Cloud to the Fly API, I believe I can see the problem: Grafana Cloud does send the authorization: header when clicking Save & Test on the Prometheus source. It does not, however, send the authorization: header when fetching the list of metrics with Explore (and as a result Fly responds with an HTTP 401).

If, on the reverse proxy, I force the authorization: Bearer <fly auth token> into the requests - it then works (HTTP 200 from Fly).
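For illustration, the header-forcing step on the proxy amounts to roughly the sketch below. This is not the actual proxy configuration used here: `with_authorization` is a hypothetical helper name, and the token is assumed to be whatever `fly auth token` prints.

```python
from typing import Dict, Mapping


def with_authorization(headers: Mapping[str, str], token: str) -> Dict[str, str]:
    """Return a copy of the request headers with the bearer token forced in.

    Hypothetical helper for a reverse proxy sitting between Grafana Cloud
    and the Fly Prometheus API.
    """
    # HTTP header names are case-insensitive, so drop any existing
    # Authorization header regardless of how it is capitalised.
    forwarded = {
        name: value
        for name, value in headers.items()
        if name.lower() != "authorization"
    }
    # Assumption: `token` is the output of `fly auth token`.
    forwarded["Authorization"] = f"Bearer {token}"
    return forwarded
```

A proxy would call this on each incoming request's headers before forwarding the request to Fly, so requests that arrive without the header (such as the Explore metric list) still go out authenticated.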

With no knowledge of what is “normal” for Prometheus requests (i.e. are some available without authentication?) I don’t know if a request (without the authorization: header) for:

GET /prometheus/<orgname>/api/v1/rules

would normally get an HTTP 200 response.

If Prometheus requests, or at least some paths, are usually allowed (or have previously been allowed by Fly, i.e. prior to late last week) - without an authorization: header - then this may be resolvable by Fly.

If however authorization: has always been required by Fly for all Prometheus paths - then I can only assume Grafana Cloud have made a breaking change :cry: .
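To check how Fly responds to a given path today, the same request can be sent with and without the header and the status codes compared. A rough Python sketch, with assumptions: `build_request`, `status_of` and `compare` are hypothetical helper names, the base URL is the one from the curl commands earlier in this thread, and the token is assumed to come from `fly auth token`.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple

# Base URL taken from the curl commands earlier in the thread.
BASE = "https://api.fly.io/prometheus"


def build_request(org: str, path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for a Prometheus API path, optionally authenticated."""
    url = f"{BASE}/{org}/{path.lstrip('/')}"
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(url, headers=headers)


def status_of(req: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on error statuses; the code is what we want here.
        return err.code


def compare(org: str, path: str, token: str) -> Tuple[int, int]:
    """Return (status without Authorization, status with Authorization)."""
    return (
        status_of(build_request(org, path)),
        status_of(build_request(org, path, token)),
    )
```

Calling `compare("<orgname>", "api/v1/rules", token)` returns both status codes. This only shows current behaviour; it cannot tell whether unauthenticated requests were ever accepted in the past.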

Yep. Looks like a breaking change by Grafana Cloud. I’ve submitted a support ticket with Grafana Labs and will report back here once they reply.

Confirmed by Grafana Labs support:

Thank you for contacting Grafana Labs Support.

My name is Jay, and I am the support engineer assigned to assist you with your Ticket.

Based on 401 Auth error , it matched to our recent Grafana 9.0.0 upgrade causing sudden 401 auth errors for Prometheus bug.

Our engineering team is working on the fix, if you would like we can roll back to 8.5.5 as a temp solution

Can you see anywhere that the “recent Grafana 9.0.0 upgrade” was announced/documented?

I note in the Support page it says:

NOTE: Before you open a support ticket for a service problem, check status.grafana.com to see if there are any known issues.

Looks at https://status.grafana.com/ … yep, nothing mentioned.

Welcome to the Cloud; have a status page but don’t update it (a recurring theme).

I mean… It’s surprising that such a breaking change / bug happened in the first place, especially for a software product that is about metrics, logging and monitoring. During the rollout of this upgrade they should’ve seen a massive spike in error logs and reverted the rollout immediately. I’d guess nearly everyone using Grafana Cloud consumes one or more Prometheus data sources. So yeah, it’s definitely surprising that such an obvious bug wasn’t caught during testing and then also made it past the initial rollout without being reverted.

EDIT: I’ve asked Grafana Labs to comment on this too (how this happened in the first place, how it went unnoticed for so long, and why it didn’t cause them to roll back to the previous version).

If they do roll it back to 8.5.5, and users updated their Prometheus source’s Authorization details while on 9.0 during the problem period (understandably, as attempted above), this note from the Release notes for Grafana 9.0.0 (Grafana documentation) becomes relevant:

Any secret (data sources credential, alert manager credential, etc, etc) created or modified with Grafana v9.0 won’t be decryptable from any previous version (by default) because the way encrypted secrets are stored into the database has changed. Although secrets created or modified with previous versions will still be decryptable by Grafana v9.0.

They may (TBC) need to update them again post-downgrade.

@Whistler so is everyone waiting for that patch to fix this issue? Can we manually downgrade Grafana Cloud to 8.5.5?

I’ve assumed any rollback (per tenant?) would require intervention by Grafana, as I’m not sure whether a Grafana Cloud user can do this themselves.

As with Fly’s Prometheus metrics issue (with vector), I guess it’ll eventually be fixed and I’ll just wait for the resolution.

Looks like Grafana Cloud has been updated to v9.0.1-b253e87pre (b253e87d7) but the error still persists.

It appears that the latest update to Grafana Cloud v9.0.2-83956baf (c29f1c44c) has fixed it.

Yep. Also fixed for me.