Lack of Prometheus metrics, sometimes, after deploy (host specific?)

Whistler · June 11, 2022, 11:11am

Sometimes after a deploy the Prometheus metrics appear to stop. Redeploying appears to solve the problem (I haven’t yet tried just a restart to see if that makes a difference).

Not sure if it’s a problem on specific hosts? My deploys thus far have only been in lhr, and my app scale count is 1.

On a previous occasion I left it ~30mins before a redeploy and it didn’t resolve itself so I don’t think it’s just a question of waiting.

Note: the problem is that the metrics aren’t being requested (i.e. nothing inbound from Fly on 9091), as opposed to issues processing the metrics requests/data (either within the app or Fly). It’s also unrelated to my other Grafana Cloud/Prometheus thread; I think this specific problem has been occurring for a while, whereas I only noticed the one in the other thread on Friday.
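One way to confirm "nothing inbound" from inside the app is to have the metrics endpoint record when it was last requested. Below is a minimal sketch using only the Python standard library. It assumes a plain HTTP endpoint at /metrics on port 9091, as in this thread; the handler, the `metrics_scrapes_total` counter, and the helper names are hypothetical and not part of any Fly API.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical bookkeeping: when /metrics was last requested, and how often.
last_scrape = {"at": None, "count": 0}
_lock = threading.Lock()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string when matching the path.
        if self.path.split("?", 1)[0] != "/metrics":
            self.send_error(404)
            return
        with _lock:
            last_scrape["at"] = time.time()
            last_scrape["count"] += 1
            count = last_scrape["count"]
        # Expose the scrape count itself in the Prometheus text format.
        body = (
            "# TYPE metrics_scrapes_total counter\n"
            f"metrics_scrapes_total {count}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging


def seconds_since_last_scrape(now=None):
    """Return seconds since the last /metrics request, or None if never scraped."""
    with _lock:
        at = last_scrape["at"]
    if at is None:
        return None
    return (time.time() if now is None else now) - at


def serve(port=9091):
    """Start the metrics server on a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

If `seconds_since_last_scrape()` stays `None` or keeps growing after a deploy, the collector is not polling that instance, which is the symptom described here.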

Whistler · June 13, 2022, 10:13am

No joy with restarts fixing the issue (I assume a restart remains on the same host?). I also tried a number of deploys (on Saturday evening) to no avail - I still don’t have Prometheus metrics from the app in lhr.

I’ve since added the lax region, and across numerous deploys the result is the same: the one lax instance (e.g. on host 3944) has Prometheus metrics, while the one lhr instance has none. My other apps in lhr still have Prometheus metrics, but they may well stop after their next deploy.

Fly: would it be possible to provide a FLY_HOST environment variable with the host’s ID?

I believe the host ID is currently only available via the Prometheus metrics(?), so if those aren’t working, as in this thread, there is no way to determine it.

avinashbot · June 13, 2022, 1:45pm

Also facing this problem, and it only seems to be happening with new deploys (got an app that hasn’t been updated in a while and it’s cranking out metrics just fine). Every time I make a deployment, the metrics start working again for a short time (around 30 minutes or thereabouts) and then just go silent.

(noting this is happening with multiple apps in fra)

steveberryman · June 13, 2022, 10:40pm

I’ve hopefully fixed this particular issue with metrics. Our collector (vector) appeared to have stopped reloading its configs on a few hosts. I’ve given it a kick across the fleet. Can you let me know if things have improved?

avinashbot · June 14, 2022, 6:54am

Yep, the metrics are rolling in again for me, thank you!

sashaafm · June 14, 2022, 10:53am

@steveberryman I believe I’ve also lost my Prometheus metrics? I’ll redeploy to see if that fixes it.

Whistler · June 14, 2022, 11:06am

Metrics are now appearing for me (the pre-existing instances started working again, and they are still ok post-deploy), thanks.

Sashaafm - I didn’t need to redeploy, they started to appear again after Steve’s vector intervention.

Avinashbot - have you tried another deploy in fra? Your problem, with metrics stopping ~30mins after a deploy (which I haven’t experienced), may still exist.

Steve - Do you think there is any possibility of a FLY_HOST environment variable being added at some point in future (e.g. it would have made it easy for me to provide some hosts that had been working vs. those that hadn’t)? Also, do you happen to use Grafana and, if so, please could you try clicking Explore with a Fly Prometheus source and/or load the Fly dashboard (see other thread)?

sashaafm · June 14, 2022, 11:36am

I redeployed but I still don’t have metrics in the Prometheus instance

The curl seems to be working correctly:

curl https://api.fly.io/prometheus/<SLUG>/api/v1/query_range\?step\=30 \
    --data-urlencode 'query=sum(rate(fly_edge_http_responses_count{app="$APP"}[5m])) by (status)' \
    -H "Authorization: Bearer <TOKEN>"

{"status":"success","isPartial":false,"data":{"resultType":"matrix","result":[]}}

However, I can’t seem to find any metrics in my Grafana (it was previously working correctly).

The Data Source tests okay in Grafana’s UI.

EDIT: looks like not even the default prometheus_* metrics appear
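As a side note on reading the curl output above: in the Prometheus HTTP API response format, `"status":"success"` with an empty `result` list means the query ran but matched no series, so the request succeeding does not show that any metrics exist. A minimal sketch of that check follows; the function name is hypothetical. (Separately, if `$APP` in the command was meant to be a shell variable, the single quotes around the query would stop the shell from expanding it.)

```python
import json


def describe_query_response(raw):
    """Classify a Prometheus-style query response body (hypothetical helper)."""
    payload = json.loads(raw)
    if payload.get("status") != "success":
        return "error: " + str(payload.get("error", "unknown"))
    series = payload.get("data", {}).get("result", [])
    if not series:
        return "ok, but no series matched"
    return f"ok, {len(series)} series"


# The exact body returned in the post above.
raw = '{"status":"success","isPartial":false,"data":{"resultType":"matrix","result":[]}}'
print(describe_query_response(raw))
```

Applied to the response in this post, the helper reports a successful query with no matching series, which is consistent with Grafana showing nothing.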

Whistler · June 14, 2022, 11:48am

sashaafm - The root cause of your problem may (TBC) be related to my “Grafana Cloud Fly dashboard/Prometheus issue?” thread, as opposed to the lack of polling/collection by vector being reported above.

Whistler · June 14, 2022, 12:35pm

Correction … After Steve’s intervention my lhr metrics started working automatically (lax was already ok), although I’ve only just noticed that, after my deploy earlier, lax stopped being polled for metrics.

vector may still have post-deploy problems (on some hosts?).

johan · June 18, 2022, 1:43am

@steveberryman I’m having similar issues with 2 recently deployed apps: one is running in AMS, the other in AMS, LAX and SIN. None of the instances are being polled for metrics, and all of them have a custom metrics endpoint.

I do see metrics in the dashboard and when querying the API, but these are only the Fly metrics. Checking the HTTP requests for the instances, I’m not seeing any requests for /metrics come in.

avinashbot · June 19, 2022, 5:54pm

Yeah, checking in again, custom metrics seem to be gone for me on FRA (immediately after deployment this time). Tried redeploying a few times in FRA, but gave up and switched to AMS, and then CDG, where it finally started working for me. Seems like the issue is still present.

bcomnes · June 20, 2022, 2:56pm

I just deployed a brand new app and grafana instance to LAX and I am not seeing any custom metrics. I only see the basic fly_ metrics. SSHing into the instance, the metrics endpoint is alive and working. I also don’t see any incoming requests to /metrics in the logs.

bcomnes · June 20, 2022, 4:55pm

I moved my app instance from LAX to SEA and custom metrics began flowing in. It seems like LAX isn’t collecting custom metrics right now, as the app never receives pings for the /metrics route. Any user action that can be taken to fix this?

Whistler · June 20, 2022, 5:08pm

I don’t believe the problem is region-specific. I do have Prometheus metrics from an app in LAX (on host 3944). However, with numerous deploys in LAX I’d say it was ~50% chance of Prometheus metrics working/starting to be polled by Fly.

For LHR I have multiple apps on host 701b that are working, and I’ve no doubt I’ll lose metrics from those if I deploy again. With recent LHR deploys I’ve not had much joy with metrics being polled.

I can’t tell you which hosts aren’t currently working as that information is not available (Feature Request: FLY_HOST environment variable).

wjordan · June 20, 2022, 5:18pm

Looking into the issue affecting custom-metrics scraping today.

wjordan · June 20, 2022, 10:00pm

As an update, I managed to narrow down the underlying issue to a bug in Vector. For anyone interested in following along, I filed an issue upstream and we will work on fixing this bug to make this feature more reliable.

In the meantime, I’ve given Vector another kick across the fleet so custom metrics should be working everywhere again for now.

bcomnes · June 21, 2022, 2:53pm

It is working across my apps now. Thank you!

johan · June 21, 2022, 5:01pm

@wjordan Voted and subscribed on the issue. I can report that all custom metrics are back. Thanks much!

ian1 · June 22, 2022, 4:59pm

I’m still getting custom metrics from AMS and HKG but not from IAD or SJC on one of my apps. The app exposes a /metrics endpoint configured in fly.toml and used to collect metrics just fine.

Anything I can do to help troubleshoot?
