Lack of Prometheus metrics, sometimes, after deploy (host specific?)

Sometimes after a deploy the Prometheus metrics appear to stop. Redeploying appears to solve the problem (I haven’t yet tried just a restart to see whether that makes a difference).

Not sure if it’s a problem on specific hosts? My deploys thus far have only been in lhr, and my app scale count is 1.

On a previous occasion I left it ~30mins before a redeploy and it didn’t resolve itself so I don’t think it’s just a question of waiting.

Note: the problem is that the metrics aren’t being requested (i.e. nothing inbound from Fly on 9091), as opposed to issues processing the metrics requests/data (either within the app or Fly). It’s also unrelated to my other Grafana Cloud/Prometheus thread; I think this specific problem has been occurring for a while, whereas I only noticed the issue in the other thread on Friday.

No joy with restarts fixing the issue (I assume a restart remains on the same host?). I also tried a number of deploys (on Saturday evening) to no avail - I still don’t have Prometheus metrics from the app in lhr.

I’ve since added the lax region and, despite numerous deploys, the result is consistent: 1 x lax (e.g. host 3944) = Prometheus metrics OK, 1 x lhr = no Prometheus metrics. My other apps in lhr have Prometheus metrics, but they may well stop post-deploy.

Fly: would it be possible to provide a FLY_HOST environment variable with the host’s ID?

I believe the host detail is currently only available via Prometheus metrics(?) and if they aren’t working, per this thread, then it’s not possible for this to be determined?

Also facing this problem, and it only seems to be happening with new deploys (got an app that hasn’t been updated in a while and it’s cranking out metrics just fine). Every time I make a deployment, the metrics start working again for a short time (around 30 minutes or thereabouts) and then just go silent.

(noting this is happening with multiple apps in fra)

I’ve hopefully fixed this particular issue with metrics. Our collector (vector) appeared to have stopped reloading its configs on a few hosts. I’ve given it a kick across the fleet. Can you let me know if things have improved?

Yep, the metrics are rolling in again for me, thank you!

@steveberryman I believe I’ve also lost my Prometheus metrics? I’ll redeploy to see if that fixes it.

Metrics now appearing for me (both pre-existing started to work and also still ok post-deploy), thanks.

Sashaafm - I didn’t need to redeploy, they started to appear again after Steve’s vector intervention.

Avinashbot - have you tried another deploy in fra? Your problem, metrics stopping ~30mins after a deploy (which I haven’t experienced), may still exist.

Steve - Do you think there is any possibility of a FLY_HOST environment variable being added at some point in future (e.g. it would have made it easy for me to list some hosts that had been working vs. those that hadn’t)? Also, do you happen to use Grafana? If so, please could you try clicking Explore with a Fly Prometheus source and/or load the Fly dashboard (see other thread)?

I redeployed but I still don’t have metrics in the Prometheus instance :thinking:

The curl seems to be working correctly:

curl https://api.fly.io/prometheus/<SLUG>/api/v1/query_range\?step\=30 \
    --data-urlencode 'query=sum(rate(fly_edge_http_responses_count{app="$APP"}[5m])) by (status)' \
    -H "Authorization: Bearer <TOKEN>"

{"status":"success","isPartial":false,"data":{"resultType":"matrix","result":[]}}

However, I can’t seem to find any metrics in my Grafana (it was previously working correctly).

The Data Source tests okay in Grafana’s UI.

EDIT: looks like not even the default prometheus_* metrics appear
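
A response like the one above reports "status":"success" with an empty "result" list, which means the query itself ran but matched no series. That is a different situation from an API or auth failure. As a rough sketch (the function name and the three labels are hypothetical, and it assumes only the response shape shown in this post), the distinction can be checked like this:

```python
import json


def summarize_query_response(body: str) -> str:
    """Classify a Prometheus-style query response body.

    Returns "error", "empty" or "has_data" (labels invented for this sketch).
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        return "error"
    # For matrix results, "result" holds one entry per matching series.
    series = payload.get("data", {}).get("result", [])
    return "has_data" if series else "empty"


# Response body copied from the curl output in this post.
sample = '{"status":"success","isPartial":false,"data":{"resultType":"matrix","result":[]}}'
print(summarize_query_response(sample))  # prints "empty"
```

An "empty" outcome here is consistent with the data source testing OK in Grafana while no metrics are visible: the endpoint answers, but there is nothing stored for the query.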

sashaafm - The root cause of your problem may (TBC) be related to my “Grafana Cloud Fly dashboard/Prometheus issue?” thread, as opposed to the vector lack-of-polling/collection problem being reported above.

Correction :cry: … After Steve’s intervention my lhr metrics started working automatically (lax was already ok), although I’ve only just noticed that, after my deploy earlier, lax stopped being polled for metrics.

vector may still have post-deploy problems (on some hosts?).

@steveberryman I’m having a similar issue with 2 recently deployed apps: one is running in AMS, the other in AMS, LAX and SIN. None of the instances are being polled for metrics, and all of them have a custom metrics endpoint.

I do see metrics in the dashboard and when querying the API, but these are only the Fly metrics. Checking the HTTP requests for the instances, I’m not seeing any requests for /metrics come in.
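
For anyone wanting to confirm the same symptom, one way to tell "not being scraped" apart from "scraped but data lost" is to have the metrics endpoint record every request it receives. The following is only a minimal standard-library sketch, not anything Fly provides: the handler, the demo_scrapes_total metric and the log prefix are made up for illustration, and port 9091 is simply the port mentioned earlier in this thread.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

scrape_times = []  # one timestamp per GET /metrics received by this process


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.split("?")[0] != "/metrics":
            self.send_error(404)
            return
        scrape_times.append(time.time())
        # Hypothetical demo metric in the Prometheus text exposition format.
        body = (
            "# TYPE demo_scrapes_total counter\n"
            f"demo_scrapes_total {len(scrape_times)}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Prefix access-log lines so scrapes are easy to find in app logs.
        print("metrics-endpoint:", fmt % args)


def start_metrics_server(port=9091):
    """Serve /metrics in a background thread; 9091 is an assumed port."""
    server = ThreadingHTTPServer(("0.0.0.0", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

If scrape_times stays empty and no "metrics-endpoint:" lines show up in the logs after a deploy, the collector is not reaching the instance, which matches what is being reported here.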

Yeah, checking in again, custom metrics seem to be gone for me on FRA (immediately after deployment this time). Tried redeploying a few times in FRA, but gave up and switched to AMS, and then CDG, where it finally started working for me. Seems like the issue is still present.

I just deployed a brand new app and grafana instance to LAX and I am not seeing any custom metrics. I only see the basic fly_ metrics. SSHing into the instance, the metrics endpoint is alive and working. I also don’t see any incoming requests to /metrics in the logs.

I moved my app instance from LAX to SEA and custom metrics began flowing in. It seems like LAX isn’t collecting custom metrics right now, as the app never receives pings for the /metrics route. Any user action that can be taken to fix this?

I don’t believe the problem is region-specific. I do have Prometheus metrics from an app in LAX (on host 3944). However, with numerous deploys in LAX I’d say it was ~50% chance of Prometheus metrics working/starting to be polled by Fly.

For LHR I have multiple apps on host 701b that are working, but I’ve no doubt that if I deploy them again I’ll lose their metrics :cry: . With recent LHR deploys I’ve not had much joy with metrics being polled.

I can’t tell you which hosts aren’t currently working as that information is not available (Feature Request: FLY_HOST environment variable).

Looking into the issue affecting custom-metrics scraping today.

As an update, I managed to narrow down the underlying issue to a bug in Vector. For anyone interested in following along, I filed an issue upstream and we will work on fixing this bug to make this feature more reliable.

In the meantime, I’ve given Vector another kick across the fleet so custom metrics should be working everywhere again for now.

It is working across my apps now. Thank you!

@wjordan Voted and subscribed on the issue. I can report that all custom metrics are back. Thanks much!

I’m still getting custom metrics from AMS and HKG but not from IAD or SJC on one of my apps. The app exposes a /metrics endpoint configured in fly.toml and used to collect metrics just fine.

Anything I can do to help troubleshoot?