Metrics forwarding has worked perfectly for multiple weeks.
We receive both the metrics that Fly itself exposes and our own custom metrics in our Grafana dashboards without problems.
Between 13:15:45 CET and 13:23:45 CET, no measurements arrived.
Between 13:23:45 CET and 13:34:00 CET, measurements arrived again.
After 13:34:00 CET, no measurements arrived at all.
(CET = Central European Time, UTC+1)
Even stranger is another graph, where you can see that measurements started coming in more and more infrequently, with the first gap appearing around 12:01:45 CET.
Is this impacting a particular app or region, or is it widespread across all apps and regions in your org? Are specific metrics affected, or all metrics in the org (both internal/platform and custom metrics)? And just to confirm: the Grafana screenshots above are using Fly Prometheus directly as a datasource, not your own Prometheus getting metrics forwarded to it?
For one particular app (which only runs instances in the ams region).
These screenshots are from Fly’s Grafana, using ‘Fly Prometheus’ as data source.
Strangely enough, the metrics worked again after a cluster reboot.
However, it had worked for days without problems before that.
Two things make me think there really was something wrong in the metrics forwarding infrastructure: the data started disappearing gradually rather than cutting off all at once (as can be seen in the ‘Jobs in queue’ graph), and the Prometheus metrics page was still perfectly healthy when I inspected it with curl 0.0.0.0:9394/metrics from within the instances.
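For anyone following along, something like this reproduces the same check from inside an instance (using fly ssh console and a placeholder app name is just one way to get there):

```sh
# SSH into a running instance of the affected app (replace my-app with the real app name)
fly ssh console -a my-app

# Inside the instance: the Ruby exporter is listening on port 9394
curl -s http://0.0.0.0:9394/metrics | head -n 20
```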
There don’t seem to be any issues with the metrics infrastructure; the problem appears to be somewhere in your application’s Ruby metrics collector. The /metrics endpoint keeps responding, but the application metrics stop being exported at some point. I noticed that a metric called ruby_collector_working starts flipping from 1 to 0 in exactly the same pattern in which your other metrics disappear, so that’s a lead you can follow.
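If you want to watch for that directly, one option is to poll the metric on the endpoint you already checked. A rough sketch (the 30-second interval is arbitrary, and how you get shell access to the instance is up to you):

```sh
# From inside a running instance: poll ruby_collector_working every 30 seconds,
# logging a UTC timestamp with each sample so it can be correlated with the Grafana gaps
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  curl -s http://0.0.0.0:9394/metrics | grep '^ruby_collector_working'
  sleep 30
done
```

That should show when the collector flips from 1 to 0 on a given instance.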