Metrics forwarding has worked perfectly for multiple weeks.
We receive both the metrics that Fly itself exposes and our own custom metrics in our Grafana dashboards without problems.
Between 13:15:45 CET and 13:23:45 CET, no measurements arrived.
Between 13:23:45 CET and 13:34:00 CET, measurements arrived again.
After 13:34:00 CET, no measurements arrived at all.
(CET = Central European Time, UTC+1)
Even stranger is another graph, where you can see that measurements started coming in more and more infrequently, with the first gap appearing around 12:01:45 CET.
Is this impacting a particular app or region, or is it widespread across all apps and regions in your org? Are specific metrics affected, or all metrics in the org (both internal/platform and custom metrics)? And just to confirm: the Grafana screenshots above are using Fly Prometheus directly as a datasource, not your own Prometheus getting metrics forwarded to it?
For one particular app (which only runs instances in the ams region).
These screenshots are from Fly’s Grafana, using ‘Fly Prometheus’ as data source.
Strangely enough, the metrics worked again after a cluster reboot.
However, it had worked for days without problems before that.
Two things make me think there really was something wrong in the metrics forwarding infrastructure: the data started disappearing gradually rather than cutting off all at once (as can be seen in the ‘Jobs in queue’ graph), and the Prometheus metrics page was still perfectly healthy when I inspected it with curl 0.0.0.0:9394/metrics from within the instances.
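For anyone following along, something like this reproduces the same check from inside an instance (using fly ssh console and a placeholder app name is just one way to get there):

```sh
# SSH into a running instance of the affected app (replace my-app with the real app name)
fly ssh console -a my-app

# Inside the instance: the Ruby exporter is listening on port 9394
curl -s http://0.0.0.0:9394/metrics | head -n 20
```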
There don’t seem to be any issues with the metrics infrastructure; the problem appears to be somewhere in your application’s Ruby metrics collector. The /metrics endpoint keeps responding, but the application metrics stop being exported at some point. I noticed that a metric called ruby_collector_working starts flipping from 1 to 0 in exactly the same pattern in which your other metrics disappear, so that’s a lead you can follow.
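If you want to watch for that directly, one option is to poll the metric on the endpoint you already checked. A rough sketch (the 30-second interval is arbitrary, and how you get shell access to the instance is up to you):

```sh
# From inside a running instance: poll ruby_collector_working every 30 seconds,
# logging a UTC timestamp with each sample so it can be correlated with the Grafana gaps
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  curl -s http://0.0.0.0:9394/metrics | grep '^ruby_collector_working'
  sleep 30
done
```

That should show when the collector flips from 1 to 0 on a given instance.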