Metrics forwarding has worked perfectly for multiple weeks.
We receive both the metrics that Fly itself exposes and our own custom metrics in our Grafana dashboards without problems.
- Between 13:15:45 CET and 13:23:45 CET measurements were missing.
- Between 13:23:45 CET and 13:34:00 CET measurements arrived again.
- After 13:34:00 CET no more measurements arrived.
(CET = Central European Time, UTC+1)
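To pin down holes like these without reading them off a graph, the sample timestamps of a series can be scanned for spacing larger than the scrape interval. The following is only a minimal sketch: the 15-second interval is an assumption inferred from the timestamps above landing on multiples of 15 seconds, the date is a placeholder, and `find_gaps` and `sample_range` are hypothetical helper names.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, interval=timedelta(seconds=15), tolerance=1.5):
    """Return (last_seen, next_seen) pairs where two consecutive samples
    are further apart than tolerance * interval."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval * tolerance:
            gaps.append((prev, cur))
    return gaps

def sample_range(start, end, step=timedelta(seconds=15)):
    """Build evenly spaced timestamps from start to end (inclusive)."""
    out = []
    t = start
    while t <= end:
        out.append(t)
        t += step
    return out

# Placeholder date; only the times of day come from the report above.
def at(h, m, s):
    return datetime(2024, 1, 1, h, m, s)

# Synthetic data: samples up to 13:15:45, nothing until 13:23:45,
# then samples again until 13:34:00.
samples = sample_range(at(13, 14, 0), at(13, 15, 45)) + sample_range(
    at(13, 23, 45), at(13, 34, 0)
)

gaps = find_gaps(samples)
for start, end in gaps:
    print(start.time(), "->", end.time())
```

The tolerance factor keeps small scrape jitter from being reported as a gap; it would need tuning for a different scrape interval.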
Even stranger is the view of another graph. Here you can see that measurements started coming in more and more infrequently, with the first hole happening around 12:01:45CET.
I have SSH’d into the instances and confirmed that the metrics page exposed by Prometheus at 0.0.0.0:9394/metrics is still fully operational.
Therefore I expect that the problem is somewhere in Fly’s metrics forwarding infrastructure.
Could you investigate what is going on?
Is this impacting a particular app or region, or is it widespread throughout all apps / regions across your org? Are specific metrics impacted, or all metrics in the org (both internal/platform and custom metrics)? And just to confirm, the above Grafana screenshots are using Fly Prometheus as a datasource directly, not your own Prometheus getting metrics forwarded to it?
Thank you, great questions!
- Only the custom metrics were impacted;
- For one particular app. (that only runs instances in the
- These screenshots are from Fly’s Grafana, using ‘Fly Prometheus’ as data source.
Strangely enough, the metrics worked again after a cluster reboot.
However, it has worked for days without problems.
Two things make me think that there really was something wrong in the metrics forwarding infrastructure. First, the data started disappearing gradually rather than being cut off all at once, as can be seen in the ‘Jobs in queue’ graph. Second, the Prometheus metrics page was still perfectly healthy when I inspected it using curl 0.0.0.0:9394/metrics from within the instances.
There don’t seem to be any issues with the metrics infrastructure; the issue seems to be somewhere in your application’s Ruby metrics collector. The /metrics endpoint continues responding, but application metrics stop getting exported at some point. I noticed that the metric ruby_collector_working starts flipping to 0 in exactly the pattern in which your other metrics start disappearing, so that’s a lead you can follow.
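A responding /metrics endpoint and an endpoint that still contains application metrics are two different things, so it can help to check the content of the response rather than just its status. The sketch below parses Prometheus text-format output and reports the value of ruby_collector_working together with the application metrics that are present. It is an illustration under assumptions: the `myapp_` prefix, the `myapp_jobs_in_queue` metric, and both sample payloads are hypothetical, and fetching the text (for example with curl from within an instance) is left out.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition output into
    {metric_name: [(labels_string, value), ...]}. Comment lines are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name, _, rest = line.partition("{")
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        # The first token after the name/labels is the value;
        # an optional timestamp may follow and is ignored here.
        value = float(tail.split()[0])
        samples.setdefault(name, []).append((labels, value))
    return samples

def collector_status(text, app_prefix):
    """Summarise whether the collector reports itself as working and
    which metrics with the given prefix are present."""
    samples = parse_metrics(text)
    working = [v for _, v in samples.get("ruby_collector_working", [])]
    return {
        "collector_working": bool(working) and all(v == 1.0 for v in working),
        "app_metrics": sorted(n for n in samples if n.startswith(app_prefix)),
    }

# Hypothetical payloads, made up for illustration only.
HEALTHY = """\
# TYPE ruby_collector_working gauge
ruby_collector_working 1
# TYPE myapp_jobs_in_queue gauge
myapp_jobs_in_queue{queue="default"} 12
"""

DEGRADED = """\
# TYPE ruby_collector_working gauge
ruby_collector_working 0
"""

healthy = collector_status(HEALTHY, "myapp_")
degraded = collector_status(DEGRADED, "myapp_")
print(healthy)
print(degraded)
```

Running a check like this periodically from within an instance would show whether the application metrics vanish from the endpoint itself at the same moments they vanish from the dashboard.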
Thank you very much for digging deep into this issue! We will investigate it on our end.