Fly Prometheus metrics have been unreliable

I run Grafana on-premises and query Fly's Prometheus metrics to fire alerts.

I’ve set up a Grafana alert on fly_instance_up to detect downtime, and the alert has been going crazy today for some reason. Here’s my alert query.
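
It boils down to the standard up/down check; roughly the following, where the max aggregation and the app label value are placeholders for my real setup:

    # Evaluates to 1 while any instance reports up, and falls back to 0
    # when the series is missing entirely (so "no data" also reads as down).
    # "my-app" stands in for the real app name.
    max(fly_instance_up{app="my-app"}) or vector(0)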

But when I run the same query from the Explore tab, the graph doesn’t look anything like what the alert is seeing.

This isn’t the only case of my alerts becoming unreliable. I frequently get alerted with DatasourceNoData or DatasourceError, and when I run the query manually, it’s perfectly fine. My internet connection isn’t the most reliable, so I can understand getting DatasourceError, but DatasourceNoData implies the query succeeded and there simply wasn’t any data. Has anyone else encountered this? It’s really frustrating because I’ve put all this work into refining my alerts to be useful, and now I can’t trust them.

Hi @dyc3,

  1. Yes, our hosted metrics cluster suffered some capacity-related issues today: Fly.io Status - Metrics collection delays. This caused delays in collecting metrics, so the most recent data points could take a few minutes to become available to queries.

    We increased capacity, which seems to have helped somewhat, but there have been a few intermittent delays since. In any case, we’re aware of the issue and working on it.

  2. If you just want the intermittent false positives to stop bothering you, there are a couple of ways you could adjust your alert. You could widen the time-series range to a larger window (say, 10-15 minutes) and apply a max expression, so that any non-zero fly_instance_up sample over that window satisfies the query (there’s a sketch of this after the list). Alternatively, you could set the ‘no data’ behavior to ‘error’ (instead of ‘alerting’) and remove the or vector(0) from the query, in which case a missing metric will not fire an alert.

  3. In the alert query screenshot you shared, the bands at the left are just a sampling-rate artifact: with ‘max data points’ set to 50000, the graph uses a minimum step of 1s, which is too small for the 15-second scrape interval these platform metrics are collected at. You can fix this by changing ‘min step’ from ‘auto’ to 15s to match the Prometheus scrape interval. (I’m pretty sure this is a Grafana Alerting bug: every other graph automatically sets the min step to the scrape interval configured on the selected data source, except this one.)
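
Regarding the first option in point 2, here’s a rough PromQL sketch. It expresses the same idea as a wider time range plus a max reduce expression directly in the query; the app label value is just a placeholder:

    # Look back over a wider window so short collection delays don't flip the alert.
    # max_over_time returns the highest sample in the window, so a single
    # successful scrape of fly_instance_up in the last 10m evaluates to 1.
    # "my-app" is a placeholder for your app name.
    max_over_time(fly_instance_up{app="my-app"}[10m]) or vector(0)

With a shape like this, brief collection gaps get smoothed over by the wider window rather than turning into alerts.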

Sorry for the inconvenience, and hope this info helps.

