Prometheus / Fly metrics getting dropped

I know the Prometheus metrics feature is in early access, but I was wondering if there are any known ongoing issues with it. I’m noticing that a lot of data points seem to be getting dropped when I display the data in Grafana. A lot of my graphs end up looking like this:


This seems to be the case both for metrics generated by apps and for the default Fly metrics. If I had to guess, the Prometheus nodes aren’t keeping up and are missing scrapes, sometimes for several minutes at a time. Looking back at the history, there seem to be occasional blips (which makes sense for an early access service), but it seems to have gotten significantly worse in all regions around 11am PST yesterday (November 15th):
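
For anyone who wants to put numbers on the missed scrapes rather than eyeballing graphs, here is a minimal sketch that flags gaps in a list of sample timestamps. The 15-second scrape interval and the `tolerance` factor are assumptions for illustration, not values confirmed by Fly, and `find_gaps` is a hypothetical helper name.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    scrape_interval: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where at least one scrape looks missed.

    A gap is reported when two consecutive samples are more than
    tolerance * scrape_interval seconds apart.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > tolerance * scrape_interval:
            gaps.append((prev, cur))
    return gaps


# Sample timestamps in seconds; the jump from 45 to 240 is a gap of several minutes.
samples = [0, 15, 30, 45, 240, 255, 270]
print(find_gaps(samples, scrape_interval=15))
```

The timestamps can come from wherever the series is exported; only the spacing between samples matters here.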

It does make keeping track of live nodes a bit difficult.

Also checking in with the same issue since about 24 hours ago (varying by a few hours depending on the app). In my case, it looks like the default fly metrics are flaky too, but custom metrics are just not showing up.

All my instances are in FRA, if that helps.

We have been furiously expanding this metrics cluster. The short answer on gaps like this is: Our metrics cluster is growing quickly and we aren’t ahead of it. We will charge for it when we’re comfy we know how to keep up, which will slow growth, but we’re not comfortable charging money for metrics at the current level of reliability.

For the moment the best advice I have is “hold tight”.

That makes sense. If we were to bring up our own prometheus instance on fly, is there a way to scrape the fly metrics?

The data should be available via the API: Hooking Up Fly Metrics · Fly — that doesn’t replace the Fly system, though, which I think is actually what you’re asking?

If you wanted to completely bypass Fly you could configure your applications to send data straight out to an external service or to an internal app using the app.internal endpoint.

You can scrape your own metrics, assuming you’re using custom metrics and your exporter is listening on the private IPv6 addresses. We don’t have Prometheus endpoints available for scraping, though.
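
As a rough illustration of the "exporter listening on IPv6" part, here is a minimal sketch using only the Python standard library. It binds to `::` (all IPv6 addresses, which would include the private one) and serves a gauge in the Prometheus text exposition format. The port `9091`, the metric name, and the helper names are all hypothetical; a real app would more likely use a Prometheus client library.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict


def render_metrics(metrics: Dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


class IPv6HTTPServer(HTTPServer):
    # HTTPServer defaults to IPv4; switch the socket family to IPv6.
    address_family = socket.AF_INET6


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical demo metric; replace with values from your app.
    metrics: Dict[str, float] = {"demo_requests_in_flight": 3.0}

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.metrics).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9091) -> None:
    """Listen on all IPv6 addresses (assumed port) until interrupted."""
    IPv6HTTPServer(("::", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the listener; your own Prometheus instance would then need a scrape target pointing at the app's private address and that port.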

That might be a good feature to add. One way we can improve the reliability of metrics for paid users would be to just run dedicated Victoria Metrics clusters.

Seems to be working much better today!

Yeah, if it’s going to be a paid feature, I’d be fine with it being treated like Postgres, where we just run our own instance as a Fly app under our own organization.

As @charsleysa said, it does seem to be much more stable as of this morning. Glad to see it catching up to the demand :slight_smile:

@kurt Not a huge priority, but I wanted to mention that it looks like the metrics endpoints aren’t keeping up with data again. It looks like we started getting bits of missing data on the 27th, continuing through today.
