The Prometheus metrics API endpoint at https://api.fly.io/prometheus/ has been returning 503 errors for the past ~15 minutes. I haven’t seen an update on the status page. Is there a current incident?
We were alerted by PagerDuty through some monitors we have set up on our metrics.
Checking back here: we were alerted again and have been seeing 503s on the metrics query endpoint for the last 45 minutes. The status page is still green.
Is there a better way to export these metrics from Fly? When Fly scrapes the metrics, can we configure an endpoint or queue for it to write them to? We could set up our own Prometheus to scrape our custom application metrics, but it seems like we would be missing the Fly internal ones. Any suggestions here would be great.
Hi @mheffner, apologies for the ongoing trouble with the intermittent metrics-query errors. The API was affected by a recurring denial-of-service attack that was causing the metrics endpoint to become overloaded until we intervened.
Today we deployed some mitigations that better protect the endpoint against the particular kind of attack we observed, so query API reliability should hopefully improve going forward.
Querying the public API at api.fly.io/prometheus is the only way to query built-in metrics collected from our platform. There isn’t any better way currently, but I appreciate your feedback on this as it’s something we hope to improve soon.
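For anyone polling this endpoint from their own tooling in the meantime, here is a minimal sketch of a client that retries on 503 with exponential backoff, so a brief outage does not immediately page someone. Only the base URL comes from this thread. The `<org>/api/v1/query` path layout, the `Bearer` token header, and the names `build_query_url`, `fetch`, and `query_with_retry` are assumptions for illustration, so check them against the Fly metrics docs before relying on them.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "https://api.fly.io/prometheus"  # base URL mentioned in this thread


def build_query_url(org_slug, promql, base_url=BASE_URL):
    # ASSUMPTION: the endpoint follows the standard Prometheus HTTP API layout
    # under an org-specific prefix, i.e. <base>/<org>/api/v1/query.
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url}/{org_slug}/api/v1/query?{params}"


def fetch(url, token):
    # ASSUMPTION: authentication uses a bearer token in the Authorization header.
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def query_with_retry(url, token, fetch_fn=fetch, attempts=4,
                     base_delay=1.0, sleep=time.sleep):
    """Call fetch_fn, retrying only on HTTP 503 with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch_fn(url, token)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, or a 503 on the last attempt.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

Usage would look something like `query_with_retry(build_query_url("my-org", "up"), token)`, where `my-org` and `token` are placeholders. Retrying only helps with short blips; for an outage lasting 15 to 45 minutes like the ones reported above, the alert should still fire.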
Thanks for the work to improve this and for the detailed update, @wjordan. We’ll keep watching on our end for any further issues, but it seems to be better recently!