Restarting Based on Metrics

shishir_j1 · May 19, 2025, 8:38pm

Hello Fly.io community,

I’m running a Python application on Fly and have noticed that high concurrency metrics in Grafana often indicate that my application is hanging. I’d like to set up an automated system that can detect when this metric exceeds a threshold and trigger a restart of the affected machine.

What I’ve explored so far:

I’ve looked into the Fly Machines API which seems to offer machine start/stop (could be used to restart?) capabilities
I’ve also explored the Prometheus metrics available in Grafana
However, I’m not clear on how to effectively combine these to create an automated restart trigger

My ideal solution: A setup that monitors my app’s concurrency metric and automatically triggers a machine restart when that metric goes abnormally high, indicating a potential hang state.

Has anyone implemented something similar? I’d appreciate any guidance on:

Best practices for setting up this kind of automated restart
Examples of scripts or tools that might help connect Grafana/Prometheus alerts to the Machines API
Alternative approaches I might not have considered

Thank you in advance for any help or pointers in the right direction!

halfer · May 19, 2025, 8:50pm

The first thing I’d suggest you do is to investigate how to retrieve the metric you’re interested in from Prometheus. It looks like it has an API:

Bear in mind that there is some talk of withdrawing or redesigning the free tier logging. I assume, however, that if you like Prometheus, you could run it yourself, so it is not bad per se to build on it.
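As a rough illustration of retrieving a metric, here is a minimal sketch that runs an instant query against the standard Prometheus HTTP API (`GET /api/v1/query`) using only the Python standard library. The base URL is a hypothetical placeholder, and the bearer-token header is an assumption about how your Prometheus endpoint authenticates; adjust both for wherever your metrics actually live.

```python
import json
import urllib.parse
import urllib.request

# Assumption: hypothetical placeholder for your Prometheus-compatible endpoint.
PROM_BASE_URL = "https://prometheus.example.com"


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> list:
    """Convert an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def query_prometheus(base_url: str, promql: str, token=None, timeout: float = 10.0) -> list:
    """Run an instant query and return (labels, value) pairs."""
    request = urllib.request.Request(build_query_url(base_url, promql))
    if token:
        # Assumption: the endpoint accepts a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

Something like `query_prometheus(PROM_BASE_URL, "my_concurrency_metric")` would then return one pair per series; the metric name here is made up, so substitute the one you see in Grafana.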

I’ve looked into the Fly Machines API which seems to offer machine start/stop (could be used to restart?) capabilities

Yes, definitely. A small additional app to do this monitoring would be a good approach.
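A minimal sketch of what such a monitoring app might do, assuming you already have the metric as a list of `(labels, value)` pairs from Prometheus. It picks out the machines above a threshold and sends a restart request to the Machines API for each one. The base URL and the `/restart` path reflect my understanding of the public Machines API, and the `instance` label holding the machine ID is an assumption; check both against the current docs and your own metric labels.

```python
import urllib.request

# Assumption: public Machines API base URL; verify against the Fly.io docs.
MACHINES_API = "https://api.machines.dev/v1"


def machines_over_threshold(samples, threshold, label="instance"):
    """Return the sorted machine IDs whose metric value exceeds the threshold.

    `samples` is a list of (labels, value) pairs. The `label` argument names
    the label assumed to carry the machine ID; series without it are skipped.
    """
    return sorted(
        {labels[label] for labels, value in samples if label in labels and value > threshold}
    )


def build_restart_request(app_name, machine_id, token):
    """Build (but do not send) the POST request that restarts one machine."""
    url = f"{MACHINES_API}/apps/{app_name}/machines/{machine_id}/restart"
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {token}"}
    )


def restart_machines(app_name, machine_ids, token, timeout=30.0):
    """Send a restart request for each machine ID and report the HTTP status."""
    for machine_id in machine_ids:
        request = build_restart_request(app_name, machine_id, token)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            print(f"restart {machine_id}: HTTP {response.status}")
```

You would run this on a schedule (for example, every minute) with a threshold chosen from what a hang looks like in Grafana. It is probably worth adding a cooldown per machine so that a machine that is slow to recover does not get restarted in a loop.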

