Hello Fly.io community,
I’m running a Python application on Fly and have noticed that a high value for the concurrency metric in Grafana often indicates that my application is hanging. I’d like to set up an automated system that detects when this metric exceeds a threshold and triggers a restart of the affected machine.
What I’ve explored so far:
- I’ve looked into the Fly Machines API, which seems to offer machine start/stop capabilities (could these be used to restart a machine?)
- I’ve also explored the Prometheus metrics available in Grafana
- However, I’m not clear on how to effectively combine these to create an automated restart trigger
My ideal solution: A setup that monitors my app’s concurrency metric and automatically triggers a machine restart when that metric goes abnormally high, indicating a potential hang state.
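To make that concrete, here is a rough, untested sketch of the kind of glue script I have in mind. It runs an instant query against the Prometheus HTTP API, picks out the machines whose value is above a threshold, and sends a restart request for each one. Several things here are my assumptions and need checking against the Fly docs: the two URL patterns, the `fly_app_concurrency` metric name, the `instance` label holding the machine ID, the `Bearer` auth scheme, and the threshold of 50. The function names are just ones I made up.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTIONS: both URL patterns, the metric name, and the "instance" label
# are my guesses and should be verified against the Fly docs.
PROM_URL = "https://api.fly.io/prometheus/{org}/api/v1/query"
RESTART_URL = "https://api.machines.dev/v1/apps/{app}/machines/{machine_id}/restart"
THRESHOLD = 50.0  # hypothetical value; tune to what "hung" looks like for the app


def build_query(app, metric="fly_app_concurrency"):
    """PromQL for the highest current concurrency per machine of one app."""
    return f'max by (instance) ({metric}{{app="{app}"}})'


def machines_over_threshold(prom_response, threshold):
    """Return the instance labels whose latest sample exceeds the threshold.

    Expects the standard Prometheus instant-query JSON, where each result
    carries a "metric" label map and a "value" pair of [timestamp, "string"].
    """
    hung = []
    for series in prom_response.get("data", {}).get("result", []):
        instance = series.get("metric", {}).get("instance")
        _timestamp, value = series["value"]
        if instance and float(value) > threshold:
            hung.append(instance)
    return hung


def call_api(url, token, method="GET"):
    """Send an authenticated request and decode the JSON body, if any."""
    request = urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Bearer {token}"},  # auth scheme assumed
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else {}


def check_and_restart(org, app, token, threshold=THRESHOLD):
    """Query the metric once and restart every machine above the threshold."""
    query = urllib.parse.urlencode({"query": build_query(app)})
    metrics = call_api(PROM_URL.format(org=org) + "?" + query, token)
    restarted = []
    for machine_id in machines_over_threshold(metrics, threshold):
        call_api(RESTART_URL.format(app=app, machine_id=machine_id), token, method="POST")
        restarted.append(machine_id)
    return restarted
```

I would call `check_and_restart(org, app, token)` every minute or so from a cron job or a small separate machine. A real version would probably also need a cooldown so that a machine that is still warming up is not restarted in a loop.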
Has anyone implemented something similar? I’d appreciate any guidance on:
- Best practices for setting up this kind of automated restart
- Examples of scripts or tools that might help connect Grafana/Prometheus alerts to the Machines API
- Alternative approaches I might not have considered
Thank you in advance for any help or pointers in the right direction!