Restarting Based on Metrics

Hello Fly.io community,

I’m running a Python application on Fly and have noticed that a high concurrency metric in Grafana often indicates that my application is hanging. I’d like to set up an automated system that detects when this metric exceeds a threshold and triggers a restart of the affected machine.

What I’ve explored so far:

  • I’ve looked into the Fly Machines API which seems to offer machine start/stop (could be used to restart?) capabilities
  • I’ve also explored the Prometheus metrics available in Grafana
  • However, I’m not clear on how to effectively combine these to create an automated restart trigger

My ideal solution: A setup that monitors my app’s concurrency metric and automatically triggers a machine restart when that metric goes abnormally high, indicating a potential hang state.

Has anyone implemented something similar? I’d appreciate any guidance on:

  1. Best practices for setting up this kind of automated restart
  2. Examples of scripts or tools that might help connect Grafana/Prometheus alerts to the Machines API
  3. Alternative approaches I might not have considered

Thank you in advance for any help or pointers in the right direction!

The first thing I’d suggest you do is to investigate how to retrieve the metric you’re interested in from Prometheus. It looks like it has an API:
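As a rough starting point, here is a minimal sketch of reading a metric through the standard Prometheus HTTP API (`/api/v1/query`) using only the Python standard library. The endpoint URL, the `Bearer` auth scheme, the metric name `fly_app_concurrency`, and the idea that the `instance` label holds the machine ID are all assumptions on my part, so check them against the current Fly docs before relying on them. The function names are just made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint shape for Fly's hosted Prometheus; verify against the docs.
PROM_BASE = "https://api.fly.io/prometheus/{org}/api/v1/query"


def build_query_url(org_slug, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    base = PROM_BASE.format(org=urllib.parse.quote(org_slug, safe=""))
    return base + "?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Turn an instant-query response into {instance label: current value}."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got %s" % data["resultType"])
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in data["result"]
    }


def fetch_concurrency(org_slug, app_name, token):
    """Query the (assumed) concurrency metric for every machine of one app."""
    promql = f'fly_app_concurrency{{app="{app_name}"}}'  # assumed metric and label names
    request = urllib.request.Request(
        build_query_url(org_slug, promql),
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

Something like `fetch_concurrency("my-org", "my-app", token)` would then give you a dict of per-machine values to compare against your threshold. Keeping the parsing separate from the HTTP call makes it easy to test against a saved response.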

Just bear in mind that there is some talk of withdrawing or redesigning the free tier logging. I assume, however, that if you like Prometheus you could run it yourself, so it is not bad per se to build on it.

I’ve looked into the Fly Machines API which seems to offer machine start/stop (could be used to restart?) capabilities

Yes, definitely. A small additional app to do this monitoring would be a good approach.
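To make that a bit more concrete, here is a hedged sketch of the two pieces such a monitoring app would need: a decision step and a restart call. I'm assuming the Machines API is reachable at `https://api.machines.dev/v1`, that it has a `POST .../machines/{id}/restart` endpoint, and that it accepts a `Bearer` token; please confirm all of that in the Machines API reference. `HangDetector` and `restart_machine` are hypothetical names, and the "several consecutive polls" rule is my own suggestion to avoid restarting on a brief spike.

```python
import urllib.request

MACHINES_API = "https://api.machines.dev/v1"  # assumed Machines API base URL


class HangDetector:
    """Flag a machine once its metric exceeds `threshold` on `required` consecutive polls."""

    def __init__(self, threshold, required=3):
        self.threshold = threshold
        self.required = required
        self._streaks = {}  # machine ID -> consecutive polls above threshold

    def update(self, samples):
        """Take {machine ID: metric value} for one poll; return the IDs to restart."""
        streaks = {}
        flagged = []
        for machine_id, value in samples.items():
            if value > self.threshold:
                streaks[machine_id] = self._streaks.get(machine_id, 0) + 1
            else:
                streaks[machine_id] = 0
            if streaks[machine_id] >= self.required:
                flagged.append(machine_id)
                streaks[machine_id] = 0  # start counting again after a restart
        # Machines missing from this poll are forgotten.
        self._streaks = streaks
        return flagged


def restart_machine(app_name, machine_id, token):
    """Ask the (assumed) Machines API restart endpoint to restart one machine."""
    url = f"{MACHINES_API}/apps/{app_name}/machines/{machine_id}/restart"
    request = urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

The main loop would poll Prometheus every so often, pass the per-machine values to `detector.update(...)`, and call `restart_machine` for each ID it returns. If the restart endpoint turns out not to behave as assumed, calling stop and then start on the machine should achieve the same thing.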
