So far, the UX we have around that is either showing something in your logs (if it comes from the proxy), showing it on your app's Monitoring page, or running fly checks list -a APPNAME. But that only tells you what is happening now.
We’d love to hear your opinions on what you think you need to get the most out of our health checks so you can improve the reliability of your apps.
Don’t really need it for AppsV2 since VMs recover like clockwork (we see at least 3 OOMs a day across 30+ machines due to the nature of the service we run [0]), and I don’t see zombie, phantom, or ghost VMs anymore, which was a huge uptime problem before.
That said, a webhook / email (to a custom address) on health-check failures (or on any VM-down event; or, better, when ALL VMs are down) would be neat.
[0] The Fly proxy needs to expose token-bucket-like admission control (burst + fill rate), because a barrage of requests is usually what causes these OOMs.
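
To make the ask in [0] concrete, here is a rough sketch of what I mean by a token bucket with a burst and a fill rate. This is not anything the Fly proxy exposes today; the class name, the parameter names (burst, fill_rate), and the injectable clock are all made up for illustration.

```python
import time


class TokenBucket:
    """Hypothetical admission control: `burst` is the bucket capacity,
    `fill_rate` is how many tokens are added back per second."""

    def __init__(self, burst, fill_rate, clock=time.monotonic):
        self.burst = float(burst)
        self.fill_rate = float(fill_rate)
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = float(burst)    # start full so an initial burst is admitted
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request is admitted, False if it should be shed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.fill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With something like TokenBucket(burst=50, fill_rate=10) in front of each VM, a sudden barrage would have its first 50 requests admitted and the rest rejected until tokens refill at 10 per second, instead of everything landing on the VM at once. The numbers are placeholders, not a recommendation.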