Metrics for debugging connection losses

I have an app that serves as the gRPC endpoint for long-running processes. From time to time, I see connection losses that can last quite a long time. What I see in the Grafana metrics is that the instance is up and apparently working fine (metrics like fly_instance_memory_mem_free look okay), but the metrics relating to the app and the edge (e.g., fly_app_concurrency, fly_edge_data_in) have a gap. For example, I recently had such a gap lasting about 15 minutes (on 2024-11-12 in the AMS region). This is much longer than the disconnection timeouts that I want to set in my app.
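
One way to put numbers on such gaps is to scan the sample timestamps of an affected series (for example, fly_edge_data_in) for missing intervals. The following is a minimal sketch, not an official tool: the function name `find_gaps`, the 15-second scrape interval, and the tolerance factor are all assumptions for illustration, and the timestamps are assumed to have been exported as Unix seconds.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    scrape_interval: float = 15.0,  # assumed scrape interval in seconds
    tolerance: float = 2.0,         # assumed: flag deltas above 2x the interval
) -> List[Tuple[float, float, float]]:
    """Return (last_seen, next_seen, duration) for each gap in the samples."""
    ordered = sorted(timestamps)
    threshold = scrape_interval * tolerance
    gaps = []
    # Compare each sample with its successor and record oversized deltas.
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > threshold:
            gaps.append((prev, curr, delta))
    return gaps


if __name__ == "__main__":
    # Hypothetical samples: regular scrapes, then a 900-second (15-minute) hole.
    samples = [0, 15, 30, 930, 945]
    for start, end, duration in find_gaps(samples):
        print(f"gap from {start} to {end}: {duration} s")
```

Running the same check on an instance-level series and on an app or edge series would show whether the gap appears only in the latter, which matches what I see in Grafana.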

Is anyone else seeing such connection losses? What kind of metrics are you looking at to help debug the issue? If the problem is not in my configuration, I guess this is an internal reliability issue, and there is not much I can do.
