Help me understand metrics and issues


I’m not quite sure I understand how to read the metrics in the Dashboard and Grafana, as I can’t get them to make sense. In the Dashboard, sometimes I see HTTP response times measured in seconds. When I click the “Grafana” button to open the metrics in Grafana, the response times are much lower (usually < 500 milliseconds). Is there a bug in the Dashboard that inflates the response times, or are these different queries? This usually happens when monitoring reports that one or more of the app instances are in a critical state.

I just had a case of the Dashboard showing 3 seconds for the 0.5 quantile, < 500 ms for 0.95 (!), and over 5 seconds for 0.99. It looks wrong, and also completely different from the Grafana stats for the same timespan.

I also don’t fully get why my apps go critical so often. I assume Fly reports them as critical when the health check is not responding, but I can see from my application logs that the health check endpoint is responding as usual (and it is responding in less than 1 millisecond - just a simple 200 OK). I just had two of three nodes go critical. In the logs, I could see health checks being performed. When I restarted the apps, they became healthy again. There are no errors in the app logs. Memory usage is far below max for the instances. Dedicated CPU VMs.

It looks to me like the issues are outside the application, but if anybody has experienced the same type of behavior from their apps and managed to fix it, I’d be happy to learn about it.

Hey! If you select Fly Edge > HTTP (top right in Grafana), do those numbers look closer to what you’re seeing on the dashboard?

We have separate metrics for the app lifecycle and for the edge request/response lifecycle. Edge response time is basically app response time + the networking/proxying needed to respond from our edge nodes.
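To make the app-versus-edge distinction concrete, here is a minimal sketch with made-up numbers (not Fly's actual queries or data). It assumes a hypothetical app that answers in 50-400 ms, a fixed 10 ms proxy cost, and a multi-second stall on every 25th request somewhere between the edge and the app. The `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: p is an integer percent (1-100)."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so no floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical app-side response times in milliseconds.
app_ms = [random.uniform(50, 400) for _ in range(1000)]

# Hypothetical edge-side times: app time + 10 ms proxying,
# plus a 5000 ms stall on every 25th request (4% of traffic).
edge_ms = [
    app + 10 + (5000 if i % 25 == 0 else 0)
    for i, app in enumerate(app_ms)
]

for p in (50, 95, 99):
    print(f"p{p}: app={percentile(app_ms, p):.0f} ms  "
          f"edge={percentile(edge_ms, p):.0f} ms")
```

In this toy data the app-side numbers all stay under 500 ms, while the edge-side p99 jumps above 5 seconds because only the slowest few percent of requests hit the stall. Note also that within one set of samples the percentiles can only stay level or rise as the percentile goes up, so a 0.5 value above the 0.95 value is worth checking against the underlying query.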

Clicking “Show Queries” on the Fly.io dashboard metrics will show the query we run to get those. You should be able to inspect the JSON for the Grafana dashboard panels as well if you’re interested in comparing them more closely.

Could you share some bits of your fly.toml here?

Thank you for getting back to me on this.

I checked the Fly Edge metrics, and these match what I see on the Dashboard with the long response times. Does this mean that the slowdown is happening in the network, before/after the request hits the app?

Here is most of the fly.toml config:

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[experimental]
  auto_rollback = true

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false
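For comparison, the check settings above (GET on /health with a 2000 ms timeout) can be imitated locally with a small probe. This is only a rough sketch: the stand-in server, the `probe` helper, and the rule that any 2xx status counts as passing are assumptions made for illustration, not a description of how Fly runs its checks.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in app: answers GET /health with a bare 200, anything else with 404."""

    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

def probe(url, timeout_s=2.0):
    """Run one GET; return (passed, elapsed_ms).

    Assumption: a 2xx status within the timeout counts as a pass.
    Network errors, HTTP error statuses, and timeouts count as failures.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            passed = 200 <= resp.status < 300
    except OSError:
        passed = False
    return passed, (time.monotonic() - start) * 1000

# Port 0 asks the OS for any free port, so the demo does not clash with 8080.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/health"

results = [probe(url) for _ in range(3)]
missing = probe(url.replace("/health", "/nope"))  # expected to fail with 404
server.shutdown()

for passed, elapsed_ms in results:
    print(f"passed={passed} elapsed={elapsed_ms:.1f} ms")
```

Pointing `probe` at the real app from inside the VM would only measure the app itself, not the path through the proxy, so a passing local probe does not rule out a problem between the edge and the instance.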

Hmm, it sounds that way. Has it been slow for you in particular, or are you only seeing this in metrics? Do you know what region your requests are coming from and what regions your app is running in?

If it’s a web app, it sometimes helps to open the browser’s network tab and see if anything is coming down particularly slowly.

Could you share a copy/paste or screenshot of what you’re seeing that indicates your app is critical?

Edit: fly checks list --app <your-app> may give you a bit more detail as well if all you know so far is that it’s critical. Checking the logs may point to something too!
