Help me understand metrics and issues

p-tek · March 17, 2023, 2:44pm

Hi,

I’m not quite sure I understand how to read the metrics in the Dashboard and Grafana, as I can’t get them to make sense. In the Dashboard, sometimes I see HTTP response times measured in seconds. When I click the “Grafana” button to open the metrics in Grafana, the response times are much lower (usually < 500 milliseconds). Is there a bug in the Dashboard that inflates the response times, or are these different queries? This usually happens when monitoring reports that one or more of the app instances are in a critical state.

I just had a case where the Dashboard showed 3 seconds for the 0.5 quantile, under 500 ms for 0.95 (!), and over 5 seconds for 0.99. It looks wrong, and it is also completely different from the Grafana stats for the same timespan.
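To show why those numbers look wrong: quantiles computed from a single set of samples are always non-decreasing, so a 0.5 value above the 0.95 value cannot come from one distribution over one time window. Here is a minimal Python sketch with made-up latencies. The `quantile` helper and the sample values are illustrative assumptions, not the query the Dashboard runs.

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile of a list of samples, for 0 < q <= 1."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Made-up response times in milliseconds, for illustration only.
latencies_ms = [1, 2, 2, 3, 450, 480, 3000, 3100, 5200, 5400]

p50, p95, p99 = (quantile(latencies_ms, q) for q in (0.5, 0.95, 0.99))

# For any single sample set, this ordering always holds.
assert p50 <= p95 <= p99
print(p50, p95, p99)
```

If a panel shows the 0.5 line above the 0.95 line, the lines are probably being computed over different series or windows, which is worth checking in the underlying queries.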

I also don’t fully get why my apps go critical so often. I assume Fly reports them as critical when the health check is not responding, but I can see from my application logs that the health check endpoint is responding as usual (and it is responding in less than 1 millisecond - just a simple 200 OK). I just had two of three nodes go critical. In the logs, I could see health checks being performed. When I restarted the apps, they became healthy again. There are no errors in the app logs. Memory usage is far below max for the instances. Dedicated CPU VMs.

It looks to me like the issues are outside the application, but if anybody has experienced the same type of behavior from their apps and managed to fix it, I’d be happy to learn about it.

jphenow · March 17, 2023, 7:32pm

Hey! If you select Fly Edge > HTTP (top right in Grafana), do those numbers look closer to what you’re seeing on the dashboard?

We have separate metrics for the app lifecycle and for the edge request/response lifecycle. Edge response time is basically app response time plus the networking/proxying needed to respond from our edge nodes.
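As a rough illustration of that relationship, the sketch below subtracts app time from edge time for a few requests. The paired timings are hypothetical (the metrics are not exposed per request like this), and `outside_app_ms` is a helper name made up for this example. Subtracting one percentile from another is only an approximation, which is why the sketch works per request.

```python
# Hypothetical per-request timings in milliseconds: (edge_ms, app_ms).
samples = [(3100.0, 0.8), (420.0, 0.9), (5200.0, 1.1)]

def outside_app_ms(edge_ms, app_ms):
    """Time spent outside the app (network + proxy), never negative."""
    return max(0.0, edge_ms - app_ms)

for edge_ms, app_ms in samples:
    overhead = outside_app_ms(edge_ms, app_ms)
    share = overhead / edge_ms  # fraction of the edge time spent outside the app
    print(f"edge={edge_ms:.0f} ms app={app_ms:.1f} ms "
          f"outside={overhead:.1f} ms ({share:.0%})")
```

When the app answers in about a millisecond and the edge time is measured in seconds, nearly all of the time is spent outside the app process.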

Clicking “Show Queries” on the Fly.io dashboard metrics will show the query we run to get those. You should be able to inspect the JSON for the Grafana dashboard panels as well if you’re interested in comparing them more closely.

Could you share some bits of your fly.toml here?

p-tek · March 17, 2023, 7:52pm

Thank you for getting back to me on this.

I checked the Fly Edge metrics, and these match what I see on the Dashboard with the long response times. Does this mean that the slowdown is happening in the network, before/after the request hits the app?

Here is most of the fly.toml config:

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[experimental]
  auto_rollback = true

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false
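To compare against what the platform reports, the same path and 2000 ms timeout can be applied from a client. This is a minimal Python sketch using only the standard library. It is not how Fly runs its checks, and `probe` is a helper name made up for this example.

```python
import time
import urllib.error
import urllib.request

def probe(url, timeout_s=2.0):
    """Return (status, elapsed_seconds); status is None on timeout or connection error."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        status = err.code
    except OSError:
        # Covers URLError, socket timeouts, and refused connections.
        status = None
    return status, time.monotonic() - start
```

Calling `probe("https://<your-app-hostname>/health")` every 10 seconds (the hostname is a placeholder) and logging the result alongside the app’s own logs would show whether slow or failed responses line up with the periods where instances are reported as critical.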

jphenow · March 21, 2023, 2:00pm

Hmm it sounds that way. Has it been slow for you in particular or are you only seeing this in metrics? Do you know what region your requests are coming from and what regions your app is running in?

If it’s a web app, sometimes it helps to open the browser’s network tab and see if anything is coming down particularly slowly.

Could you share a copy/paste or screenshot of what you’re seeing that indicates your app is critical?

Edit: fly checks list --app <your-app> may give you a bit more info as well if you haven’t found more info other than that it’s critical. Checking logs may indicate some things too!

p-tek · March 27, 2023, 10:49am

This is an API, so what happens when the response times are high is that some of the consumers time out, as they have set limits as to how long they will wait for a reply. The requests mainly come from Frankfurt and the US. There are just a few consumers using the API, and the volume of requests is low.

To see that the app is critical, I check the status of the health check in the monitoring tab in the Dashboard. It will list e.g. “1 total, 1 critical” for the affected instances. One of the endpoints the app exposes echoes the version of the current deployment. When the app is flagged as critical, this endpoint will often timeout or be very slow to respond. The data in this endpoint is served from memory, as the version info is read from build time assets during startup and cached. Given low request volume and no IO, it should respond very fast.

jphenow · March 29, 2023, 4:00pm

Bah, that’s weird. When you notice it go critical, running fly checks list might get you some more clues; we put some of our check logs in that output.

Taking a stab at some things you could try though depending on your situation. How is the API accessed? I wonder if something with TLS is off and it’s manifesting as a critical healthcheck failure.

Here’s someone who only accesses their app via Flycast (no public IP) and had to disable force_https: “Flycast ip doesn't resolve to app”. This is probably less desirable if you’re intending to access the API publicly.

You might also check if your certificates are operational: Certificate issuance taking a long time. You can check your certificates on the dashboard under [your app] > Certificates or via CLI using fly certs.

scriptjs · March 30, 2023, 12:29pm

@jphenow It should be noted that I terminate SSL on my side and use Let’s Encrypt to generate a cert.

p-tek · March 31, 2023, 12:01pm

The app is accessed over a public IP using HTTPS. The certificates are all valid and operational from what I can see on the Dashboard and from the fly certs commands.

For now I am chalking this up to infrastructure instability that somehow results in the application not passing health checks.

system · April 7, 2023, 12:01pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
