Help me understand metrics and issues

Hi,

I’m not quite sure I understand how to read the metrics in the Dashboard and Grafana, as I can’t get them to make sense. In the Dashboard, sometimes I see HTTP response times measured in seconds. When I click the “Grafana” button to open the metrics in Grafana, the response times are much lower (usually < 500 milliseconds). Is there a bug in the Dashboard that inflates the response times, or are these different queries? This usually happens when monitoring reports that one or more of the app instances are in a critical state.

I just had a case of the Dashboard showing 3 seconds for the 0.5 quantile, < 500 ms for 0.95 (!), and over 5 seconds for 0.99. It looks wrong, and also completely different from the Grafana stats for the same timespan.
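
To make the "looks wrong" part concrete: quantiles computed from a single set of samples can never decrease as the quantile rises, so a 0.95 value below the 0.5 value suggests the three figures come from different series or aggregation windows. Here is a minimal Python sketch of that sanity check. The latency samples are made up, the helper names are hypothetical, and the dashboard figures are rounded from the numbers above (500 ms and 5 s are treated as exact values).

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile of samples, for 0 < q <= 1."""
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least a fraction q of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def is_monotonic(reported):
    """reported maps quantile -> value; True if values never decrease as the quantile rises."""
    values = [reported[q] for q in sorted(reported)]
    return all(a <= b for a, b in zip(values, values[1:]))

# Hypothetical latencies in seconds (not real Fly.io data).
samples = [0.12, 0.2, 0.25, 0.3, 0.31, 0.4, 0.45, 3.0, 5.2, 6.0]
computed = {q: quantile(samples, q) for q in (0.5, 0.95, 0.99)}

# Approximate figures from the Dashboard as described in this post (assumption: rounded).
dashboard = {0.5: 3.0, 0.95: 0.5, 0.99: 5.0}

print("computed from one sample set:", computed, "monotonic:", is_monotonic(computed))
print("dashboard figures:", dashboard, "monotonic:", is_monotonic(dashboard))
```

Quantiles taken from one sample set always pass the check, while the Dashboard figures do not, which is why they cannot all describe the same distribution.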

I also don’t fully get why my apps go critical so often. I assume Fly reports them as critical when the health check is not responding, but I can see from my application logs that the health check endpoint is responding as usual (and it is responding in less than 1 millisecond - just a simple 200 OK). I just had two of three nodes go critical. In the logs, I could see health checks being performed. When I restarted the apps, they became healthy again. There are no errors in the app logs. Memory usage is far below max for the instances. The instances run on dedicated CPU VMs.

It looks to me like the issues are outside the application, but if anybody has experienced the same type of behavior from their apps and managed to fix it, I’d be happy to learn about it.

Hey! If you select Fly Edge (top right in Grafana) > HTTP, do those numbers look closer to what you’re seeing on the dashboard?

We have different metrics for the app lifecycle and for the edge request/response lifecycle. Edge response time is basically app response time + the networking/proxying to respond from our edge nodes.

Clicking “Show Queries” on the Fly.io dashboard metrics will show the query we run to get those. You should be able to inspect the JSON for the Grafana dashboard panels as well if you’re interested in comparing them more closely.

Could you share some bits of your fly.toml here?

Thank you for getting back to me on this.

I checked the Fly Edge metrics, and these match what I see on the Dashboard with the long response times. Does this mean that the slowdown is happening in the network, before/after the request hits the app?

Here is most of the fly.toml config:

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[experimental]
  auto_rollback = true

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

Hmm it sounds that way. Has it been slow for you in particular or are you only seeing this in metrics? Do you know what region your requests are coming from and what regions your app is running in?

If it’s a web app, sometimes it helps to open the browser’s network tab and see if anything is coming down particularly slowly.


Could you share a copy/paste or screenshot of what you’re seeing that indicates your app is critical?

Edit: fly checks list --app <your-app> may give you a bit more info as well if you haven’t found more info other than that it’s critical. Checking logs may indicate some things too!

This is an API, so what happens when the response times are high is that some of the consumers time out, as they have set limits as to how long they will wait for a reply. The requests mainly come from Frankfurt and the US. There are just a few consumers using the API, and the volume of requests is low.

To see that the app is critical, I check the status of the health check in the monitoring tab in the Dashboard. It will list e.g. “1 total, 1 critical” for the affected instances. One of the endpoints the app exposes echoes the version of the current deployment. When the app is flagged as critical, this endpoint will often time out or be very slow to respond. The data in this endpoint is served from memory, as the version info is read from build-time assets during startup and cached. Given low request volume and no IO, it should respond very fast.
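
One way to capture the slow responses from the client side is to time repeated requests against that version endpoint and record timeouts. Below is a minimal Python sketch using only the standard library. The function names and the example URL are hypothetical, and the 5-second timeout is an arbitrary stand-in for the consumers' own limits.

```python
import time
import urllib.error
import urllib.request

def time_request(url, timeout=5.0):
    """Return (status, elapsed_seconds); status is None on timeout or connection error."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include body transfer in the measured time
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        status = None  # timeout, DNS failure, refused connection, etc.
    return status, time.perf_counter() - start

def probe(url, count=10, pause=1.0, timeout=5.0):
    """Time `count` sequential requests, sleeping `pause` seconds between them."""
    results = []
    for i in range(count):
        results.append(time_request(url, timeout))
        if i < count - 1:
            time.sleep(pause)
    return results

# Example usage (hypothetical URL, replace with the real version endpoint):
# for status, seconds in probe("https://example-app.fly.dev/version"):
#     print(status, f"{seconds:.3f}s")
```

Running this from more than one location while an instance is flagged as critical would show whether the delay is visible to every client or only along some network paths.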

Bah, that’s weird. When you notice it go critical - running that fly checks list might get you some more clues, we put some of our check logs in that output.


Taking a stab at some things you could try, though, depending on your situation. How is the API accessed? I wonder if something with TLS is off and it’s manifesting as a critical healthcheck failure.

Here’s someone who only accesses their app via flycast (no public IP) and had to disable force_https: “Flycast ip doesn't resolve to app”. This is probably less desirable if you’re intending to access the API publicly.

You might also check if your certificates are operational: “Certificate issuance taking a long time”. You can check your certificates on the dashboard under [your app] > Certificates or via CLI using fly certs.

@jphenow It should be noted that I terminate SSL on my side and use Let’s Encrypt to generate a cert.

The app is accessed over a public IP using HTTPS. The certificates are all valid and operational from what I can see on the Dashboard and from the fly certs commands.

For now I am chalking this up to infrastructure instability that somehow ends up as the application not passing health checks.
