Early access: build Grafana dashboards from Fly metrics

Check out our shiny new metrics engine: Feature preview: Custom metrics

We have a chunky Prometheus (well, VictoriaMetrics) cluster with all kinds of useful application metrics. It’s what powers the graphs on our web UI, but there’s a bunch more in there.

You can use these with Grafana to make neat dashboards, alert on metrics, etc.

We’re working on a prebuilt Grafana dashboard with some interesting graphs, but if you want to try it out before we publish instructions and examples, you should!

API Details

  1. The Prometheus API is available at https://api.fly.io/prometheus/api/v1/
  2. Send an Authorization: Bearer <TOKEN> header to authenticate (you can run flyctl auth token to get your token)
  3. Run some queries
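
The three steps above can be sketched in Python. This is only a rough sketch, not official client code: build_query_request is a made-up helper name, and it assumes the standard Prometheus instant-query path ("query") is served under the base URL from step 1.

```python
import urllib.parse
import urllib.request

# Base URL from step 1. The "query" path is the standard Prometheus
# instant-query endpoint; that it is served here is an assumption.
BASE_URL = "https://api.fly.io/prometheus/api/v1/"


def build_query_request(token: str, promql: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated instant-query request."""
    params = urllib.parse.urlencode({"query": promql})
    req = urllib.request.Request(BASE_URL + "query?" + params)
    # Step 2: the token is whatever `flyctl auth token` prints.
    req.add_header("Authorization", f"Bearer {token}")
    return req
```

To actually run a query, pass the returned request to urllib.request.urlopen and parse the JSON body of the response.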

Grafana Cloud Setup

If you don’t have Grafana yet, the free Grafana Cloud plan will work fine for this: Grafana Cloud | Grafana Labs

Once you’re in your spiffy new Grafana instance, add a Prometheus data source like so (note the URL and “Custom HTTP Headers” section):

From there, you can create a Dashboard, then add a Panel. Try a query like this:

sum(rate(edge_http_responses_count{app="<APP NAME>"}[$__interval])) by (status)

Series and Labels

We’ve exposed a bunch of different series. You will need to supply an app="<NAME>" label to each. Here’s a quick list:

// responses from load balancer
edge_http_responses_count
edge_http_response_time_seconds_bucket

// connections to anycast IPs
edge_tcp_connects_count
edge_tcp_disconnects_count

// responses from app vms using http handler
app_http_responses_count
app_http_response_time_seconds_bucket
app_local_connect_time_seconds_bucket

// tcp connections to app vms
app_local_connects_count
app_local_disconnects_count

// data in and out through load balancer
anycast_data_out
anycast_data_in

// vm memory metrics
firecracker_vm_memory_buffers
firecracker_vm_memory_cached
firecracker_vm_memory_mem_free
firecracker_vm_memory_mem_available
firecracker_vm_memory_mem_total
firecracker_vm_memory_swap_cached
firecracker_vm_memory_vmalloc_used
firecracker_vm_memory_active
firecracker_vm_memory_inactive

// other vm metrics
firecracker_vm_cpu
firecracker_vm_load_average
firecracker_vm_net_sent_bytes
firecracker_vm_disk_time_io

// network interface metrics
node_network_transmit_bytes
node_network_receive_bytes
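
Since every series needs the app label, a tiny helper can keep queries consistent if you generate them from a script. This is just an illustrative sketch: with_app and rate_by are hypothetical names, and the helper does no escaping of quotes in label values.

```python
def with_app(metric: str, app: str, **labels: str) -> str:
    """Return a PromQL selector for `metric` scoped to a single app."""
    matchers = {"app": app, **labels}
    body = ",".join(f'{name}="{value}"' for name, value in matchers.items())
    return f"{metric}{{{body}}}"


def rate_by(metric: str, app: str, by: str, window: str = "$__interval") -> str:
    """Build a sum(rate(...)) by (...) query like the example earlier in the post."""
    # "$__interval" is a Grafana variable; outside Grafana pass e.g. "5m".
    return f"sum(rate({with_app(metric, app)}[{window}])) by ({by})"
```

For example, rate_by("edge_http_responses_count", "my-app", "status") produces the same shape of query as the panel example above, with "my-app" as a placeholder app name.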

Give it a try, let us know what you think.

Known missing pieces

We’re missing a few features to really make Grafana nice:

  1. Support for autocomplete in queries. This will be fixed before we officially™️ ship this.
  2. Querying metrics for multiple apps: right now, queries need to include an app name, there’s no way to get metrics that combine apps.

Grafana has a neat map visualization:

Just add this: https://grafana.com/grafana/plugins/grafana-worldmap-panel

Use a query like this:

sum(rate(edge_http_responses_count{app="<NAME>"}[$__interval])) by (region)

Then set the map data JSON endpoint to https://api.fly.io/meta/regions.json:

This is amazing!

Heatmaps are pretty great for showing response times:

Settings

  1. Query:
     sum(increase(edge_http_response_time_ns_bucket{app="$app"}[$__interval])) by (le)
    
    • Legend: {{le}}
    • Min step: 1m
    • Format: Heatmap
    • Visualization: Heat map
  2. Axes: Y-Axis
    • Unit: seconds
    • Data format: Time series buckets
  3. Display
    • Colors: spectrum
    • Scheme: Plasma
    • Color scale min: 0
    • Show legend: on
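
Some background on why the query groups by le: Prometheus-style histogram buckets are cumulative, so each le series counts every observation at or below that bound. As far as I understand, the Heatmap format option on the query is what turns those into per-bucket counts before drawing. A minimal sketch of that conversion, using made-up numbers and a hypothetical function name:

```python
from typing import Dict


def per_bucket_counts(cumulative: Dict[str, float]) -> Dict[str, float]:
    """Convert cumulative `le` bucket counts into per-bucket counts."""
    # Sort by upper bound; float("+Inf") parses to infinity and sorts last.
    bounds = sorted(cumulative, key=float)
    counts = {}
    previous = 0.0
    for le in bounds:
        # Each bucket's own count is its cumulative total minus the
        # cumulative total of the next-smaller bucket.
        counts[le] = cumulative[le] - previous
        previous = cumulative[le]
    return counts


# Made-up example: 100 requests, 40 of them at or under 0.1s.
example = {"0.1": 40, "0.5": 90, "1": 98, "+Inf": 100}
print(per_bucket_counts(example))
```

With these numbers, 50 requests fall between 0.1s and 0.5s, 8 between 0.5s and 1s, and 2 took longer than 1s.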

Grafana JSON

{
  "type": "heatmap",
  "title": "Response Times",
  "gridPos": {
    "x": 9,
    "y": 14,
    "w": 9,
    "h": 9
  },
  "id": 23763571993,
  "targets": [
    {
      "expr": "sum(increase(edge_http_response_time_ns_bucket{app=\"$app\"}[$__interval])) by (le)",
      "legendFormat": "{{le}}",
      "interval": "1m",
      "refId": "A",
      "format": "heatmap"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "custom": {}
    },
    "overrides": []
  },
  "pluginVersion": "7.1.5",
  "legend": {
    "show": true
  },
  "tooltip": {
    "show": true,
    "showHistogram": false
  },
  "heatmap": {},
  "cards": {
    "cardPadding": null,
    "cardRound": null
  },
  "color": {
    "mode": "spectrum",
    "cardColor": "#b4ff00",
    "colorScale": "sqrt",
    "exponent": 0.5,
    "colorScheme": "interpolatePlasma",
    "min": 0
  },
  "dataFormat": "tsbuckets",
  "yBucketBound": "middle",
  "xAxis": {
    "show": true
  },
  "yAxis": {
    "show": true,
    "format": "s",
    "decimals": 0,
    "logBase": 1,
    "splitFactor": null,
    "min": "0",
    "max": null
  },
  "highlightCards": true,
  "timeFrom": null,
  "timeShift": null,
  "reverseYBuckets": false,
  "xBucketSize": null,
  "xBucketNumber": null,
  "yBucketSize": null,
  "yBucketNumber": null,
  "hideZeroBuckets": false,
  "datasource": null
}

Hi! I’d be eager to try this out but I can’t get it to work.
https://api.fly.io/prometheus/api/v1 responds with 404.
Is this still available for us to play with?

It should be, but the endpoint should be https://api.fly.io/prometheus/ for the Grafana data sources (as per the screenshot).

Dj

That doesn’t work either :(

Ah! We’ve only implemented these two URLs so far:

Grafana “knows” about these, so when you set it up all you have to tell it is https://api.fly.io/prometheus/. We use the Ruby Prometheus client, which also just needs the base URL. But for something like cURL you’ll need a more specific endpoint.
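
To illustrate the difference between the base URL and a specific endpoint, here is a rough Python sketch that builds a full range-query URL. The api/v1/query_range path and its query, start, end and step parameters come from the standard Prometheus HTTP API; whether that path is one of the URLs implemented here is an assumption, and range_query_url is a made-up name.

```python
import urllib.parse

# The base URL is all that Grafana needs to be given.
BASE_URL = "https://api.fly.io/prometheus/"


def range_query_url(promql: str, start: int, end: int, step: str) -> str:
    """Full URL for a range query, as a tool like cURL would need it."""
    # "api/v1/query_range" is the standard Prometheus path; that it is
    # implemented on this API is an assumption.
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    return urllib.parse.urljoin(BASE_URL, "api/v1/query_range") + "?" + params
```

The start and end values are Unix timestamps in seconds. The request still needs the Authorization: Bearer <TOKEN> header described in the first post.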

Now it works, thanks!

Good deal! Definitely let us know if you run into any problems. This could be quite powerful, and we want to make sure it’s nice and solid before we launch it.

Works fine so far, impressive work! I can’t get the worldmap working though.

The worldmap is a little flaky, it seems. Here’s the panel JSON for the one I have working:

{
  "circleMaxSize": "10",
  "circleMinSize": 2,
  "colors": [
    "rgba(245, 54, 54, 0.9)",
    "rgba(237, 129, 40, 0.89)",
    "rgba(50, 172, 45, 0.97)"
  ],
  "decimals": 0,
  "esMetric": "Count",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "align": null
      },
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      }
    },
    "overrides": []
  },
  "gridPos": {
    "h": 14,
    "w": 18,
    "x": 0,
    "y": 0
  },
  "hideEmpty": false,
  "hideZero": false,
  "id": 4,
  "initialZoom": 1,
  "jsonUrl": "https://api.fly.io/meta/regions.json",
  "locationData": "json endpoint",
  "mapCenter": "(0°, 0°)",
  "mapCenterLatitude": 0,
  "mapCenterLongitude": 0,
  "maxDataPoints": 1,
  "mouseWheelZoom": false,
  "pluginVersion": "7.1.5",
  "showLegend": true,
  "stickyLabels": false,
  "tableQueryOptions": {
    "geohashField": "geohash",
    "latitudeField": "latitude",
    "longitudeField": "longitude",
    "metricField": "metric",
    "queryType": "geohash"
  },
  "targets": [
    {
      "expr": "sum(rate(edge_http_responses_count{app=\"$app\"}[$__interval])) by (region)",
      "interval": "",
      "legendFormat": "{{status}}",
      "refId": "A"
    }
  ],
  "thresholds": "0,10",
  "timeFrom": null,
  "timeShift": null,
  "title": "Requests by Region",
  "type": "grafana-worldmap-panel",
  "unitPlural": "",
  "unitSingle": "",
  "valueName": "total",
  "datasource": null
}

Thanks, it helped. "legendFormat": "{{status}}" was missing for me.

Oh nice! If you end up doing other interesting stuff in Grafana, will you post about it here?

Sure. Everything looks pretty solid – maybe the negative memory consumption once today was a bit strange.

Oh that’s odd! We’re not that good at VMs yet.

What query gave you negatives on memory? I can have a look and see why it might’ve done that.

avg(firecracker_vm_memory_mem_total{app="$__app"}) - avg(firecracker_vm_memory_mem_free{app="$__app"})

Btw, would it be possible to collect app specific metrics too? Maybe as a backing service similarly to Redis?

Yes, we’d like to let people collect custom metrics. It’s a fun technical challenge because Prometheus-like databases have problems with too many different series (every distinct metric + label combination is its own series). High cardinality breaks things and/or makes them expensive to run.
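
To put rough numbers on the cardinality problem: the worst-case series count for one metric is the product of the number of distinct values each label can take. A quick sketch with entirely made-up label counts (max_series is a hypothetical name):

```python
import math
from typing import Dict


def max_series(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return math.prod(distinct_values_per_label.values())


# Made-up numbers: a metric labelled only by region and status stays small...
print(max_series({"region": 20, "status": 50}))
# ...but adding one unbounded label such as a user ID multiplies it out.
print(max_series({"region": 20, "status": 50, "user_id": 100_000}))
```

The first call gives 1,000 possible series and the second gives 100,000,000, which is why a single high-cardinality label in a custom metric can be so costly.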

We’ve experimented with this a little: if you send a fly-cache-status: HIT header, you’ll see a chart appear on our UI. You can use any status you want there. We might expose other named metrics and let apps populate them.

The next step is to make it easy to scrape metrics from your app instances so you can turn on something like paid Grafana cloud and at least get them there!

Great, keep us informed!
Not sure what you mean by sending a fly-cache-status header. Do I send it from an instance in an HTTP response?