Preview: Managed Grafana Dashboards for Fly Apps

We just launched fly-metrics.net, a preview of a managed Grafana service for Fly.io that you can use for setting up advanced metrics queries and detailed dashboards.

The service is linked to your Fly.io account and is automatically integrated with the built-in Prometheus data sources in your Fly.io Organizations. It comes provisioned with three new detailed dashboards that visualize all of the built-in metrics published from your apps (except Postgres for now). You can also use the Explore tab and create new dashboards for visualizing custom metrics or any further customization.

You can switch orgs using the ‘Switch organization’ menu item in the profile flyout at the bottom-left of the screen:

image

Though you’ve always been able to set up your own Grafana instance, we hope this managed service will be a convenient tool for quickly getting started with Prometheus metrics without the friction of hooking everything up yourself. The built-in dashboards help you quickly dig into what’s happening with your apps in a little more detail than we can fit on the Metrics tab on the Fly.io dashboard.

Please try it out, let us know what you think and send feedback and suggestions! We’d love to keep improving this service and make it even more useful to you all.

Nice! Any plans to support multiple metrics endpoints per app? I know most apps should “only be one thing” but sometimes there are two services it makes sense to run together that both expose some metrics.

This looks awesome! I’m always impressed with the speed at which all you folks at fly release these amazing features/services.

We’re running fly-log-shipper in our account connected to an instance of Grafana Loki. I’d love to be able to add our Loki instance as a datasource here. (Or have fly manage that as well. You do already have all of our logs… :wink:)

I did notice that when I select “Switch organization” I see one of my organizations repeated 5 times. It’s an org that I created and then deleted a handful of times, and I’m guessing that has something to do with it.

I wrote a review last week on the elixir forums and said the dashboard was passable, but this is damn nice.

Thanks for all the feedback so far!

Thanks for the request! No ETA, but this is definitely something we can look into. We do have some support for multi-process apps for these kinds of edge cases, and I could imagine the need to scrape from multiple metrics endpoints in those scenarios.

Just pushed a quick fix, thanks for catching this!

Though I agree it would be useful, allowing users to set up datasources for external instances would make the service more complicated to manage. For now we still recommend using your own Grafana instance for external datasources or further customization along those lines, if only to keep the managed service simple and narrowly focused.

We only added Prometheus to offer a simple service at first, but we’re looking into ways we might eventually expand this to add a built-in datasource for app logs, or maybe even our GraphQL API.

  1. On the “fly-app” dashboard: The network-io graph shows the label of “instance + region”. That is more useful than just the “instance” which is shown on all the other graphs and can be used to tune a region.
  2. On the “fly-edge” dashboard, I made a table to show where the edge traffic was going out vs where the instances were. The heatmap and data-in/data-out indicated that there is interesting data, but it was hard to put together. I used these two queries to find out that I didn’t pick the optimal regions. Something like this might be handy on the fly-app dashboard, too. For instance, I have AMS as one of my regions, but the majority of the traffic is exiting out of FRA, etc. I think the “heatmap” on fly-app could be more useful as a table perhaps.
label_uppercase(sum(rate(fly_edge_data_out{app="appname-prod"}[$__range]))by(region), "region") or 0
label_uppercase(sum(rate(fly_instance_net_sent_bytes{app="appname-prod"}[$__range]))by(region), "region") or 0
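
To illustrate how the two query results above can be read side by side, here is a minimal Python sketch. It assumes each query result has already been reduced to a mapping of region to per-second rate; the function names and the sample numbers are made up for illustration and are not part of the dashboards.

```python
def compare_regions(edge_out, instance_out):
    """Return (region, edge_rate, instance_rate) rows, busiest edge region first."""
    rows = []
    for region in sorted(set(edge_out) | set(instance_out)):
        # A region missing from one result is treated as 0, like the `or 0` in the queries.
        rows.append((region, edge_out.get(region, 0.0), instance_out.get(region, 0.0)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def regions_without_instances(edge_out, instance_out):
    """Regions where traffic leaves the edge but no instance in that region is sending."""
    return [
        region
        for region, edge, inst in compare_regions(edge_out, instance_out)
        if edge > 0 and inst == 0
    ]


# Hypothetical values: instances only in AMS, most edge traffic leaving via FRA.
edge_out = {"FRA": 5200.0, "AMS": 300.0, "GRU": 40.0}
instance_out = {"AMS": 5400.0}

for region, edge, inst in compare_regions(edge_out, instance_out):
    print(f"{region}: edge={edge:.0f}/s instance={inst:.0f}/s")
print("candidates for a new region:", regions_without_instances(edge_out, instance_out))
```

A region that ranks high on edge traffic but has no instance traffic is a hint that an instance there might be worth trying.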

Edit:

OK, the heatmap on the fly-app dashboard is actually that, but I didn’t understand how to read it. The circles are the instances, and the heatmap itself shows where the data is exiting from the edges.

Thanks for the feedback! I gave the Fly-App dashboard a small tweak based on your suggestions:

  1. All instance labels now have region appended.
  2. There’s now a tooltip on the Data Out map for better contextual info (and to help distinguish between instance and edge; I agree it’s a bit hard to understand). I’ve also squeezed in a table displaying the same data.

Thanks, really appreciate the changes.

Another comment regarding color scheme for the heatmap. It’s extremely hard to see the dark blue when looking at the map.

image

I accidentally moused over South America and found that there is an edge (GRU) that has traffic.

How do I log out from fly-metrics.net? I have separate work and personal accounts. Thanks

Logging out is currently a bit of a manual process:

  1. Sign out of fly.io (fly.io/app/sign-out, or Account → Sign out from the Dashboard);
  2. Sign out of fly-metrics.net (fly-metrics.net/logout, there’s currently no link in the UI so just go to that page directly).

If you don’t sign out of fly.io first, it’ll automatically log you back in after signing out of fly-metrics.net.

We should be able to make some improvements to this sign-out flow soon enough, but I hope this info helps for now.

Not exactly sure what happened, but there’s something going on. For a while, none of the pre-configured dashboards were available and now there seems to be an old version without the region labels on some of the graphs.

Looks like the dashboards got accidentally reverted to a previous version when the instance rebooted. Sorry about that, and thanks for the heads up!

They should be updated now. You might have to log out (fly-metrics.net/logout) to get it to fully reload the dashboards.

@wjordan do you have any resources on how to configure/connect to the Prometheus on Fly datasource for our custom Grafana instance we’re already running on Fly?

The “Metrics on Fly” page in the Fly Docs should cover connecting a custom Grafana instance to the built-in Prometheus datasource. In short, you connect to https://api.fly.io/prometheus/<org-slug>/, passing your access token in an Authorization: Bearer <token> request header.
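
As a rough sketch of what such a request looks like outside Grafana, the Python below builds (but does not send) an instant query against that endpoint. It assumes the standard Prometheus HTTP API path api/v1/query is served under the base URL; the org slug, token, and helper name are placeholders, not real values.

```python
import urllib.parse
import urllib.request

# Base URL from the post above; the org slug is appended per organization.
BASE_URL = "https://api.fly.io/prometheus"


def build_query_request(org_slug, token, promql):
    """Build an instant-query request with the access token as a Bearer header.

    Assumption: the endpoint exposes the standard Prometheus path api/v1/query.
    """
    params = urllib.parse.urlencode({"query": promql})
    url = f"{BASE_URL}/{org_slug}/api/v1/query?{params}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder org slug and token, for illustration only.
req = build_query_request("my-org", "example-token", "up")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen would actually send it. In a custom Grafana instance, the equivalent is a Prometheus datasource pointing at the same base URL with the Authorization header configured on it.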

Perfect. Exactly what I need. Thank you!

I created a new organization, and it was impossible to switch to that organization until I followed these secret, forum-only instructions to log out and log back in.

Love this! Already using the dashboard a ton. Wondering, any plans for having alerting based on these metrics?

E.g. if CPU usage hits 80% for 5min, send a Slack message / email.
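
To make the example rule concrete, here is a minimal sketch of the check itself in Python. It is purely illustrative: it is not a feature of the managed service, and the function name, the sample format (timestamp in seconds, CPU percent), and the defaults are assumptions.

```python
def should_alert(samples, threshold=80.0, window_seconds=300):
    """Return True if every sample in the trailing window is at or above the threshold.

    samples: list of (unix_timestamp, cpu_percent) tuples, sorted by timestamp.
    """
    if not samples:
        return False
    cutoff = samples[-1][0] - window_seconds
    if samples[0][0] > cutoff:
        # Not enough history yet to cover the whole window.
        return False
    recent = [value for ts, value in samples if ts >= cutoff]
    return all(value >= threshold for value in recent)


# Hypothetical samples, one per minute for ten minutes.
sustained = [(ts, 90.0) for ts in range(0, 601, 60)]
dipped = sustained[:-3] + [(480, 50.0), (540, 90.0), (600, 90.0)]

print(should_alert(sustained))  # sustained load over the last 5 minutes
print(should_alert(dipped))     # one dip below the threshold inside the window
```

Sending the Slack message or email would be a separate step once the check returns True.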

Thanks for reporting this issue! Organizations are currently only synchronized with the Grafana service when an access token is created/updated (every 2 hours), so there is an unfortunate delay if you add an org while already signed into Grafana. We should be able to handle this edge-case better with a bit of work.

A sign-out link in Grafana is also still in the works. Hitting the logout URL manually is just a workaround until then.

We’ve had some discussions about alerting. I understand how useful it would be, and it’s something we would love to add eventually, but it adds an extra layer of complexity to the service, and it will take some time to sort out all the details. So no promises or ETAs, but we’ll see how it goes!

Will there be an SLO for durability of the dashboards when this leaves preview? I had set up a few dashboards at one point and they got deleted (org ID 31932 and dashboard slugs HdfNhGW4z and TdXRjVZ4z, created on ~August 21 and lost on ~August 23?).