Preview: Managed Grafana Dashboards for Fly Apps

Thanks for all the feedback so far!

Thanks for the request! No ETA, but this is definitely something we can look into. We do have some support for multi-process apps for these kinds of edge cases, and I could imagine the need to scrape from multiple metrics endpoints in those scenarios.

Just pushed a quick fix, thanks for catching this!

Though I agree it would be useful, allowing users to set up datasources for external instances would make the service more complicated to manage. For now we still recommend using your own Grafana instance for external datasources or further customization along those lines, if only to keep the managed service simple and narrowly focused.

We launched with only Prometheus to keep the service simple at first, but we’re looking into ways we might eventually expand it with a built-in datasource for app logs, or maybe even our GraphQL API.

  1. On the “fly-app” dashboard: the network-io graph labels series with “instance + region”. That’s more useful than the bare “instance” shown on all the other graphs, and it helps with tuning a region.
  2. On the “fly-edge” dashboard, I built a table showing where edge traffic exits versus where my instances run. The heatmap and data-in/data-out panels hinted that there was interesting data, but it was hard to piece together. I used the two queries below to discover that I hadn’t picked the optimal regions: for instance, I have AMS as one of my regions, but the majority of the traffic exits via FRA. Something like this might be handy on the fly-app dashboard too; the “heatmap” there could perhaps be more useful as a table.
# Edge data out, by region:
label_uppercase(sum(rate(fly_edge_data_out{app="appname-prod"}[$__range])) by (region), "region") or 0
# Data sent from instances, by region:
label_uppercase(sum(rate(fly_instance_net_sent_bytes{app="appname-prod"}[$__range])) by (region), "region") or 0

Edit:

OK, the heatmap on the fly-app dashboard actually does show that, but I didn’t understand how to read it. The circles are the instances, and the heatmap itself shows where the data exits from the edges.


Thanks for the feedback! I gave the Fly-App dashboard a small tweak based on your suggestions:

  1. All instance labels now have region appended.
  2. There’s now a tooltip on the Data Out map for better contextual info (and to help distinguish between instance and edge; I agree it’s a bit hard to understand). I’ve also squeezed in a table displaying the same data.

Thanks, really appreciate the changes.

Another comment, regarding the heatmap’s color scheme: the dark blue is extremely hard to see when looking at the map.


I accidentally moused over South America and found that there’s an edge (GRU) with traffic.

How do I log out from fly-metrics.net? I have separate work and personal accounts. Thanks

Logging out is currently a bit of a manual process:

  1. Sign out of fly.io (fly.io/app/sign-out, or Account → Sign out from the Dashboard);
  2. Sign out of fly-metrics.net (fly-metrics.net/logout, there’s currently no link in the UI so just go to that page directly).

If you don’t sign out of fly.io first, it’ll automatically log you back in after signing out of fly-metrics.net.

We should be able to make some improvements to this sign-out flow soon enough, but I hope this info helps for now.


Not exactly sure what happened, but something is going on. For a while, none of the pre-configured dashboards were available, and now there seems to be an old version, without the region labels on some of the graphs.

Looks like the dashboards got accidentally reverted to a previous version when the instance rebooted; sorry about that, and thanks for the heads-up!

They should be updated now. You might have to log out (fly-metrics.net/logout) to get the dashboards to fully reload.


@wjordan do you have any resources on how to configure/connect to the Prometheus on Fly datasource for our custom Grafana instance we’re already running on Fly?

Metrics on Fly · Fly Docs should cover connecting a custom Grafana instance to the built-in Prometheus datasource. In short, you connect to https://api.fly.io/prometheus/<org-slug>/, passing your access token in an Authorization: Bearer <token> request header.
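
For a quick sanity check from the command line, something like this should work (a sketch; personal is a placeholder org slug, up is just an example query, and fly auth token prints your current access token):

# Instant query against the managed Prometheus endpoint (org slug is a placeholder)
curl -G https://api.fly.io/prometheus/personal/api/v1/query \
  -H "Authorization: Bearer $(fly auth token)" \
  --data-urlencode 'query=up'

A "status":"success" JSON response means the token and org slug are right.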


Perfect. Exactly what I need. Thank you!

I created a new organization, and it was impossible to switch to it until I followed these secret, forum-only instructions to log out and log back in.

Love this! Already using the dashboard a ton. Wondering: any plans for alerting based on these metrics?

E.g. if CPU usage hits 80% for 5min, send a Slack message / email.
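
Spelled out as a query, I imagine the condition looking something like this (just a sketch; I’m assuming the fly_instance_cpu counter and its mode label from the instance-metrics docs):

# Busy CPU % per instance over 5 minutes; fire when above 80 (metric/labels assumed)
sum by (instance) (rate(fly_instance_cpu{app="appname-prod", mode!="idle"}[5m]))
  / sum by (instance) (rate(fly_instance_cpu{app="appname-prod"}[5m])) * 100 > 80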

Thanks for reporting this issue! Organizations are currently only synchronized with the Grafana service when an access token is created or updated (every 2 hours), so there is an unfortunate delay if you add an org while already signed into Grafana. We should be able to handle this edge case better with a bit of work.

A sign-out link in Grafana is also still in the works; hitting the logout URL manually is just a workaround until then.

We’ve had some discussions about this. I understand how useful it would be, and it’s something we’d love to add eventually, but it adds an extra layer of complexity to the service, and it will take some time to sort out all the details. So no promises or ETAs, but we’ll see how it goes!


Will there be an SLO for durability of the dashboards when this leaves preview? I set up a few dashboards at one point and they got deleted (org ID 31932 and dashboard slugs HdfNhGW4z and TdXRjVZ4z, created ~August 21 and lost ~August 23?).

While we don’t currently have published SLAs, it is a goal of this feature for customers to be able to create their own custom dashboards on this managed instance, and for those dashboards to persist. Data is stored on a Fly Volume, so durability expectations are consistent with that feature.

That said, while this service is still in preview, things are changing quickly, and we’re still working through some application-level bugs and issues that can impact custom settings or dashboard data. On Aug 27-28, a bug in the dashboard-provisioning logic caused all existing dashboards to be unexpectedly deleted, and we didn’t recover the lost data from the daily backups in time. Apologies to you and any other early adopters who lost work as a result. Making sure something like that doesn’t happen again is a top priority for this feature.


Hmmm…not sure where to report this, but it is unusual.

On the 22nd, for some reason all the VMs moved to SJC. I restarted them, and as you can see, concurrency has increased quite a bit. However, it’s not due to traffic, nor does the open socket count on the application reflect it. For instance, there are not 150 sockets open on f5c0bfef-dfw:

root@f5c0bfef:/# lsof -ni|grep 4000|grep EST|wc -l
36
root@f5c0bfef:/# lsof -ni|grep 4000|grep EST
beam.smp 516 nobody  106u  IPv6 2427246      0t0  TCP 172.19.2.194:4000->172.19.2.193:47482 (ESTABLISHED)
beam.smp 516 nobody  107u  IPv6 2422583      0t0  TCP 172.19.2.194:4000->172.19.2.193:37636 (ESTABLISHED)
beam.smp 516 nobody  110u  IPv6 2425347      0t0  TCP 172.19.2.194:4000->172.19.2.193:57078 (ESTABLISHED)
beam.smp 516 nobody  111u  IPv6 2179534      0t0  TCP 172.19.2.194:4000->172.19.2.193:60280 (ESTABLISHED)
beam.smp 516 nobody  112u  IPv6 2418635      0t0  TCP 172.19.2.194:4000->172.19.2.193:33204 (ESTABLISHED)
beam.smp 516 nobody  113u  IPv6 2424652      0t0  TCP 172.19.2.194:4000->172.19.2.193:44720 (ESTABLISHED)
beam.smp 516 nobody  114u  IPv6 2425023      0t0  TCP 172.19.2.194:4000->172.19.2.193:36488 (ESTABLISHED)
beam.smp 516 nobody  115u  IPv6 2358509      0t0  TCP 172.19.2.194:4000->172.19.2.193:32782 (ESTABLISHED)
beam.smp 516 nobody  117u  IPv6 2419253      0t0  TCP 172.19.2.194:4000->172.19.2.193:38364 (ESTABLISHED)
beam.smp 516 nobody  119u  IPv6 2370230      0t0  TCP 172.19.2.194:4000->172.19.2.193:46374 (ESTABLISHED)
beam.smp 516 nobody  120u  IPv6 2370232      0t0  TCP 172.19.2.194:4000->172.19.2.193:46414 (ESTABLISHED)
beam.smp 516 nobody  121u  IPv6 2370237      0t0  TCP 172.19.2.194:4000->172.19.2.193:46532 (ESTABLISHED)
beam.smp 516 nobody  122u  IPv6 2370240      0t0  TCP 172.19.2.194:4000->172.19.2.193:46560 (ESTABLISHED)
beam.smp 516 nobody  124u  IPv6 2272257      0t0  TCP 172.19.2.194:4000->172.19.2.193:53300 (ESTABLISHED)
beam.smp 516 nobody  125u  IPv6 2387551      0t0  TCP 172.19.2.194:4000->172.19.2.193:45312 (ESTABLISHED)
beam.smp 516 nobody  126u  IPv6 2358622      0t0  TCP 172.19.2.194:4000->172.19.2.193:38936 (ESTABLISHED)
beam.smp 516 nobody  127u  IPv6 2425351      0t0  TCP 172.19.2.194:4000->172.19.2.193:57208 (ESTABLISHED)
beam.smp 516 nobody  128u  IPv6 2395332      0t0  TCP 172.19.2.194:4000->172.19.2.193:42980 (ESTABLISHED)
beam.smp 516 nobody  129u  IPv6 2404045      0t0  TCP 172.19.2.194:4000->172.19.2.193:38392 (ESTABLISHED)
beam.smp 516 nobody  130u  IPv6 2427257      0t0  TCP 172.19.2.194:4000->172.19.2.193:48766 (ESTABLISHED)
beam.smp 516 nobody  131u  IPv6 2339453      0t0  TCP 172.19.2.194:4000->172.19.2.193:48010 (ESTABLISHED)
beam.smp 516 nobody  132u  IPv6 2315387      0t0  TCP 172.19.2.194:4000->172.19.2.193:51860 (ESTABLISHED)
beam.smp 516 nobody  133u  IPv6 2420196      0t0  TCP 172.19.2.194:4000->172.19.2.193:34526 (ESTABLISHED)
beam.smp 516 nobody  134u  IPv6 2419124      0t0  TCP 172.19.2.194:4000->172.19.2.193:62902 (ESTABLISHED)
beam.smp 516 nobody  135u  IPv6 2370246      0t0  TCP 172.19.2.194:4000->172.19.2.193:46746 (ESTABLISHED)
beam.smp 516 nobody  136u  IPv6 2370251      0t0  TCP 172.19.2.194:4000->172.19.2.193:46908 (ESTABLISHED)
beam.smp 516 nobody  137u  IPv6 2419756      0t0  TCP 172.19.2.194:4000->172.19.2.193:39558 (ESTABLISHED)
beam.smp 516 nobody  138u  IPv6 2432811      0t0  TCP 172.19.2.194:4000->172.19.2.193:55730 (ESTABLISHED)
beam.smp 516 nobody  139u  IPv6 2340864      0t0  TCP 172.19.2.194:4000->172.19.2.193:33472 (ESTABLISHED)
beam.smp 516 nobody  140u  IPv6 2432620      0t0  TCP 172.19.2.194:4000->172.19.2.193:40462 (ESTABLISHED)
beam.smp 516 nobody  141u  IPv6 2370262      0t0  TCP 172.19.2.194:4000->172.19.2.193:47318 (ESTABLISHED)
beam.smp 516 nobody  142u  IPv6 2432681      0t0  TCP 172.19.2.194:4000->172.19.2.193:44976 (ESTABLISHED)
beam.smp 516 nobody  144u  IPv6 2431651      0t0  TCP 172.19.2.194:4000->172.19.2.193:41706 (ESTABLISHED)
beam.smp 516 nobody  145u  IPv6 2431924      0t0  TCP 172.19.2.194:4000->172.19.2.193:59934 (ESTABLISHED)
beam.smp 516 nobody  146u  IPv6 2431926      0t0  TCP 172.19.2.194:4000->172.19.2.193:59936 (ESTABLISHED)
beam.smp 516 nobody  151u  IPv6 2312661      0t0  TCP 172.19.2.194:4000->172.19.2.193:42548 (ESTABLISHED)

Some decrement counter not being set somewhere?
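
To cross-check, I also pulled the proxy’s concurrency gauge straight from the metrics API (a sketch; I’m assuming the metric is named fly_app_concurrency per the metrics docs, and personal is a placeholder org slug):

# Per-instance concurrency as reported by the Fly proxy (metric name assumed)
curl -G https://api.fly.io/prometheus/personal/api/v1/query \
  -H "Authorization: Bearer $(fly auth token)" \
  --data-urlencode 'query=sum by (instance) (fly_app_concurrency{app="appname-prod"})'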

Any plans for exposing the Grafana API? Specifically, I’d like my application to define my dashboards and to send annotations whenever I deploy, as described in Monitoring Elixir Apps on Fly.io With Prometheus and PromEx · Fly.
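
For context, this is the kind of call I’d want to fire from a deploy script; it works against a self-hosted Grafana today (a sketch; the hostname and $GRAFANA_API_KEY are placeholders):

# Post a global "deploy" annotation via Grafana's HTTP annotations API (host/key are placeholders)
curl -X POST https://grafana.example.com/api/annotations \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["deploy"], "text": "deployed appname-prod"}'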

And I’m a +1 for alerting


Is there an API for retrieving billing and costs? I have a dashboard of hourly costs for some other services, and it would be neat to see hourly cost graphs as apps and other resources are created and destroyed.
