Thanks for the request. No ETA, but this is definitely something we can look into. We do have some support for multi-process apps for these kinds of edge-cases, and I could imagine the need to scrape from multiple metrics endpoints in those scenarios.
Just pushed a quick fix, thanks for catching this!
Though I agree it would be useful, allowing users to set up datasources for external instances would make the service more complicated to manage. For now we still recommend using your own Grafana instance for external datasources or further customization along those lines, if only to keep the managed service simple and narrowly focused.
We only added Prometheus to offer a simple service at first, but we’re looking into ways we might eventually expand this to add a built-in datasource for app logs, or maybe even our GraphQL API.
On the “fly-app” dashboard: the network-io graph uses “instance + region” as its label. That is more useful than the bare “instance” label shown on all the other graphs, and it can be used to tune a region.
On the “fly-edge” dashboard, I made a table to show where the edge traffic was going out vs where the instances were. The heatmap and data-in/data-out indicated that there is interesting data, but it was hard to put together. I used these two queries to find out that I didn’t pick the optimal regions. Something like this might be handy on the fly-app dashboard, too. For instance, I have AMS as one of my regions, but the majority of the traffic is exiting out of FRA, etc. I think the “heatmap” on fly-app could be more useful as a table perhaps.
label_uppercase(sum(rate(fly_edge_data_out{app="appname-prod"}[$__range]))by(region), "region") or 0
label_uppercase(sum(rate(fly_instance_net_sent_bytes{app="appname-prod"}[$__range]))by(region), "region") or 0
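To read the two query results side by side, something like the following Python sketch could help. It assumes you have already pulled each query's per-region values into a plain dict (region name to bytes per second). The function names and the numbers in the usage example are hypothetical, made up for illustration only.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, float, float]


def compare_regions(edge_out: Dict[str, float],
                    instance_out: Dict[str, float]) -> List[Row]:
    """Build rows of (region, edge bytes/s, instance bytes/s, edge share).

    Rows are sorted by edge traffic, highest first.
    """
    regions = sorted(set(edge_out) | set(instance_out))
    # Guard against division by zero when there is no edge traffic at all.
    edge_total = sum(edge_out.values()) or 1.0
    rows = []
    for region in regions:
        edge = edge_out.get(region, 0.0)
        inst = instance_out.get(region, 0.0)
        rows.append((region, edge, inst, edge / edge_total))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def uncovered_regions(edge_out: Dict[str, float],
                      instance_out: Dict[str, float]) -> List[str]:
    """Regions where edge traffic exits but no instance is sending data."""
    return [region
            for region, edge, inst, _ in compare_regions(edge_out, instance_out)
            if edge > 0 and inst == 0]


if __name__ == "__main__":
    # Hypothetical values, not real measurements.
    edge = {"FRA": 300.0, "AMS": 100.0}
    instances = {"AMS": 400.0}
    for region, e, i, share in compare_regions(edge, instances):
        print(f"{region}: edge={e:.0f} B/s instance={i:.0f} B/s share={share:.0%}")
    print("no instance in:", uncovered_regions(edge, instances))
```

A region that shows up in `uncovered_regions` is one where traffic leaves the edge but no instance runs, which is the AMS vs FRA situation described above.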
OK, the heatmap on the fly-app dashboard is actually that, but I didn’t understand how to read it. The circles are the instances and then the heatmap itself shows where the data is exiting from the edges.
There’s now a tooltip on the Data Out map for better contextual info (and to help distinguish between instance and edge; I agree it’s a bit hard to understand). I’ve also squeezed in a table displaying the same data.
Not exactly sure what happened, but there’s something going on. For a while, none of the pre-configured dashboards were available and now there seems to be an old version without the region labels on some of the graphs.
@wjordan do you have any resources on how to configure/connect to the Prometheus on Fly datasource for our custom Grafana instance we’re already running on Fly?
The “Metrics on Fly” page in the Fly Docs should cover connecting a custom Grafana instance to the built-in Prometheus datasource. In short, you connect to https://api.fly.io/prometheus/<org-slug>/, passing your access token in an Authorization: Bearer <token> request header.
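For anyone who wants to check the connection outside Grafana first, here is a minimal Python sketch of the same request. The base URL and the Authorization header come from the post above. The `/api/v1/query` path is an assumption that the endpoint exposes the standard Prometheus HTTP API under that base URL, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base URL as given in the post above.
BASE_URL = "https://api.fly.io/prometheus"


def build_query_request(org_slug: str, token: str,
                        query: str) -> urllib.request.Request:
    """Build an instant-query request with a bearer token.

    Assumption: the standard Prometheus path /api/v1/query is
    available under the per-organization base URL.
    """
    params = urllib.parse.urlencode({"query": query})
    url = f"{BASE_URL}/{org_slug}/api/v1/query?{params}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})


def run_query(org_slug: str, token: str, query: str) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_query_request(org_slug, token, query)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `run_query("<org-slug>", "<token>", "up")` with your own organization slug and access token should return a JSON document if the assumption about the path holds. In Grafana itself, the equivalent setup would be a Prometheus datasource pointing at the base URL with a custom Authorization header.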
I created a new organization, and it was impossible to switch to that organization until I followed these secret, forum-only instructions to log out and log back in.
Thanks for reporting this issue! Organizations are currently only synchronized with the Grafana service when an access token is created/updated (every 2 hours), so there is an unfortunate delay if you add an org while already signed into Grafana. We should be able to handle this edge-case better with a bit of work.
A sign-out link in Grafana is also still in the works; hitting the logout URL manually is just a workaround until then.
We’ve had some discussions. I understand how useful it would be, and it’s something we would love to add eventually, but it adds an extra layer of complexity to the service, and sorting out all the details will take some time. So no promises or ETAs, but we’ll see how it goes!
Will there be an SLO for durability of the dashboards when this leaves preview? I had set up a few dashboards at one point and they got deleted (org ID 31932 and dashboard slugs HdfNhGW4z and TdXRjVZ4z created on ~August 21 and lost on ~August 23?).
While we don’t currently have published SLAs, it is a goal of this feature for customers to be able to create their own custom dashboards on this managed instance, and for those dashboards to persist. Data is stored on a Fly Volume so durability expectations are consistent with that feature.
That said, while this service is still in preview things are being changed around quickly, and we’re still working through some application-level bugs/issues that can impact custom settings or dashboard data. On Aug 27-28, a bug in the dashboard-provisioning logic caused all existing dashboards to get unexpectedly deleted, and we didn’t recover this lost data from the daily backups in time. Apologies to you and any other early-adopters who lost work as a result. Making sure something like that doesn’t happen again is a top priority for this feature.
On the 22nd, for some reason all the VMs moved to SJC. I restarted them and as you can see the concurrency has increased quite a bit. However, it is not due to traffic, nor does the open socket count on the application reflect this. For instance, there are not 150 sockets open on f5c0bfef-dfw.
Is there an API for retrieving billing and costs? I have a dashboard for our hourly costs for some other services, and it would be neat to see hourly cost graphs as apps and things are created and destroyed.