The fly_app_concurrency metric is now more useful

The Fly.io platform comes with a fully-managed metrics solution, which exposes a number of built-in metrics by default for each app. One (arguably very important) metric is fly_app_concurrency: it tells you how many requests/connections each instance of your app is currently serving. This provides insight into not only the load of your app, but also how we route requests to your app, because fly-proxy performs load-balancing based on load, including starting machines as needed.

Unfortunately, this metric has not been particularly useful so far. Two major issues severely limited it:

  1. Long-running connections, such as WebSockets and database connections, often do not get counted. This is an unfortunate side effect of garbage collection done in metrics-exporter-prometheus. GC'ing unused metrics is essential as machines and apps get created and deleted all the time on our platform, but it also means that any app that mostly handles longer connections (longer than a minute or two) will have its load metric reset to 0 very often. If you have ever wondered why your Postgres database has 0 load most of the time, this is why.
  2. An app can expose multiple ports as multiple services, but the metric does not distinguish between them. This makes it nearly impossible to understand how our proxy load-balances requests for each port, and even we have run into this problem multiple times while helping to debug customer apps.

The writer of this Fresh Produce has himself had various occasions where he really hoped this metric actually worked. So, he decided to finally fix this once and for all:

  1. Every time series in the fly_app_concurrency metric is now kept alive for as long as the corresponding machine is started. Therefore, it should be present whenever it is at a non-zero value, and will not be reset to zero even if no new connections have been made to the machine for a while.
  2. There is now a new label called service, so the load of each exposed port in your app is recorded separately. If your app has only a single service per machine, there should not be any change to what you get from your metrics queries (other than the new label, which you can safely ignore). However, when you do have multiple services, you will now get one time series per service. If you would still prefer the old (read: broken) behavior, you may need to perform a sum over the service label.
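
To make "a sum over service" concrete: if you query with PromQL, it would look something like sum without (service) (fly_app_concurrency). The Python sketch below mimics that aggregation on a handful of made-up samples. Only the service label comes from this post; the app and instance label names, the label values, and the helper function sum_without are hypothetical and just there for illustration.

```python
from collections import defaultdict


def sum_without(samples, drop_label):
    """Sum values of all series that differ only in `drop_label`.

    `samples` is a list of (labels_dict, value) pairs. The result maps the
    remaining labels (as a sorted tuple of pairs) to the summed value.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # Group by every label except the one being summed away.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[key] += value
    return dict(totals)


# Made-up samples: label names other than "service" are assumptions.
samples = [
    ({"app": "my-app", "instance": "machine-a", "service": "svc-1"}, 3.0),
    ({"app": "my-app", "instance": "machine-a", "service": "svc-2"}, 5.0),
    ({"app": "my-app", "instance": "machine-b", "service": "svc-1"}, 2.0),
]

# One combined value per machine, as with the old single-series behavior.
per_machine = sum_without(samples, "service")
for labels, value in sorted(per_machine.items()):
    print(dict(labels), value)
```

Summing away only the service label keeps one series per machine, which is roughly what the metric looked like before the change.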

Let us know if this now works better for you!
