Custom app metrics not making it to grafana cloud

Hi there -

I am trying to use PromEx to instrument my app. I’m following this guide, but I’m missing something. I’ve configured the prometheus datasource in grafana cloud. When I visit https://<my app>.fly.dev/metrics I see my application metrics from PromEx. The default PromEx dashboards get created when my app deploys, but they are just empty. Here’s an abbreviated version of my fly.toml:

app = "myapp"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[metrics]
port = 4000
path = "/metrics"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

I tried using the default config from the docs and I started getting the default fly metrics in my grafana cloud instance:

[metrics]
    port = 9091
    path = "/metrics"

However, reverting to port = 4000 doesn’t seem to do anything: no matter which port I specify, I get the default fly metrics and nothing else. Ideally, I’d have both the default fly metrics and my custom app metrics. Any ideas on how to configure it correctly?

Hi Kevin,

Just to confirm, have you been through this guide? - Feature preview: Custom metrics

Mainly, checking if the relevant app logic is binding to 0.0.0.0, rather than localhost or 127.0.0.1.

I have been through that guide and this sounds like it could be the issue, although, I’m not quite sure how to bind to 0.0.0.0
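For anyone else wondering what binding to 0.0.0.0 looks like in a Phoenix app, here is a minimal sketch. The app name :my_app and the module MyAppWeb.Endpoint are placeholders, and the port assumes the app listens on 4000 as in the fly.toml above. Treat it as an illustration, not the exact config from the guide.

```elixir
import Config

# :my_app and MyAppWeb.Endpoint are placeholder names; use your own.
config :my_app, MyAppWeb.Endpoint,
  http: [
    # {0, 0, 0, 0} listens on every IPv4 interface.
    # {127, 0, 0, 1} would only accept connections from the same machine.
    ip: {0, 0, 0, 0},
    port: 4000
  ]
```

The 8-element form `{0, 0, 0, 0, 0, 0, 0, 0}` is the IPv6 equivalent of "all interfaces".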

If I set my fly.toml to look at port 4000 (where my app is running) like this:

[metrics]
port = 4000
path = "/metrics"

I can see the prometheus scraper trying to get the endpoint but getting redirected because I have force ssl on: [info] Plug.SSL is redirecting GET /metrics to https://zero-staging.fly.dev with status 301.

Has anyone had success following this guide or configuring PromEx w/ fly?

Hi @Kevin_Curtin! I wonder if the problem is around the force_ssl setting.

Can you check and see if this post helps?

@Mark thanks for the suggestion… I made the change, but I’m still seeing the 301 redirect in the logs, except this time the redirect target is the host IP instead of the fly.dev domain name. Here’s what I have in config/prod.exs:

config :myapp, MyAppWeb.Endpoint,
  force_ssl: [rewrite_on: [:x_forwarded_proto], host: nil],
  cache_static_manifest: "priv/static/cache_manifest.json"

Log output:

[info] Plug.SSL is redirecting GET /metrics to https://123.12.1.234 with status 301

I assumed you’ve checked out the linked repo? https://github.com/fly-apps/elixir_prom_ex_example/tree/master/todo_list

Are you doing a custom domain and/or custom SSL cert?

No custom domain or cert (yet).

I’ve looked through the repo… the Endpoint config in runtime.exs is slightly different…

  config :zero, ZeroWeb.Endpoint,
    url: [host: "#{app_name}.fly.dev", port: 80],
    http: [
      ip: {0, 0, 0, 0, 0, 0, 0, 0},
      port: String.to_integer(System.get_env("PORT") || "4000")
    ],
    secret_key_base: secret_key_base

Just to summarize where I am at:

  1. Visiting https://myapp.fly.dev/metrics I see all my custom app metrics
  2. When I log in to grafana cloud, I see all of the default fly metrics + the dashboards that are uploaded via PromEx when my app boots
  3. In my logs I see what is presumably the prometheus scraper that fly configures trying to get my custom app metrics at https://myapp.fly.dev/metrics and being redirected because the request isn’t made over ssl

I think the problem may be that the /metrics endpoint can’t work over ssl, so the force ssl redirect breaks our metrics scraper.

There may be a quick workaround for this.

Okay! Try this out and see if it solves it for you. The approach here is to create a separate endpoint that can be internal only to the Fly network. Then the metrics can be safely collected without SSL. Here are the changes. It’s adding a new endpoint, starting it under the supervisor and configuring it.

These changes were made to an internal app named “fizz”. I didn’t change any of the names.

config/dev.exs

config :fizz, FizzWeb.EndpointMetrics,
  # Binding to loopback ipv4 address prevents access from other machines.
  # Change to `ip: {0, 0, 0, 0}` to allow access from other machines.
  http: [ip: {127, 0, 0, 1}, port: 4001]

config/prod.exs

config :fizz, FizzWeb.EndpointMetrics,
  # Bind to `ip: {0, 0, 0, 0}` to allow access from external scraper.
  http: [ip: {0, 0, 0, 0}, port: 4001]

config/runtime.exs

# Start the phoenix server if environment is set and running in a release
if System.get_env("PHX_SERVER") && System.get_env("RELEASE_NAME") do
  config :fizz, FizzWeb.Endpoint, server: true
  config :fizz, FizzWeb.EndpointMetrics, server: true  # <- added
end

fly.toml

[metrics]
  port = 4001
  path = "/metrics"

lib/fizz/application.ex

      # Start the Endpoint (http/https)
      FizzWeb.Endpoint,
      FizzWeb.EndpointMetrics,  # <- added

lib/fizz_web/endpoint.ex

Removed the line: plug PromEx.Plug, prom_ex_module: Fizz.PromEx

lib/fizz_web/endpoint_metrics.ex - new file

defmodule FizzWeb.EndpointMetrics do
  use Phoenix.Endpoint, otp_app: :fizz

  plug PromEx.Plug, prom_ex_module: Fizz.PromEx
end

Hope that helps!

UPDATE:

PromEx lets you specify a standalone metrics_server under the PromEx supervision tree. Read more here:

https://hexdocs.pm/prom_ex/1.6.0/PromEx.Config.html

@Mark this works great, thanks for your help :raised_hands:

One additional comment… It looks like PromEx lets you specify a standalone metrics_server under the PromEx supervision tree, more info here for anyone in future:

https://hexdocs.pm/prom_ex/1.6.0/PromEx.Config.html
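As a rough sketch of what that could look like: the app name :my_app and the module MyApp.PromEx below are placeholders, the port is arbitrary, and the option names are the ones listed in the PromEx.Config docs linked above. Check those docs for your PromEx version before relying on it.

```elixir
import Config

# :my_app and MyApp.PromEx are placeholder names; use your own.
config :my_app, MyApp.PromEx,
  metrics_server: [
    # Arbitrary port; it has to match `port` under [metrics] in fly.toml.
    port: 4021,
    path: "/metrics",
    protocol: :http,
    pool_size: 5,
    # Extra options handed to Plug.Cowboy; here, bind on all IPv4 interfaces.
    cowboy_opts: [ip: {0, 0, 0, 0}]
  ]
```

The idea is the same as the separate endpoint above: metrics are served on their own port, away from the main endpoint and its force_ssl redirect.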

Very cool! Thanks for the follow-up extra info too!

I’m unable to get the new metrics_server approach to work.

prom_ex configuration

metrics_server: [
  port: 4021,
  path: "/metrics",
  protocol: :http,
  pool_size: 5,
  cowboy_opts: [ip: {0, 0, 0, 0}]
]

fly.toml

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[metrics]
  port = 4021
  path = "/metrics"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

I’ve confirmed that sshing into the instance and running curl 0.0.0.0:4021/metrics returns the expected metrics. However, they don’t show up in the dashboard or when querying the Fly metrics API directly. The other tough thing about this is that I can’t figure out how to get any logs or insight for debugging it.

Update
I’ve also tried the above multiple endpoints approach with the same result.

I’m also encountering an issue all of a sudden. It seems like the /metrics polling has stopped? Before we saw these requests in the logs but now we don’t see them anymore.

@Matt_Stewart-Ronnisc The only app of yours I see with metrics configured in fly.toml is the log shipper. Which app is that config file from in your post?

@dvic I think your app may be crashing pretty frequently. At least the one I found with metrics configured. Will you run flyctl status --all and see if there’s something up with that one?

Hmm… I don’t see any crashes/restarts? Just to be clear, the app is ********site instance v95 (4c171e63-bee0-fca5-2e4d-c6305c4c0824). The output of flyctl status --all:

@dvic I just looked and confirmed there are metrics for that VM in the metrics DB. Also I guess the others weren’t crashing, I must have seen all those versions 91-94 and misread what was happening. :slight_smile:

What query are you running that’s not working? If I go to the Explore tab of Grafana with my test setup and start typing the app name, it recommends all the series I see being exported.

@kurt I just checked the Explore tab again and now I see the custom app metrics are showing up again. Maybe I was impatient? But the weird thing is it worked before and I saw :9000/metrics request in the logs (which I still don’t see btw). Oh well :man_shrugging:, it works now :slight_smile: thanks!