Custom app metrics not making it to grafana cloud

Hi there -

I am trying to use PromEx to instrument my app. I’m following this guide, but I’m missing something. I’ve configured the prometheus datasource in grafana cloud. When I visit https://<my app>.fly.dev/metrics I see my application metrics from PromEx. The default PromEx dashboards get created when my app deploys, but they are just empty. Here’s an abbreviated version of my fly.toml:

app = "myapp"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[metrics]
port = 4000
path = "/metrics"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

I tried using the default config from the docs and I started getting the default fly metrics in my grafana cloud instance:

[metrics]
    port = 9091
    path = "/metrics"

However, reverting to port = 4000 doesn’t seem to do anything: no matter which port I specify, I get the default fly metrics and nothing else. Ideally, I’d have both the default fly metrics and my custom app metrics. Any ideas on how to configure this correctly?

Hi Kevin,

Just to confirm, have you been through this guide? - Feature preview: Custom metrics

Mainly, checking if the relevant app logic is binding to 0.0.0.0, rather than localhost or 127.0.0.1.
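
For a Phoenix app, the bind address normally lives in the endpoint's http options. Here is a minimal sketch, assuming an app called :myapp with an endpoint module MyAppWeb.Endpoint (both placeholder names, not taken from the poster's project):

```elixir
# config/runtime.exs (sketch; :myapp and MyAppWeb.Endpoint are placeholder names)
import Config

config :myapp, MyAppWeb.Endpoint,
  http: [
    # {0, 0, 0, 0} listens on every IPv4 interface, so connections that
    # originate outside the VM can reach the port.
    # {127, 0, 0, 1} would accept loopback connections only.
    ip: {0, 0, 0, 0},
    port: String.to_integer(System.get_env("PORT") || "4000")
  ]
```

An eight-element tuple of zeros, {0, 0, 0, 0, 0, 0, 0, 0}, is the IPv6 form of the same "any address" binding.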

I have been through that guide and this sounds like it could be the issue, although, I’m not quite sure how to bind to 0.0.0.0

If I set my fly.toml to look at port 4000 (where my app is running) like this:

[metrics]
port = 4000
path = "/metrics"

I can see the prometheus scraper trying to get the endpoint but getting redirected because I have force ssl on: [info] Plug.SSL is redirecting GET /metrics to https://zero-staging.fly.dev with status 301.

Has anyone had success following this guide or configuring PromEx w/ fly?

Hi @Kevin_Curtin! I wonder if the problem is around the force_ssl setting.

Can you check and see if this post helps?

@brainlid thanks for the suggestion… I made the change, but I’m still seeing the 301 redirect in the logs, except this time the redirect target is the host IP instead of the fly.dev domain name. In config/prod.exs:

config :myapp, MyAppWeb.Endpoint,
  force_ssl: [rewrite_on: [:x_forwarded_proto], host: nil],
  cache_static_manifest: "priv/static/cache_manifest.json"

Log output:

[info] Plug.SSL is redirecting GET /metrics to https://123.12.1.234 with status 301

I assume you’ve checked out the linked repo? https://github.com/fly-apps/elixir_prom_ex_example/tree/master/todo_list

Are you doing a custom domain and/or custom SSL cert?

No custom domain or cert (yet).

I’ve looked through the repo… the Endpoint config in runtime.exs is slightly different…

  config :zero, ZeroWeb.Endpoint,
    url: [host: "#{app_name}.fly.dev", port: 80],
    http: [
      ip: {0, 0, 0, 0, 0, 0, 0, 0},
      port: String.to_integer(System.get_env("PORT") || "4000")
    ],
    secret_key_base: secret_key_base

Just to summarize where I am at:

  1. Visiting https://myapp.fly.dev/metrics I see all my custom app metrics
  2. When I log in to grafana cloud, I see all of the default fly metrics + the dashboards that are uploaded via PromEx when my app boots
  3. In my logs I see what is presumably the prometheus scraper that fly configures trying to get my custom app metrics at https://myapp.fly.dev/metrics and being redirected because the request isn’t made over ssl

I think the problem may be that the /metrics endpoint can’t work over ssl, so the force ssl redirect breaks our metrics scraper.

There may be a quick workaround for this.

Okay! Try this out and see if it solves it for you. The approach here is to create a separate endpoint that can be internal only to the Fly network. Then the metrics can be safely collected without SSL. Here are the changes. It’s adding a new endpoint, starting it under the supervisor and configuring it.

These changes were made to an internal app named “fizz”. I didn’t change any of the names.

config/dev.exs

config :fizz, FizzWeb.EndpointMetrics,
  # Binding to loopback ipv4 address prevents access from other machines.
  # Change to `ip: {0, 0, 0, 0}` to allow access from other machines.
  http: [ip: {127, 0, 0, 1}, port: 4001]

config/prod.exs

config :fizz, FizzWeb.EndpointMetrics,
  # Bind to `ip: {0, 0, 0, 0}` to allow access from external scraper.
  http: [ip: {0, 0, 0, 0}, port: 4001]

config/runtime.exs

# Start the phoenix server if environment is set and running in a release
if System.get_env("PHX_SERVER") && System.get_env("RELEASE_NAME") do
  config :fizz, FizzWeb.Endpoint, server: true
  config :fizz, FizzWeb.EndpointMetrics, server: true  # <- added
end

fly.toml

[metrics]
  port = 4001
  path = "/metrics"

lib/fizz/application.ex

      # Start the Endpoint (http/https)
      FizzWeb.Endpoint,
      FizzWeb.EndpointMetrics,  # <- added

lib/fizz_web/endpoint.ex

Removed the line: plug PromEx.Plug, prom_ex_module: Fizz.PromEx

lib/fizz_web/endpoint_metrics.ex - new file

defmodule FizzWeb.EndpointMetrics do
  use Phoenix.Endpoint, otp_app: :fizz

  plug PromEx.Plug, prom_ex_module: Fizz.PromEx
end

Hope that helps!

UPDATE:

PromEx lets you specify a standalone metrics_server under the PromEx supervision tree. Read more here:

https://hexdocs.pm/prom_ex/1.6.0/PromEx.Config.html
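
As a rough sketch of that alternative, the standalone server is configured on the PromEx module itself and would take the place of the extra endpoint shown above. The option names below follow the PromEx.Config docs linked above. The port value and the use of cowboy_opts to bind to all interfaces are assumptions for illustration, not something verified against the fizz app:

```elixir
# config/prod.exs (sketch; reuses the :fizz / Fizz.PromEx names from the example above)
import Config

config :fizz, Fizz.PromEx,
  metrics_server: [
    # Must match the port in the [metrics] section of fly.toml.
    port: 4001,
    path: "/metrics",
    protocol: :http,
    pool_size: 5,
    # Assumption: bind to all IPv4 interfaces so the scraper can reach it.
    cowboy_opts: [ip: {0, 0, 0, 0}]
  ]
```

With this in place, PromEx starts the metrics listener under its own supervision tree, so the separate FizzWeb.EndpointMetrics module and its config entries should no longer be needed. Running both on the same port would conflict.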

@brainlid this works great, thanks for your help :raised_hands:

One additional comment… It looks like PromEx lets you specify a standalone metrics_server under the PromEx supervision tree. More info here for anyone who finds this in the future:

https://hexdocs.pm/prom_ex/1.6.0/PromEx.Config.html

Very cool! Thanks for the follow-up extra info too!

I’m unable to get the new metrics_server approach to work.

prom_ex configuration

metrics_server: [
  port: 4021,
  path: "/metrics",
  protocol: :http,
  pool_size: 5,
  cowboy_opts: [ip: {0, 0, 0, 0}]
]

fly.toml

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[metrics]
  port = 4021
  path = "/metrics"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

I’ve confirmed that sshing into the instance and running curl 0.0.0.0:4021/metrics returns the expected metrics. However, they don’t show up in the dashboard or when querying the Fly metrics API directly. The other tough thing about this is that I can’t figure out how to get any logs or insight to help debug it.

Update
I’ve also tried the above multiple endpoints approach with the same result.

I’m also encountering an issue all of a sudden. It seems like the /metrics polling has stopped? Before we saw these requests in the logs but now we don’t see them anymore.

@Matt_Stewart-Ronnisc The only app of yours I see with metrics configured in fly.toml is the log shipper. Which app is that config file from in your post?

@dvic I think your app may be crashing pretty frequently. At least the one I found with metrics configured. Will you run flyctl status --all and see if there’s something up with that one?

Hmm… I don’t see any crashes/restarts? Just to be clear, the app is ********site instance v95 (4c171e63-bee0-fca5-2e4d-c6305c4c0824). The output of flyctl status --all:

@dvic I just looked and confirmed there are metrics for that VM in the metrics DB. Also I guess the others weren’t crashing, I must have seen all those versions 91-94 and misread what was happening. :slight_smile:

What query are you running that’s not working? If I go to the Explore tab of Grafana with my test setup and start typing the app name, it suggests all the series I see being exported.

@kurt I just checked the Explore tab again and now I see the custom app metrics are showing up again. Maybe I was impatient? But the weird thing is that it worked before, and back then I saw the :9000/metrics requests in the logs (which I still don’t see, btw). Oh well :man_shrugging:, it works now :slight_smile: thanks!
