Issue with configuration of custom prometheus metrics

Hello,

I am building a Django application, with web and worker processes configured in the same Fly application. Following the metrics documentation, I set up custom Prometheus metrics on the “/m/metrics” path of my application and configured the metrics section of fly.toml as follows:

[metrics]
port = 8000
path = "/m/metrics"

When I access the HTTP route directly, I can see the Prometheus response.
But when I request the Prometheus metrics via the Fly API using the following script, I get nothing:

import httpx
import os

# Configuration
FLY_API_TOKEN = os.getenv("FLY_API_TOKEN")  # I use a personal read only token created with the command "fly tokens create readonly my-org"
ORG_SLUG = "XXX" 
BASE_URL = f"https://api.fly.io/prometheus/{ORG_SLUG}"

# Get the size of the queue
query = 'dramatiq_queue_messages{queue="default",app="XXX"}'
# query = 'sum(increase(fly_edge_http_responses_count))'

headers = {
    "Authorization": FLY_API_TOKEN
}

with httpx.Client(base_url=BASE_URL, headers=headers) as client:
    response = client.get("/api/v1/query", params={"query": query})
    data = response.json()
    print(data)

    if data["status"] == "success" and data["data"]["result"]:
        queue_size = data["data"]["result"][0]["value"][1]
        print(f"Size of the 'default' queue: {queue_size}")
    else:
        print("Nothing found")

Here is the response

{'status': 'success', 'isPartial': False, 'data': {'resultType': 'vector', 'result': []}, 'stats': {'seriesFetched': '0', 'executionTimeMsec': 14}}
Nothing found

Does anyone have any idea what the potential issue I’m facing might be?

Hm… I don’t know the Python ecosystem very well, but is there a way to get client to dump the full URL that it constructed?

Also, were you having any better luck with the standard metrics? (I see one commented out up above.) Maybe try query = 'fly_instance_up' for something super-simple.

(Start a Machine a minute or so beforehand and keep it running the entire time.)

Finally, it might help if we forum readers could see the actual curl -i http://localhost:8000/m/metrics response (when that’s tried from within a Machine of each process group) as well as fly m status -d for each Machine that you want to be scraped for metrics.

Hi mayailurus,
If I try the query fly_instance_up, here is what I get

{'status': 'success', 'isPartial': False, 'data': {'resultType': 'vector', 'result': [{'metric': {'__name__': 'fly_instance_up', 'app': 'faili', 'host': '1589', 'instance': '6832e71f736668', 'region': 'cdg'}, 'value': [1769517125, '1']}, {'metric': {'__name__': 'fly_instance_up', 'app': 'faili', 'host': '2370', 'instance': 'e822054b737998', 'region': 'cdg'}, 'value': [1769517125, '1']}]}, 'stats': {'seriesFetched': '2', 'executionTimeMsec': 20}}

The URL is correct, this is what I have when I inspect the client https://api.fly.io/prometheus/XXX/api/v1/query?query=fly_instance_up

When I browse the URL /m/metrics, I get something like this:

# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 8.68
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 25.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 10240.0
# HELP dramatiq_queue_messages Number of messages in Dramatiq queue
# TYPE dramatiq_queue_messages gauge
dramatiq_queue_messages{app="faili",queue="default"} 0.0
# HELP dramatiq_delayed_messages Number of delayed messages in Dramatiq
# TYPE dramatiq_delayed_messages gauge
dramatiq_delayed_messages{app="faili",queue="default"} 0.0

Using fly m status -d on my web servers machines, I get something like this

{                                                                                                                                                                                  
  "env": {                                                                                                                                                                         
    "DJANGO_SETTINGS_MODULE": "tme_woudar_bot.settings",                                                                                                                           
    "FAILI_CACHE_BACKEND": "django.core.cache.backends.redis.RedisCache",                                                                                                          
    "FAILI_DEBUG": "false",                                                                                                                                                        
    "FAILI_FLY_VOLUME": "/data",                                                                                                                                                   
    "FLY_PROCESS_GROUP": "web",
    "PORT": "8000",
    "PRIMARY_REGION": "cdg",
    "SENTRY_ENVIRONMENT": "production"
  },
  "init": {
    "cmd": [
      "granian",
      "--interface",
      "wsgi",
      "tme_woudar_bot.wsgi:application",
      "--host",
      "0.0.0.0",
      "--port",
      "8000"
    ]
  },
  "guest": {
    "cpu_kind": "shared",
    "cpus": 2,
    "memory_mb": 2048
  },
  "metadata": {
    "fly_builder_id": "e82d623b17dd18",
    "fly_flyctl_version": "0.4.0",
    "fly_platform_version": "v2",
    "fly_process_group": "web",
    "fly_release_id": "rel_3d092nnx44r32jxm",
    "fly_release_version": "74"
  },
  "mounts": [
    {
      "encrypted": true,
      "path": "/data",
      "size_gb": 1,
      "volume": "vol_vdmene0gygjwqwwv",
      "name": "faili_data"
    }
  ],
  "services": [
    {
      "protocol": "tcp",
      "internal_port": 8000,
      "autostop": true,
      "autostart": true,
      "min_machines_running": 0,
      "ports": [
        {
          "port": 80,
          "handlers": [
            "http"
          ],
          "force_https": true
        },
        {
          "port": 443,
          "handlers": [
            "http",
            "tls"
          ]
        }
      ],
      "checks": [
        {
          "type": "http",
          "interval": "30s",
          "timeout": "10s",
          "grace_period": "10s",
          "method": "GET",
          "path": "/ht/?format=json",
          "headers": [
            {
              "name": "Host",
              "values": [
                "faili-proxy.fly.dev"
              ]
            }
          ]
        }
      ],
      "concurrency": {
        "type": "requests",
        "hard_limit": 50,
        "soft_limit": 25
      },
      "force_instance_key": null
    }
  ],
  "statics": [
    {
      "guest_path": "/app/staticfiles",
      "url_prefix": "/static",
      "tigris_bucket": "",
      "index_document": ""
    }
  ],
  "image": "registry.fly.io/faili:deployment-01KFVJPFYXYJ07EZRQ2N1ZC5AB",
  "restart": {
    "policy": "on-failure",
    "max_retries": 10
  }
}

Thank you for your time :slight_smile:

Thanks for the details… The fly m status -d output is missing the "metrics":{...} section, which would need to be there if that Machine was going to be scraped for metrics.

Maybe try changing the stanza in fly.toml to the following:

[[metrics]]  # double brackets now.
  port = 8000
  path = "/m/metrics"
  processes = ["web"]

Aside: Other users are reporting delays in scraping and/or propagation of metrics today, so additional patience may be needed.

Thanks mayailurus,
I’m now able to see the metrics configuration when running the fly m status -d command.
But even now, I can’t see the Prometheus data via the Fly HTTP endpoint :frowning:
I still get this:

{'status': 'success', 'isPartial': False, 'data': {'resultType': 'vector', 'result': []}, 'stats': {'seriesFetched': '0', 'executionTimeMsec': 12}}

Hm… I created a small test Machine in cdg, and its custom metrics have been getting scraped without any problems. (It looks like there is continued woe over in nrt and sin, though.)

Were the Machines in your web process group all stopped at the time? (You have scale-to-zero enabled, it appears.)

The dramatiq_queue_messages{queue="default",app="faili"} query won’t return anything if all of those Machines have been off beyond a certain horizon (which is ~30 seconds, from what I’ve seen).

I would try starting one of the web Machines and keeping it running, waiting at least a minute before attempting the query.

Alternatively, dramatiq_queue_messages[10d] will fetch the previous 10 days…

Hi mayailurus,
When I try your query dramatiq_queue_messages[10d], I get a request error:
{'status': 'error', 'errorType': '422', 'error': 'error when executing query="dramatiq_queue_messages[10d]{queue=\\"default\\",app=\\"faili\\"}" for (time=1770064782230, step=300000): unparsed data left: "{queue=\\"default\\",app=\\"faili\\"}"'}

But I found another way to query over several days using the range queries api. I end up with something like this

import os
from datetime import datetime, timedelta

import httpx

FLY_TOKEN = os.getenv("FLY_API_TOKEN")
ORG_SLUG = "XXX"
BASE_URL = f"https://api.fly.io/prometheus/{ORG_SLUG}"

end_time = datetime.now()
start_time = end_time - timedelta(days=3)

query = "dramatiq_queue_messages"
params = {
    "query": query,
    "start": start_time.timestamp(),  # Unix timestamps are accepted directly
    "end": end_time.timestamp(),
    "step": "5m",  # interval between points (5 minutes)
}

headers = {
    "Authorization": FLY_TOKEN
}

def get_metrics():
    with httpx.Client() as client:
        response = client.get(
            f"{BASE_URL}/api/v1/query_range",
            params=params,
            headers=headers,
            timeout=30.0
        )
        response.raise_for_status()
        return response.json()

And again I see nothing.

Yes, you are right. Even if I run it for more than one minute and then query the metrics, there is nothing.

{'status': 'success', 'isPartial': False, 'data': {'resultType': 'matrix', 'result': []}, 'stats': {'seriesFetched': '0', 'executionTimeMsec': 10}}

I don’t know what I’m missing :cry:

Hm… That query should have omitted the part in the curly braces ({}); i.e., it should have ended with [10d].

That said, query_range would have accomplished the same thing…
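For completeness, PromQL puts label matchers before the range selector, so the filtered 10-day form (with the label values from your earlier query) would be:

```
dramatiq_queue_messages{queue="default",app="faili"}[10d]
```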

Can you add a logging statement to /m/metrics? Basically write to stdout (with flush) every time there’s a request. I see a visit every 15 seconds to my cdg test Machine.
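Since the app is served by granian as a WSGI application, a tiny middleware wrapped around the app in wsgi.py would do it — a sketch only; the class name and path prefix here are illustrative, not from your code:

```python
import time


class ScrapeLogger:
    """Log every hit on the metrics path; wrap it around the WSGI app."""

    def __init__(self, app, watch_path="/m/metrics"):
        self.app = app
        self.watch_path = watch_path

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(self.watch_path):
            # flush=True so the line shows up immediately in `fly logs`
            print(f"{time.strftime('%H:%M:%S')} metrics scrape: {path}", flush=True)
        return self.app(environ, start_response)


# e.g. in wsgi.py:  application = ScrapeLogger(get_wsgi_application())
```

If the scraper is reaching the Machine, you should see a line roughly every 15 seconds.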

Also, try the requests to api.fly.io with curl, like in the official docs. (If you run Windows locally, SSH into a Fly.io Machine, install curl inside, and then query api.fly.io from there instead.)

(You should omit Bearer these days, if your token begins with FlyV1.)


On a different note… Your /m/metrics is serving a 301 redirect to /m/metrics/, which I’m not sure is allowed. The health checker is officially documented as not following redirects, so there might be a similar restriction on the metrics scraper :thinking:.
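If that redirect comes from Django’s APPEND_SLASH behavior (my assumption, not tested), one cheap experiment is pointing the scraper at the trailing-slash path directly:

```toml
[[metrics]]
  port = 8000
  path = "/m/metrics/"  # trailing slash, so the scraper never sees the 301
  processes = ["web"]
```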


If all that fails, then I’d create a small test app that only does metrics—without the complications of process groups, auto-stop, etc.—and put it in ewr.

(The cheapest region.)

Finally, it should be possible to see custom metrics with the built-in Grafana instance, although you have to fiddle a bit with Explore and type in your own query. (It’s not the easiest of interfaces, unfortunately.)