Grafana Tempo/Agent

Hello,

Has anybody had any experience with setting up a Grafana Agent on fly.io? I’m looking to start collecting traces in Grafana Tempo.

Best regards,
Alex


Hi @alex2

We are using the Grafana Agent for tracing and metrics. We also used to use it for logging, but have since moved to fly-log-shipper so that we also capture log messages generated by Fly infrastructure.

In our current setup, every node runs its own instance of the agent alongside our app, and this has worked fine for us so far. As we scale up, though, running the agent on every node is starting to hit Grafana Cloud’s rate limits for tracing.

That setup should work for you too if you’re running a small number of servers. We’re looking at running dedicated servers for the agents for tracing, and investigating what to do for metrics: Fly does offer its own metrics collector, but currently there are tradeoffs.

If you have any specific questions let me know and I’ll try my best to answer them.

Just curious what your Dockerfile and agent yaml look like? Trying unsuccessfully to get env variables working in the agent yaml.

Hi @chasers

We have since moved to running dedicated apps for the grafana agent.

The following is for the app that runs our grafana agent for tempo.

Dockerfile


FROM grafana/agent:latest

# Add our Grafana Cloud Agent config file
COPY ./src/agent.yaml /etc/agent/agent.yaml

CMD ["--config.file=/etc/agent/agent.yaml", "--metrics.wal-directory=/etc/agent/data", "--config.expand-env=true"]

agent.yaml

server:
    log_level: warn
    http_listen_port: 12345

traces:
    configs:
        - name: default
          receivers:
              otlp:
                  protocols:
                      http:
          remote_write:
              - endpoint: tempo-us-central1.grafana.net:443
                basic_auth:
                    username: ${TEMPO_USERNAME}
                    password: ${TEMPO_PASSWORD}
          batch:
              timeout: 15s
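              # NOTE: the OpenTelemetry batch processor expects send_batch_max_size
              # to be 0 or >= send_batch_size; adjust these if the agent rejects the config.
              send_batch_size: 10000
              send_batch_max_size: 256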

Note the extra param --config.expand-env=true, which is required for using environment variables inside the agent.yaml file.
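Since unset variables are an easy way for this to go wrong, a small pre-deploy check can help. The following is only a sketch, not part of the agent: the function name `missing_env_refs` is hypothetical, and it assumes the config references variables only in the `${NAME}` form shown in agent.yaml above. The agent does its own expansion and may treat edge cases differently.

```python
import os
import re

# Matches ${NAME} references, the form used in agent.yaml above.
_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_refs(config_text, environ=None):
    """Return the sorted names referenced as ${NAME} that are unset or empty.

    `environ` defaults to the current process environment; pass a dict
    to check against a different set of values.
    """
    env = os.environ if environ is None else environ
    names = set(_REF.findall(config_text))
    return sorted(name for name in names if not env.get(name))
```

For example, reading src/agent.yaml and calling `missing_env_refs(text)` inside the container (or locally with the same secrets exported) should return an empty list before you deploy.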

Thanks!!

Yeah, I was doing this and getting an error message that it couldn’t find the config.file argument. I don’t work with Docker a ton, so I’m sure I was doing something wrong.

The run.sh hack ended up working for me though.

I think this will be helpful for others.

Also, I had to delete everything in the [[services]] section of fly.toml before the deployment would go green. It’s not serving any traffic anyway. The instance would stay up, it just didn’t look healthy.

Curious what your fly.toml looks like for your Grafana Agent. I’m using flow mode and pointing at Grafana Cloud. I can bring up the debug interface, and all looks good, but I’m getting nxdomain (domain not found) errors from my apps trying to access it at [app_name].internal:4317

I think my problem may simply be in not understanding how traffic is routed internally. Shouldn’t all ports be accessible to other apps in my org w/out having to configure services sections?
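One way to narrow this down is to separate "the name doesn't resolve" from "the name resolves but nothing accepts the connection". The sketch below is a hypothetical helper (`check_internal` is not a Fly or Grafana tool) and rests on my understanding of Fly private networking: `.internal` names resolve to IPv6 addresses, and only for machines that are currently started, so a stopped agent can look like a missing domain, and a listener bound only to an IPv4 address may not accept the connection.

```python
import socket


def check_internal(host, port, resolve=socket.getaddrinfo,
                   connect=socket.create_connection, timeout=3.0):
    """Return (status, detail) for an IPv6 lookup plus a TCP connect attempt.

    status is one of "dns_failed", "connect_failed", or "ok".
    `resolve` and `connect` are injectable so the logic can be tested offline.
    """
    try:
        infos = resolve(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return ("dns_failed", str(exc))

    # Each getaddrinfo entry ends with a sockaddr tuple; item 0 is the address.
    addresses = sorted({info[4][0] for info in infos})
    last_error = "no addresses returned"
    for addr in addresses:
        try:
            conn = connect((addr, port), timeout=timeout)
        except OSError as exc:
            last_error = f"{addr}: {exc}"
            continue
        conn.close()
        return ("ok", addr)
    return ("connect_failed", last_error)
```

Running `check_internal("<grafana-agent-name>.internal", 4317)` from a machine in the same org would be the intended use. A "dns_failed" result points at the name or the agent machines not running, while "connect_failed" points at the listener address or port.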

@charleysa @chasers

I gave this a shot over the last 3 days, moderate success.

I say moderate success because I would have preferred to have fly-log-shipper push traces to Tempo, or have the grafana-agent push logs to Loki.

For Fly Log Shipper (Vector) to ship traces, we would need to see two vectordotdev/vector issues on GitHub implemented: “Support ingesting OpenTelemetry traces” (Issue #17307) and “Support sending OpenTelemetry traces” (Issue #17308).

For the Grafana Agent to ship logs, someone would need to create two Grafana Agent components, one for Fly service discovery and one for NATS log stream ingestion, both of which come out of the box with fly-log-shipper and Vector.

Until either of those problems is solved, I will need to run two separate apps for pushing traces and logs to Grafana Cloud.

Anyway, here’s my set-up:

Metrics

I have Prometheus metrics shipping to Grafana Cloud, following the Fly Docs page “Metrics on Fly.io”.

Logs

Hosting the fly-log-shipper pushing to Loki on Grafana cloud:

# fly.toml
app = "<your-log-shipper-name>"
primary_region = "<region>"

[build]
image = "ghcr.io/superfly/fly-log-shipper:v0.0.9"

[http_service]
internal_port = 8686
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 1
processes = ["app"]

[[services]]
http_checks = []
internal_port = 8686

[env]
ORG = "<your-fly-org>"
LOKI_URL = "https://logs-<your-grafana-zone>.grafana.net"
LOKI_USERNAME = "<123456>"
# fly secrets set --stage ACCESS_TOKEN=$(fly auth token)
# fly secrets set --stage LOKI_PASSWORD= ****

# Set the machine restart policy to "always" so it doesn't turn off
# fly machine list -a $FLY_APP_NAME -j \
#  | jq -r '.[].id' \
#  | xargs -I {} fly machine update {} --restart always -y

Traces

And finally, hosting the Grafana Agent on Fly to push traces to Tempo on Grafana cloud.

See deployment config below:

# fly deploy -c dev.fly.toml

app = "<grafana-agent-name>"
primary_region = "<region>"

[build]
dockerfile = "Dockerfile"

[[services]]
internal_port = 12345
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0

[[services]]
internal_port = 4317       # grpc
protocol = "tcp"
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0

[[services]]
internal_port = 4318       # http
protocol = "tcp"
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0


[env]
# API key generated above, e.g. eyJvSomeLongStringJ9fQ==
GRAFANA_CLOUD_API_KEY = "" # fly secrets set GRAFANA_CLOUD_API_KEY=eyJvSomeLongStringJ9fQ==

# Remote Write Endpoint from Grafana Cloud > Tempo > Details, e.g. "tempo-prod-***.grafana.net:443"
GRAFANA_CLOUD_TEMPO_ENDPOINT = "tempo-prod-16-<your-grafana-zone>.grafana.net:443"

# Username/Instance ID from the Grafana Cloud > Tempo > Details, eg. 11111
GRAFANA_CLOUD_TEMPO_USERNAME = "11111"

# Dockerfile
FROM grafana/agent:v0.38.0

COPY ./config.river /etc/agent/config.river

ENV AGENT_MODE="flow"

CMD ["run", "--server.http.listen-addr=0.0.0.0:12345", "/etc/agent/config.river"]

// config.river
otelcol.receiver.otlp "default" {
	// https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.receiver.otlp/

	// configures the default grpc endpoint "0.0.0.0:4317"
	grpc { }
	// configures the default http/protobuf endpoint "0.0.0.0:4318"
	http { }

	output {
		traces  = [otelcol.processor.batch.default.input]
	}
}

otelcol.processor.batch "default" {
	// https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.processor.batch/
	output {
		traces  = [otelcol.exporter.otlp.grafana_cloud_tempo.input]
	}
}


otelcol.exporter.otlp "grafana_cloud_tempo" {
	// https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.exporter.otlp/
	client {
		endpoint = env("GRAFANA_CLOUD_TEMPO_ENDPOINT")
		auth     = otelcol.auth.basic.grafana_cloud_tempo.handler
	}
}

otelcol.auth.basic "grafana_cloud_tempo" {
  // https://grafana.com/docs/agent/latest/flow/reference/components/otelcol.auth.basic/
  username = env("GRAFANA_CLOUD_TEMPO_USERNAME")
  password = env("GRAFANA_CLOUD_API_KEY")
}

logging {
  level  = "debug"
  format = "logfmt"
}
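
To check that the receiver on port 4318 is reachable from another app, it can help to send one hand-built span before wiring up a full SDK. This is a rough sketch using only the Python standard library. The function names, the service name, and the span name are made up for illustration, and it assumes the agent's OTLP HTTP receiver accepts the JSON encoding of OTLP at /v1/traces, which is my reading of the OTLP/HTTP spec rather than something confirmed in this thread.

```python
import json
import secrets
import time
import urllib.request


def build_test_span(service_name="otlp-smoke-test", span_name="hello-from-fly"):
    """Build a minimal OTLP/JSON trace payload containing a single span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    # OTLP/JSON uses hex strings for IDs: 16 and 8 bytes.
                    "traceId": secrets.token_hex(16),
                    "spanId": secrets.token_hex(8),
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    # 64-bit timestamps are sent as decimal strings.
                    "startTimeUnixNano": str(now - 1_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }],
    }


def send_test_span(base_url, timeout=5.0):
    """POST one test span to <base_url>/v1/traces and return the HTTP status."""
    body = json.dumps(build_test_span()).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/traces",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

From an app in the same org, something like `send_test_span("http://<grafana-agent-name>.internal:4318")` would be the call to try. With the logging level set to debug as above, the agent logs are the place to look if the span does not show up in Tempo.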