Deploying Datadog Agent to enable sending metrics to Datadog from Fly.io app

We're using Datadog for our metrics instrumentation, and I think I'm close to a working implementation.

What I've done is instrument our app to use the Datadog tracer and send metrics to dd-agent.internal. I then set up a GitHub Action to run flyctl deploy --app dd-agent --image gcr.io/datadoghq/agent:7 -e DD_API_KEY=${{ secrets.API_KEY }} -e DD_SITE="datadoghq.com" -e DD_APM_NON_LOCAL_TRAFFIC=true

which I thought would deploy a new application using the Datadog agent container image. This didn't work: Fly.io failed with "Error failed fetching existing app config: Could not resolve"

I discovered I need to run flyctl launch from a project directory to start a new application. How would I do this if all I need to deploy is an existing image from a registry? Will I have to make a new GitHub repository, write a Dockerfile, and run flyctl launch from the new repository, or is there a simpler way to accomplish this?


From my experience, yes, a direct call to fly deploy fails because the referenced app name does not exist yet. You have to run fly launch first, specifying the image etc. there (flyctl launch).

What that does is set up the app's name, region, organization, and so on; without that, Fly wouldn't know where to put it. It writes a fly.toml file (App Configuration (fly.toml)) to the folder you run it from, which is where Fly records things like the app name, which ports to open, any health checks, and environment variables. You can also write that file by hand; if one already exists in the folder, fly launch will pick it up.

Beyond that, no, you don't need a GitHub repo.

So as far as I’m aware, you will initially at least need a folder for that fly.toml file.
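To make that concrete, here's a minimal sequence (a sketch, not verified end to end; it assumes the app name dd-agent, and uses flyctl's documented --image and --no-deploy flags):

```shell
# Create an empty folder just to hold fly.toml, then work from it.
mkdir dd-agent && cd dd-agent

# Generate fly.toml for an app built from an existing registry image;
# --no-deploy writes the config without deploying yet.
flyctl launch --image gcr.io/datadoghq/agent:7 --name dd-agent --no-deploy

# Set the API key as a secret rather than a plain env var, then deploy.
flyctl secrets set DD_API_KEY=...
flyctl deploy
```

After the first launch, later deploys from that folder (or from CI, with the fly.toml checked in) should work without recreating the app.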


I've got this to the point where the Datadog agent is running on the Fly.io host, and the app running on the other host is properly configured to target it. I think the fly.toml for the agent's networking is correct, but I'm struggling to figure out how to get it to pass a Fly.io health check.

Fly.io automatically fails a deployment that doesn't pass the health check specified in the application's fly.toml, and if you remove the health check from the TOML file, it applies a default one anyway.

I tried an HTTP check (GET /) and a TCP check with a 30-second grace period and a 5-second timeout, but both result in "Failed due to unhealthy allocations" errors, which crash the VM even though Datadog seems to be running.

Below is my fly.toml (8126 is the port the Datadog agent listens on):

# fly.toml file generated for dd-agent on 2022-05-05T11:33:31-07:00

app = "dd-agent"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "30s"
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 0
    timeout = 5000
    tls_skip_verify = true

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"
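One thing worth trying: the agent's trace port (8126) may not answer a plain GET / with a success status, which would fail an HTTP check even while the agent is healthy. A sketch of the services block with the HTTP check swapped for a TCP connect check (field names per Fly's fly.toml reference; untested here):

```toml
  # Replace [[services.http_checks]] with a plain TCP connect check;
  # port 8126 accepts connections even if it doesn't serve GET /.
  [[services.tcp_checks]]
    interval = "10s"
    grace_period = "30s"
    timeout = "2s"
    restart_limit = 0
```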


I made a new topic for this because it’s a new issue.

Thanks for laying the groundwork here. We’re new to datadog, and your setup helps us see how it might be configured within fly.

One thing you should check: I think your app config results in the agent being publicly reachable on port 8126. I’m not actually sure how to lock this down (related feature request). Sorry I couldn’t DM this.
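One option, assuming the agent only needs to be reachable from your own apps: drop the [[services]] section (and its public ports) from the agent's fly.toml entirely, and have the application target dd-agent.internal over Fly's private network instead. A sketch (names illustrative):

```toml
# Agent fly.toml with no [[services]] block: nothing is exposed
# publicly, but other apps in the same organization can still reach
# the agent at dd-agent.internal:8126 over the private 6PN network.
app = "dd-agent"
kill_signal = "SIGINT"
kill_timeout = 5
```

This also sidesteps the health-check problem, since Fly's service health checks only apply to declared services.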


What worked for me is to set DD_BIND_HOST=fly-global-services, and use fly-global-services for the statsd host.

app = "sm-datadog-agent"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[build]
image = "datadog/agent:latest"

[env]
DD_APM_ENABLED = "true"
DD_APM_NON_LOCAL_TRAFFIC = "true"
DD_BIND_HOST = "fly-global-services"
DD_LOG_LEVEL = "info"

# https://docs.datadoghq.com/tracing/trace_collection/open_standards/otlp_ingest_in_the_agent/?tab=docker
DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT = "0.0.0.0:4318"

[experimental]
allowed_public_ports = []
auto_rollback = true

[[services]]
internal_port = 8125
processes = ["app"]
protocol = "udp"

[[services]]
internal_port = 8126
processes = ["app"]
protocol = "tcp"

[[services]]
internal_port = 4318
processes = ["app"]
protocol = "tcp"

Has anyone figured out how to run the data through Vector instead of the DD agent? I feel like I'm paying an unnecessary tax here by booting up a bunch of extra VMs for Vector and the DD agent. It seems like Vector should be able to route this data, right? I thought that was kind of the point of it.