Deploying Datadog Agent to enable sending metrics to Datadog from Fly.io app

We are using datadog for our metrics instrumentation, and I think I’m close to a working implementation.

What I’ve done is instrument our app to use the datadog tracer and send metrics to dd-agent.internal . I then set up a Github action to run flyctl deploy --app dd-agent --image gcr.io/datadoghq/agent:7 -e DD_API_KEY=${{ secrets.API_KEY }} -e DD_SITE="datadoghq.com" -e DD_APM_NON_LOCAL_TRAFFIC=true

which I thought would deploy a new application using the datadog agent container image. This didn’t work because fly.io failed with “Error failed fetching existing app config: Could not resolve”

I discovered I need to run flyctl launch from a project directory to start a new application. How would I do this if all I need to deploy is an existing Dockerhub image? Will I have to make a new github repository, write a dockerfile, and run flyctl launch from the new repository or is there a simpler way to accomplish this?

From my experience, yes, a direct call to fly deploy fails because the referenced app name does not exist. You have to run fly launch first, specifying the image etc there (flyctl launch).

What that does is set up the app’s name, region, organization etc. Else it wouldn’t know where to put it. That’ll write a fly.toml file (App Configuration (fly.toml)) to the folder you run it from, which is where Fly records e.g the app name, which ports to open, any healthchecks, env variables etc. Unless you already have one of those in the folder, as you can make one manually. If so, when you run fly launch it would see that file.

Beyond that, no you don’t need a github repo.

So as far as I’m aware, you will initially at least need a folder for that fly.toml file.

I’ve got this to the point where the DataDog agent is running on the fly.io host, and the app running on the other host is properly configured to target the datadog agent, and I think I got the fly.toml file correct for the datadog agents networking, but I’m struggling to figure out how to get it to pass a fly.io healthcheck.

fly.io automatically crashes out a deployment if it doesn’t pass a health check specified in the fly.toml file for the application, and if you remove the health check from the toml file, it applies a default one anyways.

I tried a http check 'get \ ’ and a tcp check with a 30 second grace period & 5 second timeout but both result in “Failed due to unhealthy allocations” errors, which crash out the VM even though Datadog seems to be running.

Below is my fly.toml (8126 is the port the datadog agent listens on):

# fly.toml file generated for dd-agent on 2022-05-05T11:33:31-07:00

app = "dd-agent"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "30s"
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 0
    timeout = 5000
    tls_skip_verify = true

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"

I made a new topic for this because it’s a new issue.