Deployed & Running DataDog Agent crashes due to failing health check

I have a Datadog agent running on one fly.io host, and my app, running on another host, is configured to target that agent. I think I got the fly.toml right for the agent's networking, but I'm struggling to get the agent to pass a fly.io health check.

fly.io automatically fails a deployment if it doesn't pass a health check specified in the application's fly.toml file, and if you remove the health check from the file, it applies a default one anyway.

I tried an HTTP check (GET /) and a TCP check with a 30-second grace period and 5-second timeout, but both result in “Failed due to unhealthy allocations” errors, which take down the VM even though Datadog seems to be running.

Below is my fly.toml (8126 is the port the datadog agent listens on):

# fly.toml file generated for dd-agent on 2022-05-05T11:33:31-07:00

app = "dd-agent"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "30s"
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 0
    timeout = 5000
    tls_skip_verify = true

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"


Finally got it working with this fly.toml, which replaces the HTTP check with a TCP check and exposes port 8126 instead of 443:

app = "dd-agent"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 8126

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 0
    timeout = "10s"

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"
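
For anyone debugging a similar setup, here is a minimal Python sketch of what a TCP check roughly amounts to: try to open a connection to the port and treat success within the timeout as healthy. This is only an approximation for local debugging, not fly.io's actual check implementation. The function name `tcp_check` and the `127.0.0.1` host are assumptions for illustration; 8126 and the 10-second timeout mirror the config above.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for local debugging; it approximates a TCP health
    check but is not fly.io's implementation.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed host: run this from somewhere that can reach the agent.
    print(tcp_check("127.0.0.1", 8126))
```

If this returns False from inside the agent's VM, the agent is probably not listening on that port or interface yet, and a longer grace period alone is unlikely to help.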

Whoa, nice. Hopefully we’ll have sidecars someday and this’ll be way easier. :wink:


Thank you! Would you please also share the first part, where you set up a Datadog agent on a fly.io host and configured your app to connect to it? :slight_smile: