Deployed & Running DataDog Agent crashes due to failing health check

chasehippen · May 6, 2022, 8:45pm

I’ve gotten to the point where a DataDog agent is running on one fly.io host, and my app, running on another host, is properly configured to target that agent. I think I got the fly.toml networking for the agent right, but I’m struggling to get it to pass a fly.io health check.

fly.io automatically crashes out a deployment if it doesn’t pass a health check specified in the application’s fly.toml file, and if you remove the health check from the file, it applies a default one anyway.

I tried an HTTP check (GET /) and a TCP check with a 30-second grace period and a 5-second timeout, but both result in “Failed due to unhealthy allocations” errors, which crash out the VM even though Datadog seems to be running.
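For anyone debugging the same thing, the two kinds of check mentioned above can be approximated by hand. This is only a rough sketch of what such probes usually do, not fly.io's actual implementation: `tcp_check` and `http_check` are hypothetical helper names, and the assumption that an HTTP check needs a 2xx response is mine.

```python
import http.client
import socket


def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def http_check(host: str, port: int, path: str = "/", timeout: float = 5.0) -> bool:
    """Return True only if GET `path` answers with a 2xx status (assumed pass rule)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return 200 <= conn.getresponse().status < 300
    except (OSError, http.client.HTTPException):
        # No listener, no reply in time, or a reply that is not valid HTTP.
        return False
    finally:
        conn.close()
```

Run from somewhere that can reach the agent, `tcp_check(host, 8126)` and `http_check(host, 8126)` show whether the port merely accepts connections or also answers `GET /` successfully. A port can pass the first and still fail the second.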

Below is my fly.toml (8126 is the port the datadog agent listens on):

# fly.toml file generated for dd-agent on 2022-05-05T11:33:31-07:00

app = "dd-agent"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "30s"
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 0
    timeout = 5000
    tls_skip_verify = true

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"

chasehippen · May 6, 2022, 9:37pm

Finally got it working with this fly.toml:

app = "dd-agent"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 8126

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 0
    timeout = "10s"

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"

kurt · May 6, 2022, 10:21pm

Whoah nice. Hopefully we’ll have sidecars someday and this’ll be way easier.

ykd · May 31, 2022, 8:53pm

Thank you! Would you please also share the first part, where you set up a DataDog agent on a fly.io host and configured your app to connect to it?
