I’ve got to the point where a Datadog agent is running on one fly.io host, and my app, running on another host, is properly configured to target the agent. I think I have the fly.toml file correct for the Datadog agent’s networking, but I’m struggling to figure out how to get it to pass a fly.io health check.
fly.io automatically fails a deployment if it doesn’t pass a health check specified in the application’s fly.toml file, and if you remove the health check from fly.toml, it applies a default one anyway.
I tried an HTTP check (GET /) and a TCP check with a 30-second grace period and a 5-second timeout, but both result in “Failed due to unhealthy allocations” errors, which crash out the VM even though Datadog seems to be running.
Below is my fly.toml (8126 is the port the Datadog agent listens on):
# fly.toml file generated for dd-agent on 2022-05-05T11:33:31-07:00

app = "dd-agent"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8126"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8126
  processes = ["app"]
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = 10000
    grace_period = "30s"
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 0
    timeout = 5000
    tls_skip_verify = true

[[statics]]
  guest_path = "/app/public"
  url_prefix = "/static/"
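
For reference, the TCP check mentioned above would replace the [[services.http_checks]] block with something like the following. This is only a sketch, assuming fly.toml's [[services.tcp_checks]] table: the grace period and timeout match the values described above, while the 15s interval is an illustrative assumption and not a value from the original config.

```toml
# Sketch only: a TCP check against the service's internal_port (8126).
# It belongs inside the [[services]] section, in place of [[services.http_checks]].
  [[services.tcp_checks]]
    grace_period = "30s"   # time allowed for the agent to start before checks count
    interval = "15s"       # assumed value, for illustration only
    restart_limit = 0
    timeout = "5s"         # how long each connection attempt may take
```

A TCP check only verifies that something accepts connections on the port, so it does not depend on the agent answering GET / with a successful HTTP response.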