Cannot run telegraf on fly.io

Hi,

I am trying to run telegraf with an MQTT consumer on fly.io but it never starts after a deploy and the machines are in a stopped state. I get no logs at all - it stops at “Configuring firecracker”. Running the container locally works perfectly.

Logs:

2025-05-14 10:15:35.350   Configuring firecracker
2025-05-14 10:15:35.191   Configuring firecracker
2025-05-14 10:15:33.990	Successfully prepared image registry.fly.io/telegraf-emqx:deployment-01JV71706AWE2SJXFTJB1QJATW (881.796796ms)
2025-05-14 10:15:33.861	Successfully prepared image registry.fly.io/telegraf-emqx:deployment-01JV71706AWE2SJXFTJB1QJATW (847.777961ms)
2025-05-14 10:15:33.108	Pulling container image registry.fly.io/telegraf-emqx:deployment-01JV71706AWE2SJXFTJB1QJATW
2025-05-14 10:15:33.013	Pulling container image registry.fly.io/telegraf-emqx:deployment-01JV71706AWE2SJXFTJB1QJATW

Dockerfile:

FROM telegraf

COPY ./telegraf.conf /etc/telegraf/
COPY ./emqxsl-ca.pem /etc/telegraf/

fly.toml:

app = 'telegraf-emqx'
primary_region = 'fra'

[build]

[http_service]
  internal_port = 1883
  force_https = true
  auto_stop_machines = 'off'
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app']

[[services]]
  internal_port = 8883
  protocol = "tcp"
  auto_stop_machines = "off"
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app']
  [[services.ports]]
    port = 8883

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1

The MQTT consumer establishes an outbound connection (TLS port 8883) to EMQX and subscribes to topics. Telegraf writes the data to serverless InfluxDB over HTTPS (outbound again, regular TLS port 443). In other words, nothing listens for inbound traffic, so I don’t think I need any [[services]] or [http_service] section.
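A minimal sketch of the data flow described above, for anyone trying to reproduce the setup. The broker host, topic, payload format, InfluxDB URL, organization, bucket, and environment variable names are all placeholders (assumptions, not values from the real config); only the CA file path comes from the Dockerfile.

```toml
# telegraf.conf (sketch only; hosts, topic, org, and bucket are hypothetical)

# Outbound TLS connection to the EMQX broker on port 8883.
[[inputs.mqtt_consumer]]
  servers = ["ssl://broker.example.com:8883"]   # placeholder broker host
  topics = ["sensors/#"]                        # placeholder topic filter
  username = "${MQTT_USERNAME}"                 # assumed environment variables
  password = "${MQTT_PASSWORD}"
  tls_ca = "/etc/telegraf/emqxsl-ca.pem"        # file copied in by the Dockerfile
  data_format = "json"                          # assumed payload format

# Outbound HTTPS (port 443) writes to InfluxDB.
[[outputs.influxdb_v2]]
  urls = ["https://influxdb.example.com"]       # placeholder URL
  token = "${INFLUX_TOKEN}"                     # assumed environment variable
  organization = "example-org"
  bucket = "example-bucket"
```

Both plugins only open outbound connections, so a config like this exposes no listening port.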

Things I have tried:

  • fly.toml without the [[services]].
  • fly.toml without [http_service].
  • fly.toml with neither [[services]] nor [http_service].
  • Adding a [processes] section with app = "telegraf".
  • I tried copying the entrypoint.sh from the base layer into my layer and added echo statements everywhere - they never show up.
  • I tried adding:
[experimental]
  exec = ["sleep", "1d"]

so that I could SSH in to investigate but the machine is still stopped after re-deploying.

Does anyone have any ideas?

I managed to fix this myself. I set up a health check endpoint in telegraf using the [[outputs.health]] plugin and then configured a [checks.telegraf] HTTP check in my fly.toml, and it all works now. Here’s how it looks:

fly.toml:

app = 'telegraf-emqx'
primary_region = 'fra'

[build]

[checks]
  [checks.telegraf]
    grace_period = "30s"
    interval = "15s"
    method = "get"
    path = "/"
    port = 8080
    timeout = "10s"
    type = "http"
    [checks.telegraf.headers]
      Accept = "*/*"

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1

telegraf.conf (just the check):

[[outputs.health]]
  service_address = "http://:8080"

Presumably fly stops the machine too quickly for any logs to get out because, with no services and no checks, it thinks there is nothing healthy running. The health check gives it a way to know that telegraf is running OK.