FLAME runner is destroyed unexpectedly with no visible errors

My FLAME process is failing for some reason, and I can't debug the cause. It works fine on the local backend, and in production it works for smaller payloads.

I’m uploading and processing images, and the failure only occurs when I exceed a certain payload size. It is triggered at around 5 large files of 20-50 MB each.
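Roughly, the processing happens inside FLAME calls against the pool shown further down. A sketch of what the call site looks like (module and function names here are illustrative, not the exact app code):

    # Sketch of the call site (illustrative names, not the exact app code).
    # The uploaded binary is read on the web machine and captured by the
    # closure, so it gets copied to the runner as part of the FLAME call.
    def process_upload(path) do
      data = File.read!(path)

      FLAME.call(PhoenixAlbums.ImageProcessor, fn ->
        PhoenixAlbums.Images.process(data)   # hypothetical processing function
      end)
    end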

The only relevant FLAME log I get is:

2024-02-15T14:14:41Z app[0806250b612738] ams [info]14:14:41.353 [error] GenServer FLAME.Terminator.ChildPlacementSup terminating
2024-02-15T14:14:41Z app[0806250b612738] ams [info]** (stop) killed
2024-02-15T14:14:41Z app[0806250b612738] ams [info]Last message: {:EXIT, #PID<0.2625.0>, :killed}
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]14:14:41.352 [error] GenServer #PID<0.2913.0> terminating
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]** (stop) killed
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]Last message: {:DOWN, #Reference<0.3703568104.3533176833.226316>, :process, #PID<64302.2627.0>, :killed}
2024-02-15T14:14:41Z app[148e461a10d638] ams [info]State: %{runner: #FLAME.Runner<id: nil, instance_id: nil, private_ip: nil, backend: FLAME.FlyBackend, terminator: #PID<64302.2627.0>, node_name: nil, single_use: true, timeout: 30000, status: :booted, log: :debug, boot_timeout: 30000, idle_shutdown_after: 30000, idle_shutdown_check: #Function<8.81159202/0 in FLAME.Runner.new/1>, ...>, checkouts: %{}, otp_app: :phoenix_albums, backend_state: #FLAME.FlyBackend<host: "https://api.machines.dev", local_ip: ["fdaa:3:e5fc:a7b:c207:6cdf:6e23:2"], cpu_kind: "performance", cpus: 1, memory_mb: 4096, gpu_kind: nil, image: "registry.fly.io/phoenix-albums:deployment-01HPPHKRM3RX1WQ2E7DJFTC0KT", app: "phoenix-albums", boot_timeout: 30000, runner_id: "0806250b612738", remote_terminator_pid: #PID<64302.2627.0>, runner_node_basename: "phoenix-albums-01HPPHKRM3RX1WQ2E7DJFTC0KT", runner_instance_id: "01HPPHV48EZPS6AMZJPSCJ4B2H", runner_private_ip: "fdaa:3:e5fc:a7b:252:9f38:8dec:2", runner_node_name: :"phoenix-albums-01HPPHKRM3RX1WQ2E7DJFTC0KT@fdaa:3:e5fc:a7b:252:9f38:8dec:2", ...>}

but this might just be the regular timeout shutdown.

I made some runs with the following Pool config:

    {FLAME.Pool,
     name: PhoenixAlbums.ImageProcessor,
     shutdown_timeout: 1_200_000,
     idle_shutdown_after: 1_200_000,
     timeout: 1_200_000,
     min: 1,
     max: 10,
     max_concurrency: 1,
     single_use: true,
     log: :debug}

to ensure that we always have at least one machine and that it can’t possibly time out for longer-running processes. The results are the same: the runner still exits without any visible errors. When uploading 5 images at ~120 MB, it exits after about 75 s and usually the last 2 images are not processed. The same happens with a large number of small images: for 130 files at about ~100 MB, only 25 are processed and the runner exits after ~15 s.
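For reference, the same long timeout can also be passed per call (assuming the :timeout option on FLAME.call/3), so a single long-running call shouldn’t be cut off by a shorter default either. A sketch, with a hypothetical processing function:

    # Sketch: the pool-level timeout repeated per call (assuming FLAME.call/3
    # accepts a :timeout option); Images.process/1 is hypothetical.
    def process_upload(data) do
      FLAME.call(PhoenixAlbums.ImageProcessor, fn ->
        PhoenixAlbums.Images.process(data)
      end, timeout: 1_200_000)
    end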

One message that recurs every time in the logs is: Reaped child process with pid: 365 and signal: SIGUSR1, core dumped? false.

Also, the runner machine is always destroyed after it exits, despite the min: 1 option.

Why is the runner machine destroyed? How do I find the cause of this?

The issue was resolved in FLAME version 0.1.10.
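For anyone landing here later: bumping the dependency picks up the fix (assuming FLAME is pulled in as the :flame Hex package):

    # mix.exs — require a FLAME release that contains the fix
    defp deps do
      [
        {:flame, "~> 0.1.10"}
      ]
    end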
