SIGKILL Reaped Child Processes during job on L40S VM

Hey all;

I’m using a 16 vCPU, 64GB machine with an L40S on it, and my task is simple. I render a few thousand frames in a headless browser using Remotion (Node.js) and encode a video from them using FFmpeg.

While rendering frames, I keep getting:


INFO Main child exited normally with code: 130
WARN Reaped child process with pid: 894 and signal: SIGKILL, core dumped? false
WARN Reaped child process with pid: 922 and signal: SIGKILL, core dumped? false

reboot: Power down


I have a pretty regular toml file, I autoscale from 0, I stop machines when idle, and I use a soft and hard limit of 0 for concurrency. So I’m sort of mimicking a lambda situation here.

I was wondering if any of you have experienced the same errors? What was it for, and how did you manage to fix it?

Thanking you in advance!

How many concurrent headless instances are you running? They’re pretty resource-intensive, which would explain the OOM warnings.

How does computation in this machine work? Do you use an HTTP request to start the computation, and does the HTTP client wait for the computation to complete while holding the connection open? If nothing holds the connection open, then fly-proxy is free to scale the instance down (i.e. stop it) since from its point of view the machine is serving 0 requests.

If you do need background tasks to keep running even without an active client-side request, you might want to consider disabling autostop and instead have your machine exit (by exiting the main process) once it is done processing all jobs.

Good question but doesn’t Fly usually exit with code 0 when it autoscales down to 0?

OP said

So I’m assuming OP is manually handling when the app is idle.

It depends on how the process inside the machine reacts to the kill signal. Exit code 130 means the process was killed with SIGINT and is the default behavior for a process without a custom signal handler. SIGINT is also the default kill_signal we send to machines on stop.
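
The 128 + signal number convention behind code 130 can be checked from Node itself. This is only a minimal sketch, assuming a Linux machine; `signalFromExitCode` is a hypothetical helper name, not part of any Fly or Remotion API.

```javascript
const os = require("node:os");

// Hypothetical helper: map a shell-style exit code (128 + signal number)
// back to a signal name, or return null for ordinary exit codes.
function signalFromExitCode(code) {
  if (code <= 128) return null;
  const signo = code - 128;
  const match = Object.entries(os.constants.signals)
    .find(([, value]) => value === signo);
  return match ? match[0] : null;
}

console.log(signalFromExitCode(130)); // SIGINT on Linux (signal 2)
console.log(signalFromExitCode(137)); // SIGKILL on Linux (signal 9)
console.log(signalFromExitCode(0));   // null
```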

I’m spinning up exactly one Chromium instance using Remotion itself, so it’s not even custom. I use Remotion’s own openBrowser function to spin it up.

Good point… Yes, I’m starting the job with a POST request, and the client doesn’t await it; it’s a background job. Since Fly is dropping it in the middle of the pipeline, how would adding manual exits help…?

Sorry if I confused you, no I’m using auto_stop of “stop” and a min machines of 0.

Setting auto_stop_machines to off would prevent Fly from stopping the machine at all, and then adding a manual exit after the job is done guarantees that your machine will only be stopped when it knows it is done. This is the recommended configuration for machines with client-side triggered long-running jobs, since there’s no good way for the Fly platform to tell whether the job in your machine is actually done or not without active connections.
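
In fly.toml terms, that might look roughly like the fragment below. It is a sketch, not OP's actual config: the `[http_service]` section and port 8080 are assumptions, and the string value `"off"` assumes a recent flyctl (older versions take `false`).

```toml
[http_service]
  internal_port = 8080          # assumption: the port the app listens on
  auto_stop_machines = "off"    # the proxy never stops the machine on its own
  auto_start_machines = true    # a new request still starts a stopped machine
  min_machines_running = 0      # allow scaling to zero once the app exits
```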

Just to clarify, by manual exit you mean a process.exit(1 | 0) call?

Exit 0; non-zero would restart the machine, I believe.

Yes, just exit the process with code 0 when it is done. That’ll put the machine into a stopped state – and you can rely on autostart to get the machine to start again if a new request comes in.
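
One way to do this when several POSTs can overlap is to count in-flight jobs and exit only when the last one finishes. The following is a rough sketch under that assumption; `createJobTracker` and `renderJob` are hypothetical names, and the port is a placeholder.

```javascript
// Hypothetical helper: counts in-flight background jobs and calls onIdle
// once the last one finishes.
function createJobTracker(onIdle) {
  let active = 0;
  return {
    // Call when a job starts; returns a function to call when it ends.
    begin() {
      active += 1;
      let finished = false;
      return function done() {
        if (finished) return; // ignore accidental double calls
        finished = true;
        active -= 1;
        if (active === 0) onIdle();
      };
    },
    get active() {
      return active;
    },
  };
}

// Usage sketch (renderJob stands in for the Remotion + FFmpeg pipeline):
//
//   const http = require("node:http");
//   const tracker = createJobTracker(() => process.exit(0));
//   http.createServer((req, res) => {
//     if (req.method !== "POST") { res.writeHead(405).end(); return; }
//     const done = tracker.begin();
//     renderJob().catch(console.error).finally(done);
//     res.writeHead(202).end(); // reply right away; the job keeps running
//   }).listen(8080);
```

Note that this exits with code 0 even when a job fails, so the machine stops instead of restarting; whether a failed render should exit non-zero is a separate choice.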

Thanks Peter, I’m testing this method. I’ll mark it as a solution once I get it running!