I’m using a 16 vCPU, 64GB machine with an L40S on it, and my task is simple: I render a few thousand frames in a headless browser using Remotion (Node.js) and encode them into a video using FFmpeg.
While rendering frames, I keep getting:
INFO Main child exited normally with code: 130
WARN Reaped child process with pid: 894 and signal: SIGKILL, core dumped? false
WARN Reaped child process with pid: 922 and signal: SIGKILL, core dumped? false
…
reboot: Power down
I have a pretty regular toml file, I autoscale from 0, I stop machines when idle, and I use a soft and hard limit of 0 for concurrency. So I’m sort of mimicking a lambda situation here.
I was wondering if any of you have experienced the same errors? What was it for, and how did you manage to fix it?
How does computation in this machine work? Do you use an HTTP request to start the computation, and does the HTTP client wait for the computation to complete while holding the connection open? If nothing holds the connection open, then fly-proxy is free to scale the instance down (i.e. stop it) since from its point of view the machine is serving 0 requests.
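To make the "hold the connection open" option concrete, here is a minimal sketch of a handler that only responds once the job has finished, so the request stays in flight for the whole computation. The names makeHandler and renderJob and the port are hypothetical, and whether a multi-minute request is practical depends on timeouts along the path, which this thread does not cover.

```javascript
const http = require("node:http");

// Hypothetical stand-in for the real render + encode pipeline.
async function renderJob() {
  return { status: "done" };
}

// Builds a request handler that keeps the HTTP request open until `job`
// settles, so the request counts as active for the whole computation.
function makeHandler(job) {
  return async (req, res) => {
    if (req.method !== "POST") {
      res.writeHead(405).end();
      return;
    }
    try {
      const result = await job();
      res.writeHead(200, { "content-type": "application/json" });
      res.end(JSON.stringify(result));
    } catch (err) {
      res.writeHead(500).end(String(err));
    }
  };
}

// Usage (port 8080 is an assumption):
// http.createServer(makeHandler(renderJob)).listen(8080);
```

The client would then have to await the response instead of firing the POST and moving on.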
If you do need a background task to keep running even without an active client-side request, you might want to consider disabling autostop and instead have your machine exit (by exiting the main process) once it is done processing all jobs.
It depends on how the process inside the machine reacts to the kill signal. Exit code 130 (128 + 2, where 2 is the signal number of SIGINT) means the process was killed with SIGINT, which is the default behavior for a process without a custom signal handler. SIGINT is also the default kill_signal we send to machines on stop.
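As a quick way to decode exit codes like the 130 in the log, the usual convention is that a process terminated by signal N is reported with status 128 + N. A small sketch using Node's built-in signal table (exitCodeForSignal is a hypothetical helper name):

```javascript
const os = require("node:os");

// Conventional exit status for a process terminated by a signal: 128 + N.
function exitCodeForSignal(name) {
  const num = os.constants.signals[name];
  if (num === undefined) throw new Error(`unknown signal: ${name}`);
  return 128 + num;
}

console.log(exitCodeForSignal("SIGINT")); // 130 (SIGINT is signal 2)
```

So the "code: 130" line is consistent with the main process receiving SIGINT when the machine was stopped.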
Good point… Yes, I’m starting the job using a POST request, and the client doesn’t await it; it’s a background job. Since Fly is dropping it in the middle of the pipeline, how would adding manual exits help…?
Setting auto_stop_machines to off would prevent Fly from stopping the machine at all, and then adding a manual exit after the job is done guarantees that your machine will only be stopped when it knows it is done. This is the recommended configuration for machines with client-side triggered long-running jobs, since there’s no good way for the Fly platform to tell whether the job in your machine is actually done or not without active connections.
Yes, just exit the process with code 0 when it is done. That’ll put the machine into a stopped state – and you can rely on autostart to get the machine to start again if a new request comes in.
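A minimal sketch of that "exit when done" pattern in Node.js, assuming jobs are started from the POST handler and run in the background. createJobTracker and renderVideo are hypothetical names, and the exit function is injectable only so the logic can be exercised without killing the process.

```javascript
// Counts in-flight background jobs and calls exit(0) when the last one ends.
function createJobTracker(exit = (code) => process.exit(code)) {
  let active = 0;
  const tracker = {
    start() {
      active += 1;
    },
    done() {
      active -= 1;
      if (active === 0) exit(0); // nothing left to do: stop the machine
    },
    // Runs `job` (a function returning a promise) and tracks its lifetime.
    run(job) {
      tracker.start();
      return Promise.resolve().then(job).finally(() => tracker.done());
    },
    activeCount() {
      return active;
    },
  };
  return tracker;
}

// Usage inside the POST handler (renderVideo is hypothetical):
// tracker.run(() => renderVideo(payload)).catch((err) => console.error(err));
// res.writeHead(202).end(); // respond immediately, job continues
```

Note that this sketch exits with code 0 even when the last job failed; a failed render may deserve different handling.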