Zombie machine appears to be running, doesn't show up in fly machines list

I have an app running a celery worker that keeps getting killed due to OOM errors. E.g., I see:

2023-07-23T17:06:16Z app[2c15f24d] lax [info][    6.339563] Out of memory: Killed process 282 (celery) total-vm:193288kB, anon-rss:122724kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:388kB oom_score_adj:0
2023-07-23T17:06:17Z app[2c15f24d] lax [info]    raise WorkerLostError('Could not start worker processes')

I thought I was running out of memory, but then I realized that when I view only the logs of the machine running my celery worker, everything looks fine.
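One quick sanity check here is to pull the numbers out of the OOM line and see how much memory the killed process actually held. The following is a minimal sketch, assuming the kernel message has the same shape as the line quoted above; the regex and the `parse_oom_line` helper are illustrative names, not part of any Fly.io tooling.

```python
import re

# Pattern modelled on the single OOM line quoted in this post (an assumption:
# other kernels or log shippers may format the message differently).
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_line(line):
    """Return the pid, process name and memory figures from an OOM line, or None."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "total_vm_kb": int(m.group("total_vm")),
        "anon_rss_kb": int(m.group("anon_rss")),
    }


sample = (
    "2023-07-23T17:06:16Z app[2c15f24d] lax [info][    6.339563] "
    "Out of memory: Killed process 282 (celery) total-vm:193288kB, "
    "anon-rss:122724kB, file-rss:0kB, shmem-rss:0kB, UID:0 "
    "pgtables:388kB oom_score_adj:0"
)
info = parse_oom_line(sample)
print(f"{info['name']} held about {info['anon_rss_kb'] / 1024:.0f} MiB resident when killed")
```

For the line above this reports roughly 120 MiB of resident memory, which is small compared with the 2048MB the worker machine in this thread is sized at. That does not prove anything on its own, but it is at least consistent with the kill happening somewhere other than that machine.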

Indeed, the machine ID with the OOM errors (2c15f24d) does not show up anywhere when I run fly machines list:

~/w/m/cloak_and_dagger main• ❱ fly machine list
2 machines have been retrieved from app cloak-and-dagger.
View them in the UI here

ID            	NAME                	STATE  	REGION	IMAGE                                                 	IP ADDRESS                      	VOLUME	CREATED             	LAST UPDATED        	APP PLATFORM	PROCESS GROUP	SIZE
32874549f5dee8	wandering-grass-708 	started	lax   	cloak-and-dagger:deployment-01H5ZKRG9C56546165M7ZQWG0M	fdaa:0:d0ec:a7b:f8:5545:c39b:2  	      	2023-07-13T03:44:07Z	2023-07-23T17:18:59Z	v2          	worker       	shared-cpu-1x:2048MB
e78496ef455083	restless-cherry-3717	started	lax   	cloak-and-dagger:deployment-01H5ZKRG9C56546165M7ZQWG0M	fdaa:0:d0ec:a7b:c0da:d29f:6f4e:2	      	2023-07-13T03:12:31Z	2023-07-23T17:19:07Z	v2          	web          	shared-cpu-1x:512MB
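The comparison described above (instance IDs seen in the logs versus IDs in the machines list) can be automated. This is a rough sketch, assuming log lines carry the instance ID in an `app[<id>]` field as in the lines quoted earlier; `unknown_instances` is a hypothetical helper, and the log lines and machine IDs are copied from this post rather than fetched from Fly.io.

```python
import re

# Assumption: the instance ID appears as "app[<hex id>]" in each log line,
# as in the sample lines quoted in this post.
INSTANCE_RE = re.compile(r"\bapp\[([0-9a-f]+)\]")


def unknown_instances(log_lines, machine_ids):
    """Return instance IDs found in the logs that are not in machine_ids.

    IDs are returned in order of first appearance, without duplicates.
    """
    known = set(machine_ids)
    unknown = []
    for line in log_lines:
        m = INSTANCE_RE.search(line)
        if m is None:
            continue
        instance_id = m.group(1)
        if instance_id not in known and instance_id not in unknown:
            unknown.append(instance_id)
    return unknown


# Machine IDs copied from the `fly machine list` output above.
machines = ["32874549f5dee8", "e78496ef455083"]

# Log lines copied from the top of this post (the first one is truncated here).
logs = [
    "2023-07-23T17:06:16Z app[2c15f24d] lax [info][    6.339563] Out of memory: Killed process 282 (celery)",
    "2023-07-23T17:06:17Z app[2c15f24d] lax [info]    raise WorkerLostError('Could not start worker processes')",
]

print(unknown_instances(logs, machines))
```

Any ID this prints is producing logs for the app without being one of its listed machines, which is exactly the situation with 2c15f24d.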

So it appears I have a zombie machine that keeps trying to run my app and keeps running out of memory. I cannot stop machine 2c15f24d:

~/w/m/cloak_and_dagger main• ❱ fly machines stop 2c15f24d
Sending kill signal to machine 2c15f24d...
Error: could not stop machine 2c15f24d: failed to stop VM 2c15f24d: invalid machine ID, '2c15f24d'

I upgraded to Apps V2 a few weeks ago, and this may have something to do with it. Is this perhaps an old VM still running on Apps V1? I don’t know. I tried to kill the machine using fly vm stop:

~/w/m/cloak_and_dagger main• [1] fly vm stop 2c15f24d
VM 2c15f24d is being stopped

While that seemed promising, nothing appears to have happened. I can still run fly logs -i 2c15f24d and it shows me logs for this zombie machine.

To make matters worse, as I composed this topic, the issue happened again, this time with a new machine ID: a1992ae1. So it appears that new zombie instances can come and go, and I have no way to stop them.

Anyone know how I can fix this? It is polluting my logs and I have no way to control it.

Update: I occasionally get emails from fly.io saying that a process was killed due to OOM. Neither of my two Fly machines shows any OOM errors, so it appears I’m being notified about the “zombie” machine. Interestingly, the proposed remedy in the email is:

When you’re ready, add more RAM by running:

fly scale vm shared-cpu-0x --memory 1024 -a cloak-and-dagger

This command fails with:

Error: 'shared-cpu-0x' is an invalid machine size, choose one of: [shared-cpu-1x shared-cpu-2x shared-cpu-4x shared-cpu-8x]

Hi @jmuncaster, as you suspected, that was a Nomad VM ID. A Nomad job left over from the V1->V2 migration kept trying to get an allocation up. We’ve removed that job, and hopefully everything should behave now.

shared-cpu-0x is a bug in the logic that generates the OOM suggestion text! You shouldn’t hit that again, I think. But let us know if you do.

Thank you for the reply @catflydotio! Glad to get the issue resolved and help locate/fix the minor bug.
