I have an app running a celery worker that keeps getting killed due to OOM errors. E.g., I see:
2023-07-23T17:06:16Z app[2c15f24d] lax [info][ 6.339563] Out of memory: Killed process 282 (celery) total-vm:193288kB, anon-rss:122724kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:388kB oom_score_adj:0
...
2023-07-23T17:06:17Z app[2c15f24d] lax [info] raise WorkerLostError('Could not start worker processes')
I thought I was running out of memory, but then I realized that when I view only the logs of the machine running my celery worker, everything looks fine.
Indeed, it turns out that the machine ID with the OOM errors (2c15f24d) does not show up anywhere when I run fly machines list:
~/w/m/cloak_and_dagger main• ❱ fly machine list
2 machines have been retrieved from app cloak-and-dagger.
View them in the UI here
cloak-and-dagger
ID              NAME                  STATE    REGION  IMAGE                                                   IP ADDRESS                        VOLUME  CREATED               LAST UPDATED          APP PLATFORM  PROCESS GROUP  SIZE
32874549f5dee8  wandering-grass-708   started  lax     cloak-and-dagger:deployment-01H5ZKRG9C56546165M7ZQWG0M  fdaa:0:d0ec:a7b:f8:5545:c39b:2            2023-07-13T03:44:07Z  2023-07-23T17:18:59Z  v2            worker         shared-cpu-1x:2048MB
e78496ef455083  restless-cherry-3717  started  lax     cloak-and-dagger:deployment-01H5ZKRG9C56546165M7ZQWG0M  fdaa:0:d0ec:a7b:c0da:d29f:6f4e:2          2023-07-13T03:12:31Z  2023-07-23T17:19:07Z  v2            web            shared-cpu-1x:512MB
So it appears I have a zombie machine that keeps trying to run my app and keeps running out of memory. I cannot stop the machine:
~/w/m/cloak_and_dagger main• ❱ fly machines stop 2c15f24d
Sending kill signal to machine 2c15f24d...
Error: could not stop machine 2c15f24d: failed to stop VM 2c15f24d: invalid machine ID, '2c15f24d'
I upgraded to appsv2 a few weeks ago and this may have something to do with it. Is this perhaps an old VM running appsv1? I don’t know. I tried to kill the machine using fly vm stop:
~/w/m/cloak_and_dagger main• ❱ fly vm stop 2c15f24d
VM 2c15f24d is being stopped
And while that seemed promising, nothing appears to have happened. I can still run fly logs -i 2c15f24d and it shows me logs for this zombie machine.
To make matters worse, as I composed this topic, the issue happened again, but now with a new machine ID: a1992ae1. So it appears that new zombie instances can come and go and I have no way to stop them.
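For reference, the check I have been doing by hand can be sketched in Python as below. This is only a rough sketch: `unknown_instances` is a name I made up, the regex assumes the instance ID always appears as `app[<id>]` as in the log lines above, and treating an ID as known when a listed machine ID starts with it is my own assumption about how log IDs relate to machine IDs. The first log line is abridged.

```python
import re

# Assumption: the instance ID appears as "app[<hex id>]" in each log line,
# as it does in the log excerpts quoted in this post.
INSTANCE_RE = re.compile(r"\bapp\[([0-9a-f]+)\]")


def unknown_instances(log_lines, machine_ids):
    """Return instance IDs seen in the logs that match no listed machine.

    An ID counts as known if some listed machine ID starts with it
    (an assumption; exact equality is covered by the same test).
    IDs are returned in first-seen order without duplicates.
    """
    unknown = []
    for line in log_lines:
        match = INSTANCE_RE.search(line)
        if not match:
            continue
        instance = match.group(1)
        known = any(mid.startswith(instance) for mid in machine_ids)
        if not known and instance not in unknown:
            unknown.append(instance)
    return unknown


# Machine IDs copied from the machine list output in this post.
machine_ids = ["32874549f5dee8", "e78496ef455083"]

# Log lines copied from this post (first one shortened).
log_lines = [
    "2023-07-23T17:06:16Z app[2c15f24d] lax [info][ 6.339563] Out of memory: Killed process 282 (celery)",
    "2023-07-23T17:06:17Z app[2c15f24d] lax [info] raise WorkerLostError('Could not start worker processes')",
]

print(unknown_instances(log_lines, machine_ids))
```

With the two log lines and two machines from this post, it reports 2c15f24d as the only instance that does not correspond to a listed machine.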
Anyone know how I can fix this? It is polluting my logs and I have no way to control it.