I have an app running a Celery worker that keeps getting killed by the kernel's OOM killer. For example, I see:
2023-07-23T17:06:16Z app[2c15f24d] lax [info][ 6.339563] Out of memory: Killed process 282 (celery) total-vm:193288kB, anon-rss:122724kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:388kB oom_score_adj:0
...
2023-07-23T17:06:17Z app[2c15f24d] lax [info] raise WorkerLostError('Could not start worker processes')
At first I thought my worker was running out of memory, but when I view only the logs of the machine that actually runs my Celery worker, everything looks fine.
Indeed, the machine ID in the OOM errors (2c15f24d) does not show up anywhere in the output of fly machine list:
~/w/m/cloak_and_dagger main• ❱ fly machine list
2 machines have been retrieved from app cloak-and-dagger.
View them in the UI here
cloak-and-dagger
ID              NAME                  STATE    REGION  IMAGE                                                   IP ADDRESS                        VOLUME  CREATED               LAST UPDATED          APP PLATFORM  PROCESS GROUP  SIZE
32874549f5dee8  wandering-grass-708   started  lax     cloak-and-dagger:deployment-01H5ZKRG9C56546165M7ZQWG0M  fdaa:0:d0ec:a7b:f8:5545:c39b:2            2023-07-13T03:44:07Z  2023-07-23T17:18:59Z  v2            worker         shared-cpu-1x:2048MB
e78496ef455083  restless-cherry-3717  started  lax     cloak-and-dagger:deployment-01H5ZKRG9C56546165M7ZQWG0M  fdaa:0:d0ec:a7b:c0da:d29f:6f4e:2          2023-07-13T03:12:31Z  2023-07-23T17:19:07Z  v2            web            shared-cpu-1x:512MB
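For anyone who wants to repeat this check, here is a rough sketch of how the comparison could be scripted. It assumes log lines carry the instance ID in an app[...] field, as in the excerpt above. The function names are hypothetical, and the known machine IDs are copied by hand from the listing rather than fetched from any Fly API.

```python
import re

# Assumption: the instance ID appears as "app[<hex id>]", as in the log excerpt above.
INSTANCE_RE = re.compile(r"\bapp\[([0-9a-f]+)\]")


def oom_instance_ids(log_lines):
    """Return the set of instance IDs found on 'Out of memory' log lines."""
    ids = set()
    for line in log_lines:
        if "Out of memory" in line:
            match = INSTANCE_RE.search(line)
            if match:
                ids.add(match.group(1))
    return ids


def unknown_instances(log_lines, known_machine_ids):
    """Return OOM instance IDs that are missing from the known machine list."""
    return oom_instance_ids(log_lines) - set(known_machine_ids)


# Two lines taken from the log excerpt above.
logs = [
    "2023-07-23T17:06:16Z app[2c15f24d] lax [info][ 6.339563] Out of memory: "
    "Killed process 282 (celery) total-vm:193288kB, anon-rss:122724kB, "
    "file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:388kB oom_score_adj:0",
    "2023-07-23T17:06:17Z app[2c15f24d] lax [info] "
    "raise WorkerLostError('Could not start worker processes')",
]

# IDs copied by hand from the machine listing above.
known = ["32874549f5dee8", "e78496ef455083"]

print(sorted(unknown_instances(logs, known)))  # prints ['2c15f24d']
```

The match is exact, so an ID in a different format from the listed machine IDs is always reported as unknown.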
So it appears I have a zombie machine that keeps trying to run my app and keeps running out of memory. I cannot stop the machine 2c15f24d:
~/w/m/cloak_and_dagger main• ❱ fly machines stop 2c15f24d
Sending kill signal to machine 2c15f24d...
Error: could not stop machine 2c15f24d: failed to stop VM 2c15f24d: invalid machine ID, '2c15f24d'
I upgraded to Apps V2 a few weeks ago, and this may have something to do with it. Is this perhaps an old VM still running on Apps V1? I don’t know. I also tried to kill the machine using fly vm stop:
~/w/m/cloak_and_dagger main• [1] fly vm stop 2c15f24d
VM 2c15f24d is being stopped
That looked promising, but nothing seems to have happened: I can still run fly logs -i 2c15f24d and see logs for this zombie machine.
To make matters worse, the issue happened again while I was composing this topic, this time with a new machine ID: a1992ae1. So it appears that new zombie instances come and go, and I have no way to stop them.
Does anyone know how I can fix this? It is polluting my logs, and I have no way to control it.