Sometimes machines hang and don't shut themselves down (and I get billed for it)

I have an architecture for processing background jobs: I create worker machines that each process a job and then immediately shut down (and I delete them).
Each worker machine usually takes about 1 or 2 minutes to process the job.

Sometimes I notice worker machines that still show as up and running in the UI. I check the logs and see things like this:

2025-01-09 08:02:57.506
[ 6791.699466] rcu: 	(detected by 4, t=1696445 jiffies, g=-575, q=150333 ncpus=8)
2025-01-09 08:02:57.505	
[ 6791.699032] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=339324
2025-01-09 08:02:57.505	
[ 6791.698547] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=339324
2025-01-09 08:02:57.504	
[ 6791.698099] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 08:00:00.496	
[ 6614.689444] rcu: 	(detected by 4, t=1652190 jiffies, g=-575, q=146998 ncpus=8)
2025-01-09 08:00:00.492	
[ 6614.685408] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=330472
2025-01-09 08:00:00.488	
[ 6614.681381] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=330472
2025-01-09 08:00:00.484	
[ 6614.678098] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:57:03.460	
[ 6437.659674] rcu: 	(detected by 4, t=1607935 jiffies, g=-575, q=143671 ncpus=8)
2025-01-09 07:57:03.459	
[ 6437.659100] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=321622
2025-01-09 07:57:03.459	
[ 6437.658550] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=321622
2025-01-09 07:57:03.458	
[ 6437.658081] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:54:06.436	
[ 6260.639421] rcu: 	(detected by 4, t=1563680 jiffies, g=-575, q=140371 ncpus=8)
2025-01-09 07:54:06.436	
[ 6260.638946] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=312770
2025-01-09 07:54:06.435	
[ 6260.638486] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=312770
2025-01-09 07:54:06.435	
[ 6260.638073] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:51:09.413	
[ 6083.619840] rcu: 	(detected by 4, t=1519425 jiffies, g=-575, q=137027 ncpus=8)
2025-01-09 07:51:09.413	
[ 6083.619311] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=303918
2025-01-09 07:51:09.412	
[ 6083.618707] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=303918
2025-01-09 07:51:09.412	
[ 6083.618056] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:48:12.390	
[ 5906.599453] rcu: 	(detected by 4, t=1475170 jiffies, g=-575, q=133693 ncpus=8)
2025-01-09 07:48:12.389	
[ 5906.598949] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=295066
2025-01-09 07:48:12.389	
[ 5906.598422] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=295066
2025-01-09 07:48:12.388	
[ 5906.598041] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:45:15.366	
[ 5729.579285] rcu: 	(detected by 4, t=1430915 jiffies, g=-575, q=130361 ncpus=8)
2025-01-09 07:45:15.366	
[ 5729.578862] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=286214
2025-01-09 07:45:15.366	
[ 5729.578404] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=286214
2025-01-09 07:45:15.365	
[ 5729.578031] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:42:18.344	
[ 5552.559716] rcu: 	(detected by 4, t=1386660 jiffies, g=-575, q=127057 ncpus=8)
2025-01-09 07:42:18.343	
[ 5552.559054] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=277362
2025-01-09 07:42:18.342	
[ 5552.558498] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=277362
2025-01-09 07:42:18.342	
[ 5552.558025] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:39:21.320	
[ 5375.539407] rcu: 	(detected by 4, t=1342405 jiffies, g=-575, q=123720 ncpus=8)
2025-01-09 07:39:21.320	
[ 5375.538943] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=268510
2025-01-09 07:39:21.319	
[ 5375.538439] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=268510
2025-01-09 07:39:21.319	
[ 5375.538012] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:36:24.297	
[ 5198.519342] rcu: 	(detected by 4, t=1298150 jiffies, g=-575, q=120385 ncpus=8)
2025-01-09 07:36:24.296	
[ 5198.518864] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=259658
2025-01-09 07:36:24.296	
[ 5198.518392] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=259658
2025-01-09 07:36:24.295	
[ 5198.518005] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:33:27.274	
[ 5021.499391] rcu: 	(detected by 4, t=1253895 jiffies, g=-575, q=117093 ncpus=8)
2025-01-09 07:33:27.273	
[ 5021.498956] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=250808
2025-01-09 07:33:27.273	
[ 5021.498429] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=250808
2025-01-09 07:33:27.272	
[ 5021.497995] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:30:30.251	
[ 4844.479538] rcu: 	(detected by 6, t=1209640 jiffies, g=-575, q=113738 ncpus=8)
2025-01-09 07:30:30.250	
[ 4844.479057] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=241956
2025-01-09 07:30:30.250	
[ 4844.478476] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=241956
2025-01-09 07:30:30.249	
[ 4844.477984] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:27:33.227	
[ 4667.459363] rcu: 	(detected by 4, t=1165385 jiffies, g=-575, q=110417 ncpus=8)
2025-01-09 07:27:33.227	
[ 4667.458870] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=233104
2025-01-09 07:27:33.226	
[ 4667.458388] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=233104
2025-01-09 07:27:33.226	
[ 4667.457974] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:24:36.204	
[ 4490.439251] rcu: 	(detected by 4, t=1121130 jiffies, g=-575, q=107121 ncpus=8)
2025-01-09 07:24:36.203	
[ 4490.438781] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=224254
2025-01-09 07:24:36.203	
[ 4490.438332] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=224254
2025-01-09 07:24:36.202	
[ 4490.437967] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:21:39.181	
[ 4313.419366] rcu: 	(detected by 4, t=1076875 jiffies, g=-575, q=103793 ncpus=8)
2025-01-09 07:21:39.180	
[ 4313.418897] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=215402
2025-01-09 07:21:39.180	
[ 4313.418394] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=215402
2025-01-09 07:21:39.179	
[ 4313.417948] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:18:42.157	
[ 4136.399303] rcu: 	(detected by 4, t=1032620 jiffies, g=-575, q=100442 ncpus=8)
2025-01-09 07:18:42.157	
[ 4136.398835] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=206550
2025-01-09 07:18:42.157	
[ 4136.398320] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=206550
2025-01-09 07:18:42.156	
[ 4136.397934] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:15:45.135	
[ 3959.379548] rcu: 	(detected by 4, t=988365 jiffies, g=-575, q=97112 ncpus=8)
2025-01-09 07:15:45.134	
[ 3959.378966] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=197698
2025-01-09 07:15:45.133	
[ 3959.378388] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=197698
2025-01-09 07:15:45.133	
[ 3959.377937] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:12:48.111	
[ 3782.359264] rcu: 	(detected by 1, t=944110 jiffies, g=-575, q=93863 ncpus=8)
2025-01-09 07:12:48.111	
[ 3782.358790] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=188846
2025-01-09 07:12:48.110	
[ 3782.358312] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=188846
2025-01-09 07:12:48.110	
[ 3782.357922] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:09:51.088	
[ 3605.339103] rcu: 	(detected by 4, t=899855 jiffies, g=-575, q=90567 ncpus=8)
2025-01-09 07:09:51.087	
[ 3605.338688] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=179994
2025-01-09 07:09:51.087	
[ 3605.338278] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=179994
2025-01-09 07:09:51.086	
[ 3605.337909] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:06:54.065	
[ 3428.319331] rcu: 	(detected by 4, t=855600 jiffies, g=-575, q=87225 ncpus=8)
2025-01-09 07:06:54.064	
[ 3428.318886] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=171142
2025-01-09 07:06:54.064	
[ 3428.318429] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=171142
2025-01-09 07:06:54.063	
[ 3428.317895] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:03:57.041	
[ 3251.299169] rcu: 	(detected by 4, t=811345 jiffies, g=-575, q=83932 ncpus=8)
2025-01-09 07:03:57.041	
[ 3251.298733] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=162291
2025-01-09 07:03:57.040	
[ 3251.298286] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=162291
2025-01-09 07:03:57.040	
[ 3251.297890] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 07:01:00.018	
[ 3074.279247] rcu: 	(detected by 4, t=767090 jiffies, g=-575, q=80599 ncpus=8)
2025-01-09 07:01:00.018	
[ 3074.278776] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=153439
2025-01-09 07:01:00.017	
[ 3074.278280] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=153439
2025-01-09 07:01:00.017	
[ 3074.277874] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:58:02.995	
[ 2897.259445] rcu: 	(detected by 4, t=722835 jiffies, g=-575, q=77255 ncpus=8)
2025-01-09 06:58:02.995	
[ 2897.258856] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=144587
2025-01-09 06:58:02.994	
[ 2897.258364] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=144587
2025-01-09 06:58:02.994	
[ 2897.257864] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:55:05.972	
[ 2720.239271] rcu: 	(detected by 1, t=678580 jiffies, g=-575, q=73890 ncpus=8)
2025-01-09 06:55:05.971	
[ 2720.238797] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=135736
2025-01-09 06:55:05.971	
[ 2720.238247] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=135736
2025-01-09 06:55:05.970	
[ 2720.237856] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:52:08.948	
[ 2543.219142] rcu: 	(detected by 4, t=634325 jiffies, g=-575, q=70482 ncpus=8)
2025-01-09 06:52:08.948	
[ 2543.218682] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=126885
2025-01-09 06:52:08.947	
[ 2543.218237] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=126885
2025-01-09 06:52:08.947	
[ 2543.217846] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:49:11.925	
[ 2366.199117] rcu: 	(detected by 4, t=590070 jiffies, g=-575, q=67079 ncpus=8)
2025-01-09 06:49:11.925	
[ 2366.198659] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=118033
2025-01-09 06:49:11.924	
[ 2366.198209] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=118033
2025-01-09 06:49:11.924	
[ 2366.197832] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:46:14.902	
[ 2189.179119] rcu: 	(detected by 4, t=545815 jiffies, g=-575, q=63665 ncpus=8)
2025-01-09 06:46:14.901	
[ 2189.178679] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=109181
2025-01-09 06:46:14.901	
[ 2189.178217] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=109181
2025-01-09 06:46:14.901	
[ 2189.177822] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:43:17.879	
[ 2012.159423] rcu: 	(detected by 1, t=501560 jiffies, g=-575, q=59428 ncpus=8)
2025-01-09 06:43:17.878	
[ 2012.158904] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=100329
2025-01-09 06:43:17.878	
[ 2012.158328] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=100329
2025-01-09 06:43:17.877	
[ 2012.157814] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:40:20.857	
[ 1835.139211] rcu: 	(detected by 3, t=457305 jiffies, g=-575, q=54250 ncpus=8)
2025-01-09 06:40:20.857	
[ 1835.138771] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=91477
2025-01-09 06:40:20.857	
[ 1835.138188] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=91477
2025-01-09 06:40:20.854	
[ 1835.137805] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:37:23.832	
[ 1658.119074] rcu: 	(detected by 3, t=413050 jiffies, g=-575, q=49073 ncpus=8)
2025-01-09 06:37:23.832	
[ 1658.118638] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=82626
2025-01-09 06:37:23.831	
[ 1658.118209] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=82626
2025-01-09 06:37:23.831	
[ 1658.117802] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:34:26.809	
[ 1481.099165] rcu: 	(detected by 4, t=368795 jiffies, g=-575, q=43928 ncpus=8)
2025-01-09 06:34:26.809	
[ 1481.098674] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=73774
2025-01-09 06:34:26.808	
[ 1481.098160] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=73774
2025-01-09 06:34:26.808	
[ 1481.097775] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:31:29.786	
[ 1304.079089] rcu: 	(detected by 4, t=324540 jiffies, g=-575, q=38752 ncpus=8)
2025-01-09 06:31:29.785	
[ 1304.078647] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=64922
2025-01-09 06:31:29.785	
[ 1304.078189] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=64922
2025-01-09 06:31:29.784	
[ 1304.077768] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:28:32.763	
[ 1127.059269] rcu: 	(detected by 4, t=280285 jiffies, g=-575, q=33580 ncpus=8)
2025-01-09 06:28:32.762	
[ 1127.058727] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=56070
2025-01-09 06:28:32.762	
[ 1127.058215] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=56070
2025-01-09 06:28:32.761	
[ 1127.057764] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:25:35.739	
[  950.039179] rcu: 	(detected by 7, t=236030 jiffies, g=-575, q=28444 ncpus=8)
2025-01-09 06:25:35.739	
[  950.038692] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=47218
2025-01-09 06:25:35.738	
[  950.038161] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=47218
2025-01-09 06:25:35.738	
[  950.037755] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:22:38.717	
[  773.019538] rcu: 	(detected by 4, t=191775 jiffies, g=-575, q=23264 ncpus=8)
2025-01-09 06:22:38.716	
[  773.018914] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=38366
2025-01-09 06:22:38.715	
[  773.018235] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=38366
2025-01-09 06:22:38.715	
[  773.017743] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:16:44.666	
[  418.975137] rcu: 	(detected by 4, t=103264 jiffies, g=-575, q=12906 ncpus=8)
2025-01-09 06:16:44.665	
[  418.974632] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=20663
2025-01-09 06:16:44.665	
[  418.974121] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=20663
2025-01-09 06:16:44.664	
[  418.973721] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:13:47.642	
[  241.955173] rcu: 	(detected by 4, t=59009 jiffies, g=-575, q=7676 ncpus=8)
2025-01-09 06:13:47.642	
[  241.954726] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=11811
2025-01-09 06:13:47.641	
[  241.954204] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=11811
2025-01-09 06:13:47.641	
[  241.953734] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-09 06:10:50.612	
[   64.927899] rcu: 	(detected by 4, t=14752 jiffies, g=-575, q=2501 ncpus=8)
2025-01-09 06:10:50.612	
[   64.927330] rcu: 	5-...0: (11 GPs behind) idle=d8d4/0/0x1 softirq=250/250 fqs=2959
2025-01-09 06:10:50.612	
[   64.926778] rcu: 	2-...0: (5 ticks this GP) idle=1bf4/0/0x1 softirq=234/235 fqs=2959
2025-01-09 06:10:50.612	
[   64.925703] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

These logs start well before my application even bootstraps, as I don’t see any of its init logs. The machine is basically unresponsive but shows as started. Of course, the setTimeout I use to stop the machine after 5 minutes never fires, and the machine stays started until I manually stop it.
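As a rough cross-check of that timing, the t=... jiffies field can be compared against the kernel timestamps. The sketch below assumes t counts jiffies since the stalled grace period began (which is how the kernel's RCU stall-warning documentation describes it) and estimates the tick rate from two "detected by" lines copied from the log above. parseStallSamples and estimateStallStart are hypothetical helper names, not part of any library.

```javascript
// Matches the "detected by" line of an RCU stall report, capturing the kernel
// uptime in seconds and the t=<jiffies> value.
const DETECTED = /\[\s*(\d+\.\d+)\]\s+rcu:\s+\(detected by \d+, t=(\d+) jiffies/;

// Extract { uptimeSec, jiffies } pairs from raw log text.
function parseStallSamples(logText) {
  const samples = [];
  for (const line of logText.split("\n")) {
    const m = DETECTED.exec(line);
    if (m) samples.push({ uptimeSec: Number(m[1]), jiffies: Number(m[2]) });
  }
  return samples;
}

// Estimate the tick rate from the earliest and latest samples, then work
// backwards to the uptime at which the stalled grace period started.
// Assumption: t counts jiffies since the grace period began.
function estimateStallStart(samples) {
  if (samples.length < 2) throw new Error("need at least two samples");
  const sorted = [...samples].sort((a, b) => a.uptimeSec - b.uptimeSec);
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  const hz = (last.jiffies - first.jiffies) / (last.uptimeSec - first.uptimeSec);
  const startSec = first.uptimeSec - first.jiffies / hz;
  return { hz, startSec };
}

// Two lines taken from the log above (earliest and latest report).
const sample = [
  "[   64.927899] rcu: \t(detected by 4, t=14752 jiffies, g=-575, q=2501 ncpus=8)",
  "[ 6791.699466] rcu: \t(detected by 4, t=1696445 jiffies, g=-575, q=150333 ncpus=8)",
].join("\n");

const estimate = estimateStallStart(parseStallSamples(sample));
console.log(
  `~${Math.round(estimate.hz)} jiffies/s, grace period began ~${estimate.startSec.toFixed(1)}s after boot`
);
```

On those two lines this works out to roughly 250 jiffies per second and a grace period that began about 5.9 seconds after boot, which is consistent with the stall starting before any application init logs would appear.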

This machine (d8dd734a271228) was in this state for over 2 hours until I manually stopped it. If I weren’t manually checking the state of the machines, I’d get charged for hours and hours of unusable compute.
This other machine (d8dd7e6fe91e08) was in this state for almost 8 hours. That’s a lot of compute I’m getting billed for. This one even logged the “exiting after 5 minutes” message that I emit just before process.exit().

function exit() {
	logger.info(`Exiting after ${TIMEOUT_MS} miliseconds...`);
	process.exit();
}

But the process never exited and the machine kept outputting those weird logs for 8 hours.

2025-01-06 14:56:27.094	Virtual machine exited abruptly
2025-01-06 14:54:20.493	[25024.992387] rcu: 	(detected by 5, t=6254811 jiffies, g=261, q=388865 ncpus=8)
2025-01-06 14:54:20.493	[25024.991998] rcu: 	7-...0: (1 GPs behind) idle=7bd4/0/0x1 softirq=918/920 fqs=1250540
2025-01-06 14:54:20.493	[25024.991491] rcu: 	4-....: (5419251 ticks this GP) idle=b4cc/1/0x4000000000000000 softirq=875/5425008 fqs=1250540
2025-01-06 14:54:20.493	[25024.991056] rcu: 	0-...0: (1 GPs behind) idle=5b24/0/0x1 softirq=1123/1124 fqs=1250540
2025-01-06 14:54:20.493	[25024.990714] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
.
.
.
2025-01-06 08:07:11.094	[  595.833327] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-06 08:04:14.092	[  418.811666] rcu: 	(detected by 3, t=103263 jiffies, g=261, q=20843 ncpus=8)
2025-01-06 08:04:14.070	[  418.811125] rcu: 	7-...0: (1 GPs behind) idle=7bd4/0/0x1 softirq=918/920 fqs=20667
2025-01-06 08:04:14.070	[  418.810405] rcu: 	4-....: (97964 ticks this GP) idle=b4cc/1/0x4000000000000000 softirq=875/85447 fqs=20667
2025-01-06 08:04:14.069	[  418.809847] rcu: 	0-...0: (1 GPs behind) idle=5b24/0/0x1 softirq=1123/1124 fqs=20667
2025-01-06 08:04:14.068	[  418.809408] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
2025-01-06 08:02:17.221	{"level":"INFO","time":1736146937221,"pid":679,"hostname":"d8dd7e6fe91e08","name":"worker","msg":"Exiting after 30000 miliseconds..."}

I’m following the approach described here to shut down worker machines if they are not processing any job after a few seconds. But this doesn’t seem reliable: the machines themselves are responsible for exiting, and they cannot do so at the application level when they are in this stuck/frozen/hung state.

I guess there’s no machine configuration one could set to kill/stop machines after a certain period to avoid over-billing.

I’d like a refund for these unusable compute hours and a 100% reliable way to stop this from happening again.

Thank you

This is a weird one! First, we’ll always refund stuff you didn’t mean to spend. Just email billing@fly.io (or use your support email if you have paid support).

Our knee-jerk best guess at the moment is that something within the VM is consuming all the CPU with realtime priority and preventing anything else from running. I can force something kind of like this with a fork bomb. The setTimeout never fires because the event loop is waiting for CPU.

Your machines seem to write a huge amount of IO, like 20GB/s in aggregate. I think this could be related.

I don’t think we have any tooling that will help here. If the stuff in the Machine keeps running, we basically “trust” that it should be.

What I’d probably do is register an external watchdog. We have an example coordinator in a demo Bash functions-as-a-service project that manages stops from outside the Machines, which is actually necessary if you can’t trust the code: superfly/bfaas on GitHub (Bash functions-as-a-service).
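To make the watchdog idea concrete, here is a minimal sketch of a sweep that runs somewhere outside the worker Machines, lists them, and asks the API to stop any that have been started for too long. It is not the bfaas coordinator. The base URL and the list and stop endpoints are the Machines REST API as I understand it, and the state and created_at fields are assumptions about the response shape; findOverdue and sweep are hypothetical names.

```javascript
// Assumed Machines API base URL.
const API = "https://api.machines.dev/v1";

// Pure helper: pick machines that are still "started" and older than maxAgeMs.
// Assumes each machine object carries `id`, `state`, and an ISO `created_at`.
function findOverdue(machines, nowMs, maxAgeMs) {
  return machines.filter(
    (m) => m.state === "started" && nowMs - Date.parse(m.created_at) > maxAgeMs
  );
}

// One watchdog pass: list machines, then request a stop for each overdue one.
// fetchImpl and now are injectable so the logic can be exercised without the network.
async function sweep({ app, token, maxAgeMs, fetchImpl = fetch, now = Date.now }) {
  const headers = { Authorization: `Bearer ${token}` };
  const res = await fetchImpl(`${API}/apps/${app}/machines`, { headers });
  if (!res.ok) throw new Error(`list failed: ${res.status}`);
  const overdue = findOverdue(await res.json(), now(), maxAgeMs);
  for (const m of overdue) {
    const stop = await fetchImpl(`${API}/apps/${app}/machines/${m.id}/stop`, {
      method: "POST",
      headers,
    });
    if (!stop.ok) throw new Error(`stop ${m.id} failed: ${stop.status}`);
  }
  return overdue.map((m) => m.id);
}
```

Running sweep on a timer from a small always-on process (for example every 30 seconds with maxAgeMs set to the 5-minute limit) keeps the kill decision out of the worker, so a frozen guest cannot block it.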

A simple way to handle this might be to put a proxy between the user and the machine that does its own time-based cancellation and then sends a stop request through the API. If the stop doesn’t happen gracefully, we’ll kill it much more dramatically after the timeout.
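The time-based cancellation could look roughly like this on the proxy side. It is only a sketch: runWithDeadline is a hypothetical helper, work stands in for whatever forwards the job to the machine, and stopMachine stands in for the API stop request. The timer lives in the proxy process, so it still fires when the machine itself is frozen.

```javascript
// Race the forwarded job against a deadline owned by the proxy. If the
// deadline wins, ask for the machine to be stopped from the outside.
async function runWithDeadline(work, deadlineMs, stopMachine) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve({ timedOut: true }), deadlineMs);
  });
  try {
    const outcome = await Promise.race([
      work().then((value) => ({ timedOut: false, value })),
      timeout,
    ]);
    // stopMachine is a placeholder for the stop request sent through the API.
    if (outcome.timedOut) await stopMachine();
    return outcome;
  } finally {
    // Always clear the timer so a finished job does not keep the proxy waiting.
    clearTimeout(timer);
  }
}
```

A caller would pass something like runWithDeadline(() => forwardJob(job), 5 * 60 * 1000, () => stopWorker(machineId)), where forwardJob and stopWorker are whatever the proxy already uses to reach the machine and the API.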