Hi,
We’re running PyTorch training jobs on L40S GPU machines (ord region) and hitting a consistent crash during the validation step of epoch 1:
==== Possible seccomp violation ====
Virtual machine exited abruptly
What we’ve ruled out:
-
Not memory — happens at both 24GB and 32GB RAM
-
Not batch size — tried 8, 16, 32
-
Not dataset — happens on two different datasets
-
Model-specific: only yolo11m crashes. yolo11n, yolo11s, yolo11l, yolo11x all run fine on the same L40S machines with identical config
-
Crash always happens mid-validation (~batch 79/157), never during training
-
Moving to A100 fixes it — confirms this is L40S-specific
Setup: PyTorch 2.5.1, CUDA 12.4, Ultralytics YOLO, region ord
Has anyone seen this or found a workaround?