L40S GPU machines crash mid-validation with "Possible seccomp violation" (yolo11m only)

Hi,

We’re running PyTorch training jobs on L40S GPU machines (ord region) and hitting a consistent crash during the validation step of epoch 1:

==== Possible seccomp violation ====

Virtual machine exited abruptly

What we’ve ruled out:

  • Not memory — happens at both 24GB and 32GB RAM

  • Not batch size — tried 8, 16, 32

  • Not dataset — happens on two different datasets

  • Model-specific: only yolo11m crashes. yolo11n, yolo11s, yolo11l, yolo11x all run fine on the same L40S machines with identical config

  • Crash always happens mid-validation (~batch 79/157), never during training

  • Moving the same job to an A100 fixes it, which points to something L40S-specific

Setup: PyTorch 2.5.1, CUDA 12.4, Ultralytics YOLO, region ord

Has anyone seen this or found a workaround?

I don’t have GPU experience, but from your description, I’d say this is not Fly-specific, and could/should be reported to PyTorch or one of the downstream libraries.
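
If you do file it upstream, maintainers will usually ask for exact version and device details first. Here is a minimal sketch for collecting them from inside the machine. The `collect_env` helper name and the choice of fields are my own, not anything Fly or Ultralytics provides, and the `torch` import is guarded so the script still runs where PyTorch isn't installed:

```python
import json
import platform
import sys


def collect_env():
    """Gather version details worth pasting into an upstream bug report."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch  # optional: only reported when installed
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU builds)
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
        info["cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    print(json.dumps(collect_env(), indent=2))
```

Running it once on the L40S machine and once on the A100 would show whether anything besides the GPU differs between the two. PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report if you'd rather use that.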