I now suspect it isn’t a network problem but some sort of CPU locking/IO wait problem (possibly only on shared CPUs?)…
If I run top in batch mode, writing to disk:
top -b -d 1 > /root/top.txt
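For reference, each 1-second sample in the batch output begins with a header line of roughly this shape (placeholder values, not copied from my VM), and it's the HH:MM:SS part that the script below keys off:

top - HH:MM:SS up ...,  1 user,  load average: x.xx, x.xx, x.xx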
I’ve noticed that missing top data correlates with the periods of the problem:
#!/bin/bash
# Pull the HH:MM:SS timestamp (field 3) out of each "top - " header line.
mapfile -t logs_arr < <( grep "top - " /root/top.txt | cut -d " " -f 3 )
prev_log_seconds="00"
line_num=1
for log_line in "${logs_arr[@]}"
do
    # Seconds component of the timestamp; force base 10 so "08"/"09" don't break the arithmetic.
    current_log_seconds=$(echo "$log_line" | cut -d ":" -f 3)
    seconds_diff=$((10#$current_log_seconds - 10#$prev_log_seconds))
    prev_log_seconds=$current_log_seconds
    # Expected differences are 1 or 2 (top -d 1) or -59 (minute wrap); anything else is a gap.
    if [ "$line_num" -ne 1 ] && [ "$seconds_diff" -ne -59 ] && [ "$seconds_diff" -ne 1 ] && [ "$seconds_diff" -ne 2 ]; then
        echo "$line_num" "$seconds_diff" "$log_line"
    fi
    line_num=$((line_num + 1))
done
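As an alternative (a sketch I haven't run on the affected VM): converting each timestamp to epoch seconds with GNU date avoids the -59 minute-wrap special case and is less likely to miss a gap that straddles a minute boundary, though it still assumes the capture doesn't cross midnight:

#!/bin/bash
prev=""
line_num=1
while read -r t; do
    cur=$(date -d "$t" +%s)          # interpret HH:MM:SS as today's date, convert to epoch seconds
    if [ -n "$prev" ] && [ $((cur - prev)) -gt 2 ]; then
        echo "$line_num gap of $((cur - prev))s at $t"
    fi
    prev=$cur
    line_num=$((line_num + 1))
done < <(grep "top - " /root/top.txt | cut -d " " -f 3)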
I then ran the same thing, logging to a ramdisk instead:
top -b -d 1 > /dev/shm/top.txt
and this time no top data was missing during the problem periods, but I can now see 100.0 wa at those times:
grep -B 2 "100.0 wa" /dev/shm/top.txt
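Assuming the default single aggregate %Cpu(s) summary line per sample (i.e. no per-CPU toprc override), something along these lines should give a rough length of the worst bout, since samples are 1 second apart:

grep "Cpu(s)" /dev/shm/top.txt | awk '
    /100.0 wa/ { run++; if (run > max) max = run; next }   # consecutive all-iowait samples
    { run = 0 }
    END { print "longest 100.0 wa run:", max + 0, "samples (~seconds at -d 1)" }
'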
The VM I'm debugging on 81b8 is one that's not doing much (if anything). Per Understanding FIRECRACKER LOAD AVERAGE (Part Deux) - shared CPU, I'm not sure how far the metrics top reports can be trusted to mean the same thing in a Firecracker shared-CPU environment.
Since this correlates with my observed problems, and only on 81b8, it would be good if someone could run the tests above on their shared-CPU VM(s) on LHR host 81b8.
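For anyone willing to try, something along these lines (the same commands as above, wrapped in timeout so it stops itself after five minutes) should be enough to capture and check a sample:

timeout 300 top -b -d 1 > /dev/shm/top.txt    # 5 minutes of 1-second samples, logged to ramdisk
grep -c "100.0 wa" /dev/shm/top.txt           # how many samples hit 100% iowait
grep -B 2 "100.0 wa" /dev/shm/top.txt         # context around each one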
Unhelpfully, it doesn't appear as bad this morning as it was last night, when I think I had at least one 14-second bout.