Possible network problem (periodic spiking latency) in LHR/on host 81b8?

I now suspect it isn’t a network problem but some sort of CPU locking/IO wait problem (possibly only on shared CPUs?)…

If I run (i.e. logging to disk):

top -b -d 1 > /root/top.txt

I’ve noticed that gaps in the top data correlate with the problem periods, which the following script flags:

#!/bin/bash
# Flag gaps in the 1-second top samples: print any sample whose timestamp
# isn't 1-2 seconds after the previous one (-59 covers the minute rollover).
mapfile -t logs_arr < <( grep "top - " /root/top.txt | cut -d " " -f 3 )
prev_log_seconds="00"
line_num=1
for log_line in "${logs_arr[@]}"
do
    # seconds field of the HH:MM:SS timestamp
    current_log_seconds=$(echo "$log_line" | cut -d ":" -f 3)
    # 10# forces base 10 so values like "08" aren't treated as octal
    seconds_diff=$((10#$current_log_seconds - 10#$prev_log_seconds))
    prev_log_seconds=$current_log_seconds
    if [ "$line_num" -ne 1 ] && [ "$seconds_diff" -ne -59 ] && [ "$seconds_diff" -ne 1 ] && [ "$seconds_diff" -ne 2 ]; then
        echo "$line_num $seconds_diff $log_line"
    fi
    line_num=$((line_num+1))
done
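
If it's useful, here's a rough alternative (just a sketch of the same idea) that converts each timestamp to seconds-of-day, so the minute/hour rollover doesn't need special-casing:

# Same idea as the script above: print any gap of more than 2 seconds
# between consecutive "top - HH:MM:SS" samples.
grep "top - " /root/top.txt | cut -d " " -f 3 | awk -F: '
  { t = $1 * 3600 + $2 * 60 + $3 }
  NR > 1 {
    gap = t - prev
    if (gap < 0) gap += 86400          # midnight rollover
    if (gap > 2) print "line " NR ": " gap "s gap ending at " $0
  }
  { prev = t }'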

I then ran the same capture, but logging to a ramdisk instead:

top -b -d 1 > /dev/shm/top.txt

and this time no top data was missing for the problem periods, but I can now see 100.0 wa during them:

grep -B 2 "100.0 wa" /dev/shm/top.txt
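
To get a feel for how long each stall lasts (the samples are roughly 1 second apart with -d 1), something like this sketch counts runs of consecutive 100% iowait samples:

# Pull out just the CPU summary lines, then count consecutive samples
# showing 100.0 wa; each run length is roughly the stall duration in seconds.
grep "Cpu(s)" /dev/shm/top.txt | awk '
  /100.0 wa/ { run++; next }
  run        { print "~" run "s stall"; run = 0 }
  END        { if (run) print "~" run "s stall" }'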

The 81b8 VM I’m debugging on is the one that isn’t doing much (if anything). Per Understanding FIRECRACKER LOAD AVERAGE (Part Deux) - shared CPU, I’m not sure how the metrics top provides can be expected to differ in a Firecracker environment.

Since this correlates with my observed problems, and only on 81b8, it would be good if someone could run the tests above on their shared CPU VM(s) on LHR host 81b8.
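
For anyone willing to check, something along these lines should be enough (a sketch; the 10-minute window is arbitrary):

# Capture ~10 minutes of 1-second top samples to a ramdisk, then count
# 100% iowait samples and show the timestamps they happened at.
timeout 600 top -b -d 1 > /dev/shm/top.txt
grep -c "100.0 wa" /dev/shm/top.txt
grep -B 2 "100.0 wa" /dev/shm/top.txt | grep "top - "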

Unhelpfully, it doesn’t appear as bad this morning as it did last night, when I think I had at least one 14-second bout.
