Possible network problem (periodic spiking latency) in LHR/on host 81b8?

I have automated TCP health-checks that are regularly failing (and quickly recovering) at the same times, between VMs on the following host pairs:

79f0 to 81b8
5423 to 81b8
d88f to 81b8

ICMP ping6 tests (at least d88f to 81b8 and 5423 to 81b8; I've not yet tried other cross-VM combinations) also show a large increase in response times (e.g. 6484 ms) at the same times as the TCP health-check failures.

Though I've yet to gather a definitive body of evidence, is there a possible network problem (periodic spiking latency) either on a specific LHR host (81b8), or in LHR in general?

This appears to occur for a few to ten seconds at a time, every 15-20 minutes or so (sometimes longer).

After further investigation, it appears I have another app on host 81b8. Though it isn't being health-checked (it doesn't require Fly private network/internal comms), it also appears to suffer regular latency spikes across the internal network.

In summary: for ping6 from various LHR app VMs (single VM per app) to various other LHR apps (single VM per app), only pings destined for apps hosted on 81b8 (two of them) display this issue.

As far as I can tell the problem packets are delayed in arriving at 81b8, in some cases up to 8 seconds after being sent (stuck in a buffer? - 81b8 RX/Fly Internal Wireguard/etc?).

If possible, please could someone from Fly undertake the following:

Spin up some app VMs (single VM per app), at least a couple on LHR host 81b8, then on a non-81b8 app VM run (in the background/parallel):

ping6 <dst-app-name1>.internal -D > ping.<src-app-name>-<dst-app-name1>.txt
ping6 <dst-app-name2>.internal -D > ping.<src-app-name>-<dst-app-name2>.txt
ping6 <dst-app-name3>.internal -D > ping.<src-app-name>-<dst-app-name3>.txt
ping6 <dst-app-name4>.internal -D > ping.<src-app-name>-<dst-app-name4>.txt
ping6 <dst-app-name5>.internal -D > ping.<src-app-name>-<dst-app-name5>.txt

Then also run:

watch 'egrep "time=[0-9]{4} ms" ping.<src-app-name>-*.txt | wc -l'

I suspect any increase in the count (possibly a few times an hour) will be isolated to only those destination apps that are located on LHR host 81b8.

I appreciate that ICMP isn't the best way to test, but it is representative of when the actual problem (TCP issues) occurs, and does seem to reflect the (as far as I can tell) host-specific nature of the problem.
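For a more TCP-like measurement, something along these lines could be run alongside the pings. This is only a rough sketch: it assumes the destination app listens for HTTP on port 8080 over the private network, so adjust the hostname/port to whatever your app actually exposes.

# Hypothetical TCP connect-time probe (port 8080 is an assumption - use your app's real port).
while true
do
    t=$(curl -6 -s -o /dev/null -m 15 -w '%{time_connect}' "http://<dst-app-name>.internal:8080/")
    # Log only connects that took longer than 1 second to establish.
    if awk -v t="$t" 'BEGIN { exit !(t > 1) }'; then
        echo "$(date -u) tcp connect to <dst-app-name> took ${t}s"
    fi
    sleep 1
done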

If (TBC) packets are periodically going walkabout for seconds at a time within Fly's network, this could cause all sorts of hard-to-diagnose issues (slow responses/disconnects/etc.) for people's Fly apps (possibly just on LHR 81b8) that are reliant on the Fly private network (e.g. PostgreSQL).

If anyone suspects they may have this problem in future, or in other regions or on other hosts, the following Bash script (which could no doubt be improved) may help confirm whether this is the case:

#!/bin/bash
# Fly Private Network ICMP Latency Checker v0.1
start_timestamp=$(date +"%s")

# Discover all apps in the organisation via the internal DNS TXT record.
apps_dig=$(dig +short txt _apps.internal @fdaa::3 | cut -d '"' -f 2)
apps_arr=($(echo $apps_dig | tr "," "\n"))

# Start a background ping6 to every same-region IPv6 address of every app,
# logging only replies that took 1000-99999 ms (i.e. more than 1 second).
for dst_app_name in "${apps_arr[@]}"
do
    app_ip_arr=($(dig +short aaaa $FLY_REGION.$dst_app_name.internal @fdaa::3))
    for app_ip in "${app_ip_arr[@]}"
    do
        ping6 $app_ip -D | stdbuf -o0 egrep "time=[0-9]{4,5} ms" >> ~/flypingtest.$FLY_REGION.$FLY_APP_NAME.$HOSTNAME.$dst_app_name.log &
    done
done

# Show a per-log-file line count every 60 seconds until interrupted (Ctrl-C).
watch --interval 60 "wc -l ~/flypingtest.*"
killall ping6

# Summarise: for each slow reply logged since the script started, work out
# (to the nearest second) when the original request was sent.
mapfile -t logs_arr < <( egrep --with-filename time ~/flypingtest.* )
for log_line in "${logs_arr[@]}"
do
    log_timestamp=$(echo $log_line | egrep -o "[0-9]{10}" | head -n 1)
    if [ -n "$log_timestamp" ] && [ $log_timestamp -gt $start_timestamp ]; then
        response_delay_time=$(echo $log_line | egrep -o "time=[0-9]{4,5}")
        response_delay_ms=$(echo $response_delay_time | cut -d "=" -f 2)
        response_delay_s=$((response_delay_ms / 1000))
        request_sent_timestamp=$((log_timestamp-response_delay_s))
        request_sent_date=$(date -d @$request_sent_timestamp)
        echo $request_sent_date $log_line
    fi
done

Note: it requires that the Fly VM within which it is running has Bash/dig/etc installed.

The script will ping6 all of your apps' IPv6 addresses within the same region as the VM you're running it from. It only logs pings that take more than 1 second to receive a response, which I don't think is an unreasonably low bar for alerting on LAN-like (same-region Fly private networking) latency.

Once running, the script shows you a line count for each log file; you can then stop it with Ctrl-C and it'll display the relevant log entries that have occurred since the script was started. I'd leave it running for an hour or so to get an idea of how often the problem occurs, and which apps' VMs may be impacted. Running the script from multiple app VMs (i.e. multiple source VMs, hopefully on multiple Fly hosts) in parallel may also help.
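To run it, something like this works (assuming you've saved the script as /root/flypingtest.sh inside the VM; the app name is a placeholder):

fly ssh console -a <src-app-name>   # from your workstation
bash /root/flypingtest.sh           # inside the VM; Ctrl-C the watch when done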

I'm currently seeing response times of up to 8 seconds for 2 app VMs on LHR host 81b8, occurring ~10-20 times an hour.


We are looking into this, but for now we don’t have any solid theories.

It does look like it’s network-related (of course).

I have been running a ping6 over the private network (fdaa) between 2 apps, one on 81b8 and one on another london host.

After almost 2 hours the worst I've seen is 16 ms (which isn't great, but it's not nearly as bad as 8 seconds).

If you’re running this test continuously, have you observed issues in between 19:00 and 21:00 UTC?

Thank you for the reproduction steps btw! We appreciate users troubleshooting :slight_smile:

Though less than earlier today, yes - I still observed some spiking between those times.

Some multi-second examples (albeit I accept they’re outside of the specified window):

SRC VM1 > DST VM1 on 81b8:
Sat Nov 12 21:29:03 UTC 2022 .... icmp_seq=18747 ttl=62 time=5025 ms
Sat Nov 12 21:29:04 UTC 2022 .... icmp_seq=18748 ttl=62 time=4010 ms
Sat Nov 12 21:29:06 UTC 2022 .... icmp_seq=18749 ttl=62 time=2986 ms
Sat Nov 12 21:29:07 UTC 2022 .... icmp_seq=18750 ttl=62 time=1962 ms

SRC VM1 > DST VM2 on 81b8:
Sat Nov 12 20:59:11 UTC 2022 .... icmp_seq=16974 ttl=62 time=5771 ms
Sat Nov 12 20:59:12 UTC 2022 .... icmp_seq=16975 ttl=62 time=4749 ms
Sat Nov 12 20:59:13 UTC 2022 .... icmp_seq=16976 ttl=62 time=3727 ms
Sat Nov 12 20:59:14 UTC 2022 .... icmp_seq=16977 ttl=62 time=2703 ms
Sat Nov 12 20:59:15 UTC 2022 .... icmp_seq=16978 ttl=62 time=1680 ms

SRC VM2 > DST VM1 on 81b8:
Sat Nov 12 21:29:03 UTC 2022 .... icmp_seq=18762 ttl=62 time=5403 ms
Sat Nov 12 21:29:04 UTC 2022 .... icmp_seq=18763 ttl=62 time=4379 ms
Sat Nov 12 21:29:05 UTC 2022 .... icmp_seq=18764 ttl=62 time=3354 ms
Sat Nov 12 21:29:06 UTC 2022 .... icmp_seq=18765 ttl=62 time=2330 ms
Sat Nov 12 21:29:07 UTC 2022 .... icmp_seq=18766 ttl=62 time=1306 ms

SRC VM2 > DST VM2 on 81b8:
Sat Nov 12 20:59:11 UTC 2022 .... icmp_seq=16990 ttl=62 time=5338 ms
Sat Nov 12 20:59:12 UTC 2022 .... icmp_seq=16991 ttl=62 time=4325 ms
Sat Nov 12 20:59:13 UTC 2022 .... icmp_seq=16992 ttl=62 time=3301 ms
Sat Nov 12 20:59:14 UTC 2022 .... icmp_seq=16993 ttl=62 time=2277 ms
Sat Nov 12 20:59:15 UTC 2022 .... icmp_seq=16994 ttl=62 time=1253 ms

Points of note:

  1. Despite (AFAIK) being on the same 81b8 host, the problem doesn't appear to occur on both destination VMs (two apps, single VM per app) at the same time.
  2. It does, however, impact the same destination VM (on 81b8) at the same time from two different source VMs.
  3. Per the logs above, the response times decrease as time elapses (per the script's output, the dates shown are when the initial request was sent). It's like a buffered window: all the packets arriving at 81b8 are held in a buffer for the duration of the window (e.g. 5 seconds) before hitting the 81b8 VM. (See the sketch after this list.)
  4. My most recent/prolonged tests have been run from two source VMs to six destination VMs. Only the destination VMs on host 81b8 show this issue.
  5. Not sure if it makes a difference, but my apps are shared-CPU (mentioned in case it's a noisy-neighbour type problem, or in case the apps you're testing against on 81b8 are dedicated-CPU).
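To check point 3 at per-packet precision, a rough sketch like the following (assuming the log lines are standard ping -D output, i.e. [<epoch>.<fraction>] ... time=<ms> ms, and the filename is a placeholder) derives the approximate send time of each reply. For the delayed replies, the derived send times should cluster into one narrow window:

awk 'match($0, /time=[0-9]+/) {
    rtt_ms = substr($0, RSTART + 5, RLENGTH - 5)     # RTT in milliseconds
    recv = substr($1, 2, length($1) - 2)             # strip the [ ] around the receive epoch
    printf "sent~%.3f recv=%s  %s\n", recv - rtt_ms / 1000, recv, $0
}' ping.<src-app-name>-<dst-app-name>.txt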

I’ll run some more expansive tests/debugging tomorrow.

This morning I've been looking at traffic in both directions to/from an 81b8 VM (specifically one that's doing very little). Unlike my initial RX-only theory, it appears that all (TX and RX) traffic disappears (to/from the VM in question) into network no-man's-land before being released at the end of the problem period (which ranges from 2 seconds to 10 seconds). The VM's dmesg doesn't indicate anything during these times.

Timestamps and ICMP sequence numbers, checked on both the sender and receiver (for inter-VM ping6), support this.
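For anyone wanting to make the same comparison, a minimal capture on each VM along these lines (an assumption that tcpdump is installed and that the private-network interface is eth0 - check with ip -6 addr) records kernel timestamps that can be lined up by icmp_seq afterwards:

# Capture ICMPv6 traffic with line-buffered, human-readable timestamps on each VM.
tcpdump -l -ni eth0 -tttt icmp6 > /dev/shm/icmp6.$HOSTNAME.txt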

I've also noticed that this impacts all traffic (or at least all of my ping tests), not just private networking:

  1. Private networking inter-VM ipv6 ping6
  2. Internet ipv6 ping6
  3. Internet ipv4 ping

With the matching sequence numbers I'm fairly certain the traffic isn't being dropped, as it is eventually received (either at the 81b8 VM or from the 81b8 VM), but only after the problem period.

Given it appears to impact all traffic, the following should be far easier than the script above (and doesn’t require two VMs):

echo $FLY_PUBLIC_IP | cut -d ":" -f 4

If the output of the above command is 81b8, then run:

ping google.com -D | stdbuf -o0 egrep "time=[0-9]{4,5} ms"

Only replies that took longer than 1 second will be shown. If you see many of these over an hour or so, you may have the issue.
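Or, combining the two steps into one convenience check (same commands as above, just wrapped in an if):

if [ "$(echo $FLY_PUBLIC_IP | cut -d ':' -f 4)" = "81b8" ]; then
    ping google.com -D | stdbuf -o0 egrep "time=[0-9]{4,5} ms"
else
    echo "This VM is not on host 81b8"
fi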


I now suspect it isn’t a network problem but some sort of CPU locking/IO wait problem (possibly only on shared CPUs?)…

If I run (i.e. to disk):

top -b -d 1 > /root/top.txt

I've noticed that missing top samples correlate with the problem periods; the following script flags gaps between consecutive top headers:

#!/bin/bash
# Flag gaps between consecutive "top - HH:MM:SS" header lines in a batch-mode top log.
mapfile -t logs_arr < <( grep "top - " /root/top.txt | cut -d " " -f 3 )
prev_log_seconds="00"
line_num=1
for log_line in "${logs_arr[@]}"
do
    # Seconds field of the HH:MM:SS timestamp (base 10 to avoid octal issues with leading zeros).
    current_log_seconds=$(echo $log_line | cut -d ":" -f 3)
    seconds_diff=$((10#$current_log_seconds-10#$prev_log_seconds))
    prev_log_seconds=$current_log_seconds
    # With a 1-second interval, consecutive samples should differ by 1, 2, or -59
    # (minute wrap); anything else means top missed one or more samples.
    if [ $line_num != 1 ] && [ $seconds_diff != -59 ] && [ $seconds_diff != 1 ] && [ $seconds_diff != 2 ]; then
        echo $line_num $seconds_diff $log_line
    fi
    line_num=$((line_num+1))
done

I then ran (logging to a ramdisk, /dev/shm, instead):

top -b -d 1 > /dev/shm/top.txt

and didn't miss any top data for the problem periods, but I can now see 100.0 wa (100% I/O wait) during those times:

grep -B 2 "100.0 wa" /dev/shm/top.txt

The 81b8 VM I'm debugging on is the one that's not doing much (if anything). Per Understanding FIRECRACKER LOAD AVERAGE (Part Deux) - shared CPU, I'm not sure how the metrics top provides can be expected to differ in a Firecracker environment.

Given that this correlates with my observed problems, and only on 81b8, it would be good if someone could perform the tests above on their shared-CPU VM(s) on LHR host 81b8.
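If keeping a full top log is overkill, a rough alternative is to sample /proc/stat directly at 1-second resolution (a sketch only - inside a Firecracker guest the iowait figure is at best an approximation of what the host is doing, and the 50% threshold is arbitrary) and print just the seconds where I/O wait dominates:

#!/bin/bash
# Rough 1-second iowait sampler - flags intervals where iowait dominates.
prev_iowait=0
prev_total=0
while true
do
    # /proc/stat "cpu" line: user nice system idle iowait irq softirq steal ...
    read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
    total=$((user + nice + system + idle + iowait + irq + softirq + steal))
    diff_total=$((total - prev_total))
    diff_iowait=$((iowait - prev_iowait))
    if [ $prev_total -ne 0 ] && [ $diff_total -gt 0 ]; then
        pct=$((100 * diff_iowait / diff_total))
        # Flag samples where more than half the interval was spent in I/O wait (arbitrary cut-off).
        if [ $pct -ge 50 ]; then
            echo "$(date -u) iowait ${pct}%"
        fi
    fi
    prev_iowait=$iowait
    prev_total=$total
    sleep 1
done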

Unhelpfully, it doesn't appear as bad this morning as last night, where I think I had at least one 14-second bout.


Good troubleshooting.

Looking at I/O on that host, it might be problematic. Trying to figure out what’s using so much.

I’m not sure why it would affect your VM, but not the one I was using to test this. Perhaps the CPU assignment has something to do with it.

Looks like high I/O was caused by our state replication mechanisms.

It just so happens we’ve been working for months on a new system for this and it’s ready to try on a larger scale.

I just replaced the old software with the new one. It made a big difference in I/O usage on that host.

Hopefully your issues will be gone. Let us know if that’s not the case. We will be rolling this new system out to more hosts this week.


I’ve not seen any health-check failures since ~1AM UTC - so this looks to have fixed it, thanks.

If it's not done already, would it be possible for Fly to monitor, and alert Fly staff on, high host I/O?

I appreciate that in this sort of environment I/O may normally fluctuate, but if (TBC) the I/O seen on the host was outside the norm (in range and duration), enough to warrant investigation, it could have been dealt with proactively.

It's entirely possible that alerting on this specific situation may be difficult (I think the longest period I'd observed was ~14 seconds*) due to any peaks in I/O being averaged down by monitoring tools' sample periods - e.g. 14 seconds at 100% I/O wait averaged over a 5-minute sample works out to under 5%. If however there is some sort of telling indicator, other than forum posts, it would be good if Fly could add it to their internal alerting.

*though "only" seconds in duration - in a real-time environment, with a high number of users, the detrimental impact on users' experience could have been rather frustrating (and quite possibly difficult to diagnose).


Awesome. I’m glad that fixed it.

We did that after this whole ordeal. I’ve replaced the sub-optimal software stack on any server with far-too-high I/O utilization.

Yes, we should've caught that proactively. We have various alerts set up for other resources, but disk I/O wasn't monitored closely. I had recently brought up that it might've been causing issues on some of our hosts, but we didn't have solid evidence that it was.


@jerome - this reoccurred last night (~01:37 UTC), again limited to just the 2 VMs on LHR 81b8. Though having the problem appear once in a blue moon isn't the end of the world, the fact that it occurs, and as far as I can tell only on 81b8, may suggest a localised problem on that host.

The earlier state-replication version may not necessarily have been the root cause (albeit a large contributory factor); possibly it was exacerbating an underlying (hardware/disk/noisy neighbour?) issue on 81b8.

I can see a large disk I/O spike around that time.

It is caused by the new state replication system :slight_smile: It doesn’t happen as often or in a sustained manner, but there are spikes.

We’re working on it.

Not sure why this particular host is more affected than others :thinking:. Maybe it has different hardware for its drives. Every other host also suffered a spike in I/O due to a flurry of updates to our state, caused by a bug in a system writing to the state.

I might add some rate limiting to make sure we're not overloading the disks.