We’ve been experiencing network issues in the LHR region over the past couple of weeks. Has anyone else encountered similar trouble?
Our situation involves unexpected CPU spikes on a product with modest load, primarily used by our internal team. These spikes are quite irregular and significantly deviate from our expected usage patterns.
During these periods, both nodes struggle with socket operations. They fail to read/write, leading to dropped existing connections (e.g., to the DB) and timeout errors on outbound requests (like ConfigCat cache updates).
Any insights or shared experiences would be greatly appreciated!
My shared experience is probably not relevant, but…
I have an app in LHR that has seen some odd CPU usage since upgrading to a V2 (machines-based) app. Every now and again it pegs the CPU at 100% until I restart the machine. It ran for ages as a V1 app without doing this.
It may well just be a problem in my code, but it’s odd that it never used to do this and I haven’t changed the code in ages.
If (TBC) you’re on an LHR host that’s having resource problems, the resulting unexpected behaviour (IO errors, for example) could be triggering an issue in your app, making it have a funny turn and peg the CPU. If so, you’re going to keep hitting the problem on that host until the underlying issue is resolved.
AFAIK one key difference between V1 and V2 is that with V2/machines, any given machine is tied to a specific host. With V1, a redeploy could land you on another host (“fixing” host-specific problems); with V2/machines that isn’t the case.
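If it does turn out to be a flaky host, one possible workaround (a rough sketch, assuming flyctl’s machine commands behave as I remember, and noting that a clone isn’t guaranteed to land on a different host) is to clone the affected machine and then retire the original:

```
# Find the ID of the machine that's pegging the CPU ("your-app" is a placeholder for your app name)
fly machine list -a your-app

# Clone it; the new machine may be placed on a different physical host
fly machine clone <machine-id> -a your-app

# Once the clone is healthy, stop and destroy the original on the suspect host
fly machine stop <machine-id> -a your-app
fly machine destroy <machine-id> -a your-app
```

You can then compare the host the new machine landed on (e.g. via `fly status`) to see whether you actually moved.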
I’ve had a few host-specific issues in the past; here are some threads that may be of interest:
I’ve got a single app in LHR and, since yesterday, deployments via GitHub Actions have been failing with “Error: failed to update VM xxxxxxxxxxxxxx: aborted: could not reserve resource for machine: insufficient memory available to fulfill request”.