“Hello, hello, anybody home? Hey, think Mc^H^HFly(.io), think!”… please could it be confirmed if the LHR WireGuard gateways are operating as expected post-Saturday-power-failure?
Please could someone from Fly.io expand on what Fly has done to prevent this occurring at the next unexpected DC power-down, and/or to mitigate the root cause in the other WireGuard gateway DCs?
Is there a reason why the customer needed to generate new keys to fix a Fly.io infrastructure problem? Could it not have been resolved by Fly.io fixing the previous gateway’s misconfiguration (not accepting new peers)?
Hi @Whistler ! I’m digging into the lhr WireGuard issue. Will update here as I know more about the root cause.
At the time of writing, creating new tunnels in lhr is working as expected. May I ask whether the issue you are having is with connecting to an existing tunnel or with creating new ones? If it is with an existing tunnel, would it be possible to get the WireGuard peer name?
Hi @Whistler ! To close the loop on the cause of this issue: we recently changed the way WireGuard peers are added to the kernel, and there was an edge case we hadn’t hit until the power outage in lhr. This bug prevented the JIT functionality from working as intended: it incorrectly reported peers as already present in the system, so they were not added to the kernel as they should have been.
This bug has been fixed for all gateways in every region and we are actively working on increasing reliability.
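For anyone curious what this class of bug can look like, here is a minimal sketch. It is not Fly.io's actual gateway code: `FakeKernel`, `Gateway`, the `seen` record and both `install_*` methods are hypothetical names, and the idea that the "already present" answer came from a record that outlived the kernel's peer table is an assumption made purely for illustration.

```python
class FakeKernel:
    """Hypothetical stand-in for the kernel's WireGuard peer table."""

    def __init__(self):
        self.peers = set()

    def add_peer(self, pubkey):
        self.peers.add(pubkey)

    def reset(self):
        # Simulates a power loss: the in-kernel peer table is gone after reboot.
        self.peers.clear()


class Gateway:
    """Hypothetical gateway that installs peers on demand (JIT)."""

    def __init__(self, kernel):
        self.kernel = kernel
        # Assumed for illustration: a record of installed peers that
        # survives the reboot, unlike the kernel's own table.
        self.seen = set()

    def install_buggy(self, pubkey):
        # Trusts the record, so a peer can be reported as "already present"
        # even though the kernel no longer knows about it.
        if pubkey in self.seen:
            return False
        self.kernel.add_peer(pubkey)
        self.seen.add(pubkey)
        return True

    def install_fixed(self, pubkey):
        # Asks the kernel itself whether the peer is present.
        if pubkey in self.kernel.peers:
            return False
        self.kernel.add_peer(pubkey)
        self.seen.add(pubkey)
        return True


def demo():
    kernel = FakeKernel()
    gw = Gateway(kernel)
    gw.install_buggy("peer-A")   # normal operation before the outage
    kernel.reset()               # power failure wipes the kernel state

    buggy_added = gw.install_buggy("peer-A")
    buggy_in_kernel = "peer-A" in kernel.peers

    fixed_added = gw.install_fixed("peer-A")
    fixed_in_kernel = "peer-A" in kernel.peers
    return buggy_added, buggy_in_kernel, fixed_added, fixed_in_kernel
```

In this sketch the buggy path skips the existing peer after the reset, while a peer with a fresh key would not be in the stale record and would be installed. That would be consistent with regenerated keys working as a workaround, though it is an inference and not something confirmed in this thread.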