LHR WireGuard issues after yesterday's DC power failure?

Is anyone else having LHR WireGuard connectivity issues (no connectivity at all) since yesterday’s LHR DC power failure?

The status page regarding this issue mentions:

Resolved - This incident has been resolved.
Feb 3, 10:02 CST
Monitoring - All services are back online. We will continue to monitor.
Feb 3, 09:04 CST

However, I’m not sure whether Fly’s LHR WireGuard gateways are back online and working correctly.

“Hello, hello, anybody home? Hey, think, Mc^H^HFly(.io), think!”… Please could it be confirmed whether the LHR WireGuard gateways are operating as expected after Saturday’s power failure?

My LHR tunnel is fine, and was all day yesterday. Have you tried emailing support?

Ours has been down since the 2nd too. Can’t ping any endpoints over it. I’ve restarted them. Must be internal. We’re in LHR too.

I’ve emailed support about this.

Thanks both for confirming; so there may be a problem but possibly not impacting all LHR tunnels, peers and/or Orgs.

@suretec - please update the thread if/when support provide any feedback.

Will do. I can’t resolve DNS over the tunnel or ping each end, so definitely a network issue.
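For anyone else trying to narrow this down from the client side, here is a rough sketch of one extra check: whether the tunnel has completed a recent handshake at all. It assumes the standard wg tool is installed and relies on `wg show <interface> latest-handshakes`, which prints one public key and one Unix timestamp per peer (0 meaning no handshake yet). The function name stale_peers, the 180-second threshold and the interface name in the usage line are my own assumptions, not anything from Fly.

```shell
# Sketch only: flags peers whose last WireGuard handshake is missing or old.
# Reads `wg show <interface> latest-handshakes` output on stdin, i.e. lines of
# "<public-key><TAB><unix-timestamp>", where a timestamp of 0 means "never".
#
#   $1 = current time as a Unix timestamp
#   $2 = maximum acceptable handshake age in seconds (assumed default: 180)
stale_peers() {
    now="$1"
    max_age="${2:-180}"
    awk -v now="$now" -v max="$max_age" '
        $2 == 0        { print $1, "never"; next }   # no handshake has ever completed
        now - $2 > max { print $1, "stale" }         # last handshake is too old
    '
}
```

Something like `sudo wg show fly0 latest-handshakes | stale_peers "$(date +%s)"` would then list the problem peers, where fly0 stands in for whatever your interface is called. Handshakes only refresh while traffic is flowing, so try a ping over the tunnel first. No output means every peer has a recent handshake, which would point away from the gateway and towards DNS or routing.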

@Whistler Support have got us working again. We had to remove the link and re-create it:

fly wireguard list
fly wireguard remove our-office
fly wireguard list
fly wireguard create our-org lhr our-office
fly wireguard list
fly console

After that, ping our-office._peer.internal works within fly console (after installing ping on our container) and ping our-app.internal works from our office.
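If you have several peers to redo, the sequence above can be wrapped up roughly as follows. This is only a sketch that replays the same fly wireguard commands we ran; the helper names run and recreate_peer and the DRY_RUN switch are made up for illustration. It defaults to printing the commands instead of running them, because removing a peer discards its keys and the re-created peer needs its new config imported into your WireGuard client.

```shell
#!/bin/sh
# Sketch only: replays the remove/re-create sequence from this thread.
# DRY_RUN defaults to 1, so commands are printed rather than executed.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

recreate_peer() {
    org="$1"; region="$2"; peer="$3"
    if [ -z "$org" ] || [ -z "$region" ] || [ -z "$peer" ]; then
        echo "usage: recreate_peer ORG REGION PEER" >&2
        return 1
    fi
    run fly wireguard list || return 1                             # check the old peer is listed
    run fly wireguard remove "$peer" || return 1                   # drop the peer on the old gateway
    run fly wireguard create "$org" "$region" "$peer" || return 1  # new keys, new gateway assignment
    run fly wireguard list || return 1                             # check the new peer is listed
}
```

Calling `recreate_peer our-org lhr our-office` prints the four commands with a leading "+"; set DRY_RUN=0 once you are happy with them. The fly commands may prompt for further input when run for real, so this is best used interactively.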

HTH.

Official reason:

When I checked earlier, it looked like our-office was on one of two gateway hosts that were affected by the lhr power issues.

The previous gateway isn’t configured to accept new peers, so running fly wg create generated a new config on a different gateway for you.

Thank you for the update.

Please could someone from Fly.io expand on what Fly has done to prevent this occurring on the next unexpected DC power-down and/or to mitigate the root cause in the other WireGuard gateway DCs?

Is there a reason why the customer needed to generate new keys to fix a Fly.io infrastructure problem? Could it not have been resolved by Fly.io fixing the previous gateway’s misconfiguration (not accepting new peers)?

Hi @Whistler ! I’m digging into the lhr WireGuard issue. Will update here as I know more about the root cause.

At the time of writing, creating new tunnels in lhr is working as expected. May I ask whether the issue you are having is with connecting to an existing tunnel or with creating new ones? If the issue is with an existing tunnel, would it be possible to get the WireGuard peer name?

Thanks!

@aschiavo I’ve emailed the details to the support@fly.io email address.

Thanks @Whistler ! I’ll continue investigating and update here.


Hi @Whistler ! To close the loop on the cause of this issue: we recently changed the way WireGuard peers are added to the kernel, and there was an edge case we hadn’t hit until the power outage in lhr. This bug prevented the JIT functionality from working as intended: peers were incorrectly reported as already present in the system and hence were not added to the kernel as they should have been.

This bug has been fixed for all gateways in every region and we are actively working on increasing reliability.
