Any issues with your routing over the past half hour?

Hello,

I got an alert that an app wasn’t working, and indeed it wasn’t. It was timing out.

I looked at others too.

I checked your home page and that was down too, so it wasn’t my app, code, etc. Has anything been changed recently?

(They have come back up now, and your site is back up too.)

There was a short blip in our London location. We’re investigating this.

Ah, that would explain it. I should have said it was LHR.

So … how does this work with multiple regions? Say I have an app in LHR and IAD. What happens to requests in that case? I had assumed that if a region was down, which I understand does happen, requests would just be handled by the other region, and so would get no error. But that didn’t seem to be the case here, as I did get an error.

Or would that have kicked in at some point and taken LHR out of service, essentially, so all requests would have been routed to IAD?

Currently it is a good time of the day for me, and probably for you too, but if this happened at 5am or something I’m wondering what would have happened.

There are lots of different regional problems that we respond to in different ways. This one was a network issue that caused connections from outside to fail. We’re trying to figure out if it was us (meaning, our network provider) or upstream of us (meaning, we’re too small to get someone’s attention).

When this type of thing happens, we’ll typically route around an affected region. This is not an automated process: we get paged, check to see what’s happening, and then basically pull the plug by hand. Withdrawing routes for a region is disruptive to traffic that is already flowing, so we’re somewhat careful. This particular flap lasted ~3 minutes, so we didn’t get far enough into diagnosing it to respond.
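To make that trade-off concrete, here is a minimal sketch of the "wait before pulling the plug" reasoning. It is not our actual tooling (the real decision is made by a person, as described above); the `Probe` type, the function names, and the 300-second threshold are all hypothetical and only illustrate why a ~3 minute flap can end before any routes are withdrawn.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:
    timestamp: float  # seconds since monitoring started (hypothetical unit)
    ok: bool          # True if an outside connection to the region succeeded


def outage_duration(probes: Sequence[Probe]) -> float:
    """Seconds spanned by the run of failed probes at the end of the list."""
    start = None
    for probe in reversed(probes):
        if probe.ok:
            break
        start = probe.timestamp
    if start is None:
        return 0.0  # most recent probe succeeded (or there are no probes)
    return probes[-1].timestamp - start


def should_withdraw(probes: Sequence[Probe], threshold_s: float = 300.0) -> bool:
    """Withdraw routes only when failures have persisted past the threshold.

    The 300 s default is an assumption for illustration, not a real setting.
    Withdrawing is disruptive, so short flaps are deliberately ridden out.
    """
    return outage_duration(probes) >= threshold_s


# A flap of about 3 minutes, probed every 30 seconds, stays below the threshold.
flap = [Probe(t, ok=False) for t in range(0, 181, 30)]
print(should_withdraw(flap))
```

With a rule like this, a short flap is ridden out, while a failure that persists past the threshold would trigger a withdrawal.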

We didn’t have any issues connecting to LHR over our own backhaul this time. People going through Paris would still be able to connect to your VMs.

There are other things that can happen with regions. If there’s a power outage, for example, routes get withdrawn automatically (and VMs get rescheduled other places, if possible).

It actually doesn’t matter what time it is. We are always on call, so the response is basically the same at 5am as in the late afternoon. Also, @jerome has a young child so he’s awake 24x7. Convenient.

I see. I wasn’t sure what kind of automated/manual process was involved.

Thanks for the detailed explanation :heart:

No problem! And just so I say this out loud, a 3 min network flap sucks and we will get better at hiding those from you. :smiley:
