Using backup regions that are geographically close to primary

Let’s say I have primary regions in the US (ewr), Europe (fra), and APAC (syd).

In my backup pool, I have iad (US), cdg (Europe), hkg (APAC) respectively.

However, this has led to situations where the app got deployed in 2 primary regions, e.g. ewr (US) and fra (EU), and a backup region that was already “covered”, i.e. cdg (EU again). In this case, no deployment was running in APAC at all.

It would be great if backup regions were picked so that they cover the missing region(s), i.e. the scheduler should have picked hkg to cover for the missing syd deployment.
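To make the request concrete, here is a minimal sketch of the selection rule I have in mind. This is not how Fly's scheduler works; the function name `pick_backups` and the `AREA_OF` mapping are hypothetical, with the region-to-area grouping taken from the example above.

```python
# Hypothetical sketch of geography-aware backup selection.
# Not Fly's scheduler; AREA_OF just encodes the grouping from this post.
AREA_OF = {
    "ewr": "US", "iad": "US",
    "fra": "EU", "cdg": "EU",
    "syd": "APAC", "hkg": "APAC",
}


def pick_backups(primaries, running, backup_pool, area_of=AREA_OF):
    """Return backup regions whose geographic area has no running deployment."""
    covered = {area_of[r] for r in running}
    # Areas the primaries were meant to serve but that are currently empty.
    wanted = {area_of[r] for r in primaries} - covered
    picks = []
    for region in backup_pool:
        area = area_of[region]
        if area in wanted:
            picks.append(region)
            wanted.discard(area)  # one backup per missing area is enough
    return picks


# ewr and fra are running, syd is not: the APAC backup should be chosen.
print(pick_backups(["ewr", "fra", "syd"], ["ewr", "fra"], ["iad", "cdg", "hkg"]))
# ['hkg']
```

With this rule, cdg would only ever be picked if nothing were running in Europe.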

I hope this makes sense? Am I missing anything, or do you have plans to change this?

Backup regions were a bit of a misfeature. They’re exceptionally difficult to build a nice UX for, and what you ran into is one of the reasons why.

That said, when you run three regions you’re usually better off disabling backup regions entirely (fly regions backup ewr fra syd), then setting a max-per-region flag when you set a count: fly scale count 6 --max-per-region 2.
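The reason those numbers work is simple arithmetic: 6 instances with a cap of 2 across 3 regions leaves only one possible layout. A rough sketch, assuming the cap is a hard per-region limit (the helper `placements` is made up for illustration and is not part of flyctl):

```python
from itertools import product


def placements(count, regions, max_per_region):
    """List every way to spread `count` instances over `regions` under the cap."""
    options = range(max_per_region + 1)
    return [
        dict(zip(regions, combo))
        for combo in product(options, repeat=len(regions))
        if sum(combo) == count
    ]


# count == regions * cap, so every region must be filled to the cap.
print(placements(6, ["ewr", "fra", "syd"], 2))
# [{'ewr': 2, 'fra': 2, 'syd': 2}]
```

A lower count, such as 4 with the same cap, would allow layouts that leave a region empty, so the count should equal the number of regions times the cap if you want every region covered.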

It is very unlikely that we will lose a whole region, and if we do, we’ll temporarily just route you somewhere a little slower in this setup. syd → hkg is not much of an improvement over syd → ewr.

We’re working on our scheduler options now. In the future, we should be able to let you define fallback rules with more precision.


“better off disabling backup regions entirely”

Yep, that’s what I ended up with.

Thanks for the clarification, and looking forward to updates in this area! (You’re probably right about the specific example of syd vs. hkg vs. ewr, but having servers in Europe or not, for example, does make a significant difference to users in that region.)

It definitely does! We invented backup regions without much data on our infrastructure failure modes, though. We’ve since learned that the kinds of failures we experience usually don’t affect full regions. When they do, it’s for a period of minutes. Most people are ok with increased latency during those windows. At least for now. :wink:

Having fallback scheduling options to handle failures is important. We just got it wrong on the first attempt.
