Using backup regions that are geographically close to primary

hendrik · September 7, 2021, 7:10am

Let’s say I have primary regions in the US (ewr), Europe (fra), and APAC (syd).

In my backup pool, I have iad (US), cdg (Europe), hkg (APAC) respectively.

However, this has led to situations where the app got deployed in 2 primary regions, e.g. ewr (US) and fra (EU), and a backup region that was already “covered”, i.e. cdg (EU again). In this case, no deployment was running in APAC at all.

It would be great if backup regions were picked so that they cover the missing region(s), i.e. it should have picked hkg to cover for the missing syd deployment.
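To make the request concrete, here is a rough sketch of the selection behavior I have in mind. It is only an illustration: the AREA grouping and the pick_backups function are names I made up, and this is not how Fly's scheduler works today as far as I know.

```python
# Hypothetical grouping of region codes into geographic areas (my own assumption).
AREA = {
    "ewr": "US", "iad": "US",
    "fra": "EU", "cdg": "EU",
    "syd": "APAC", "hkg": "APAC",
}


def pick_backups(primaries, backups, running):
    """Return backup regions that cover areas with no running deployment.

    primaries: regions in the main region pool
    backups:   regions in the backup pool, in preference order
    running:   regions where the app is currently deployed
    """
    covered = {AREA[r] for r in running}    # areas that already have a deployment
    wanted = {AREA[r] for r in primaries}   # areas the primary pool is meant to serve
    picks = []
    for region in backups:
        area = AREA[region]
        # Only use a backup if its area is wanted and still uncovered.
        if area in wanted and area not in covered:
            picks.append(region)
            covered.add(area)
    return picks


# The situation from this post: ewr and fra are up, syd is missing.
print(pick_backups(["ewr", "fra", "syd"], ["iad", "cdg", "hkg"], ["ewr", "fra"]))
# prints ['hkg']
```

With ewr and fra running, cdg is skipped because Europe is already covered, and hkg is chosen to stand in for syd.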

I hope this makes sense? Am I missing anything, or do you have plans to change this?

kurt · September 7, 2021, 4:21pm

Backup regions were a bit of a misfeature. They’re exceptionally difficult to build a nice UX for, and what you ran into is one of the reasons why.

That said, when you run three regions you’re usually better off disabling backup regions entirely (fly regions backup ewr fra syd), then setting a max-per-region flag when you set a count: fly scale count 6 --max-per-region 2.

It is very unlikely that we will lose a whole region, and if we do, we’ll just temporarily route you somewhere a little slower in this setup. syd → hkg is not much of an improvement over syd → ewr.

We’re working on our scheduler options now. In the future, we should be able to let you define fallback rules with more precision.

hendrik · September 8, 2021, 4:20am

“better off disabling backup regions entirely”

Yep, that’s what I ended up with.

Thanks for the clarification, and looking forward to updates in this area! (You may be right about the specific syd vs hkg vs ewr example, but having servers in Europe or not, for instance, does make a significant difference to users in that region.)

kurt · September 8, 2021, 6:45pm

It definitely does! We invented backup regions without much data on our infrastructure failure modes, though. We’ve since learned that the kinds of failures we experience usually don’t affect full regions. When they do, it’s for a period of minutes. Most people are ok with increased latency during those windows. At least for now.

Having fallback scheduling options to handle failures is important. We just got it wrong on the first attempt.
