Brazil (gru) is down?

Hi,

I noticed my app is no longer running in the gru location.
My app moved to dfw(B), which sucks for our use case.
Please let me know the status of gru.
We must have a stable location outside NA and EU to host our app.

Is DFW a backup region for your app and is GRU a primary region for your app?

Brazil is operational and not very crowded.

Region Pool: 
atl
gru
iad
lhr
sjc
Backup Region: 
dfw

And we used fly scale count 5 --max-per-region 1, so if gru is fine, why did my app move from gru to dfw?
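As an aside, the expectation here can be sanity-checked offline: the pool has 5 regions, so with --max-per-region 1 a count of 5 exactly fills the pool. A rough shell sketch (the inlined listing stands in for real fly regions list output; none of this is a fly feature, just text processing):

```shell
# Count the regions in a captured `fly regions list` listing.
# The string below mirrors the listing in this thread; with a real app
# you would capture the command's output instead.
listing='Region Pool:
atl
gru
iad
lhr
sjc
Backup Region:
dfw'

# Take only the lines between "Region Pool:" and "Backup Region:",
# then count the three-letter region codes.
pool_size=$(printf '%s\n' "$listing" \
  | sed -n '/^Region Pool:/,/^Backup Region:/p' \
  | grep -c '^[a-z][a-z][a-z]$')
echo "$pool_size"
```

With --max-per-region 1, a scale count equal to pool_size (5 here) leaves no spare capacity in the pool, so any rescheduling has to land in the backup region.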

Just ran flyctl regions backup atl iad sjc lhr gru

Region Pool: 
atl
gru
iad
lhr
sjc
Backup Region:

We’ll see if the app moves back to gru from dfw.

FYI, I used fly scale count 5 --max-per-region 1 after this conversation: Need help estimating costs for Volumes - #2 by kurt

I was looking at your app as it was being deployed. It does look like that fixed it.

We’ve realised backup regions were a mistake. We’re working on a new, better scheduler that will probably work a bit differently regarding regions (backups will be a thing of the past, or something like that). More details to come soon-ish as we home in on the details.

Yeah, it moved from dfw to gru so that is good.
I’d love to have this:

  1. Set regions and be sure to always have at least 1 instance in each region (unless region is down/overwhelmed)
  2. Set a max per region (e.g. scale iad to a max of 5, but atl is always one)
  3. If traffic spikes in a region, reroute traffic to a nearby region. So if the one instance in lhr can’t handle it, reroute traffic to ams (close to lhr) instead of to a region close to the edge that received the request coming into Fly

Thanks for the quick response!

That plan from Aaron would be neat.

Another thought on this subject: instead of backup regions, perhaps have a larger geographic region, like regions eu or regions eu, na, alongside the option of a granular region like “lhr” or “ams” if the user needs that for latency.

Using “eu” would be simpler for a user, as it would avoid the complexity of specifying e.g. “lhr” with a backup of “ams”. It would be handy for people who want a geographic region like Europe for compliance and are less concerned about per-ms latency. You could set e.g. 5 instances for “eu”. Then if “lhr” was down, that would be handled seamlessly by “ams” without the user needing to set that as a backup region specifically. And your routing would know the user would want e.g. “ams” and not another region in the pool, e.g. “iad”.

Hey @jerome

My app is no longer running in 2 of the 10 locations. lhr and cdg are missing.
Fly moved the app to atl and maa it seems.

ID       TASK VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED    
aedf42c6 app  100     gru    run     running 1 total, 1 passing 0        7h59m ago  
41ed3a04 app  100     iad    run     running 1 total, 1 passing 0        7h59m ago  
10dcb959 app  100     atl    run     running 1 total, 1 passing 0        9h17m ago  
805c0b7a app  100     atl    run     running 1 total, 1 passing 0        9h17m ago  
3fbe7646 app  100     hkg    run     running 1 total, 1 passing 1        19h49m ago 
59440054 app  100     nrt    run     running 1 total, 1 passing 3        19h57m ago 
23f21ddc app  100     maa    run     running 1 total, 1 passing 3        19h57m ago 
74042cf7 app  100     maa    run     running 1 total, 1 passing 3        19h58m ago 
33f8b70c app  100     syd    run     running 1 total, 1 passing 5        19h59m ago 
89391d10 app  100     sjc    run     running 1 total, 1 passing 5        20h0m ago 

I did not do a deploy in the past 24 hrs.
My regions list has not changed:

Region Pool: 
atl
cdg
gru
hkg
iad
lhr
maa
nrt
sjc
syd
Backup Region:

My questions

  1. Why did the app move out of those locations in Europe?
  2. What can I do to ensure the app runs in those 10 locations, always (unless dc is down or completely overwhelmed)?

I think the only way to make your app run in all these regions is to set your scale count to max-per-region * region count. In this case, your count should be 10, assuming you want 1 per region and your max-per-region is set to 1.

Our current scheduler doesn’t let us do all we’d like to do. This is sort of a hack to get it working. We’re working on our own scheduler that’ll let us do anything and everything (but that’s still a few months out, probably).

Correct!

I’m using fly scale count 10 --max-per-region 1 and have 10 regions set.

Are you suggesting I use fly scale count 10 --max-per-region * regions count instead?

No, that sounds like the correct settings. max-per-region (1) * regions count (10) = 10
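The arithmetic being confirmed above can be written out as a tiny shell sketch (the values mirror this thread; the variable names are made up for illustration):

```shell
# Scale count needed so every region in the pool keeps one VM:
# count = max-per-region * number of regions in the pool.
max_per_region=1
region_count=10
count=$((max_per_region * region_count))
echo "fly scale count $count --max-per-region $max_per_region"
```

Anything less than that product means some region in the pool goes without an instance; anything more is blocked by the per-region cap.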

I’m not entirely sure what happened there. It does look like it just shifted for no reason. Perhaps one of the hosts became unhealthy momentarily and the scheduler moved things around so it would honor the count. Then it didn’t reschedule back?

This doesn’t make sense to me. We’re very frustrated with the scheduling constraints of our current scheduler.

I think @kurt might be able to help more when he gets up :slight_smile:. I see you’ve redeployed and your instances got placed in the right spots, for now.

Thanks.
I guess for now I need to keep tabs myself on which regions the app is running in, and do a deploy whenever it’s not what I want (ugh).

I’m looking at this, and I’m pretty sure there was something else going on. As frustrated as we are with our scheduler, it has never before doubled up in regions when it shouldn’t.

@aaronpeters do you know why your VMs are restarting? If you run fly status --all you can see that two VMs “failed”, and then running fly vm status <id> shows that they exited with an error several times, then got rescheduled. fly logs -i <id> shows a node stacktrace.

We’re looking to see why they got rescheduled in the wrong regions, that’s unexpected. But if you can debug those repeated crashes I don’t think you’ll have this problem again.

Oh, thanks, looking into this now

This gives me

Instances
ID       TASK VERSION REGION DESIRED STATUS   HEALTH CHECKS      RESTARTS CREATED    
482125d0 app  101 ⇡   sjc    run     running  1 total, 1 passing 0        2h53m ago  
34a25e85 app  101 ⇡   atl    run     running  1 total, 1 passing 0        2h54m ago  
99d9a4fb app  101 ⇡   maa    run     running  1 total, 1 passing 0        2h55m ago  
840be25a app  101 ⇡   gru    run     running  1 total, 1 passing 0        2h56m ago  
e4f01116 app  101 ⇡   hkg    run     running  1 total, 1 passing 0        2h57m ago  
019686ac app  101 ⇡   syd    run     running  1 total, 1 passing 0        2h58m ago  
c217982b app  101 ⇡   nrt    run     running  1 total, 1 passing 0        2h59m ago  
72197521 app  101 ⇡   iad    run     running  1 total, 1 passing 0        2h59m ago  
cd301e99 app  101 ⇡   lhr    run     running  1 total, 1 passing 0        3h0m ago   
1dd1f1b7 app  101 ⇡   cdg    run     running  1 total, 1 passing 0        3h1m ago   
aedf42c6 app  100     gru    stop    complete 1 total, 1 passing 0        11h14m ago 
41ed3a04 app  100     iad    stop    complete 1 total, 1 passing 0        11h14m ago 
10dcb959 app  100     atl    stop    complete 1 total, 1 passing 0        12h32m ago 
805c0b7a app  100     atl    stop    complete 1 total, 1 passing 0        12h32m ago 
3fbe7646 app  100     hkg    stop    complete 1 total, 1 passing 1        23h4m ago  
59440054 app  100     nrt    stop    complete 1 total, 1 passing 3        23h12m ago 
23f21ddc app  100     maa    stop    complete 1 total, 1 passing 3        23h12m ago 
74042cf7 app  100     maa    stop    complete 1 total, 1 passing 3        23h13m ago 
de00a56b app  100     gru    stop    complete 1 total, 1 passing 4        23h14m ago 
33f8b70c app  100     syd    stop    complete 1 total, 1 passing 5        23h14m ago 
a33d8e80 app  100     gru    stop    complete 1 total, 1 passing 1        23h14m ago 
89391d10 app  100     sjc    stop    complete 1 total, 1 passing 5        23h15m ago 

No VMs there that “failed”.
From fly logs I don’t see anything weird between 2021-09-17T12:25:45.533855429Z and 2021-09-17T13:40:33.694493492Z

And fly logs does not go back before 2021-09-17T12:25:45.533855429Z, so I can’t see what happened before that time

Oh wow, the records “expired” between when I said that and you tried it. I saw these when I checked:

f8e11492 app  100     atl    stop    failed   1 total            4        22h38m ago
23f21ddc app  100     maa    stop    complete 1 total, 1 passing 3        22h38m ago
74042cf7 app  100     maa    stop    complete 1 total, 1 passing 3        22h39m ago
e9b1d7a1 app  100     cdg    stop    failed   1 total            4        22h39m ago

This showed the stack traces: fly logs -i f8e11492
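Filtering those failed rows out of a fly status --all listing can be done with a little awk; a sketch assuming the column layout shown above (the inline listing stands in for real output, reusing the rows quoted in this thread):

```shell
# Print the IDs of instances whose STATUS column reads "failed".
# The here-doc mirrors the rows above; with a real app you would pipe
# `fly status --all` into awk instead.
failed=$(awk '$6 == "failed" { print $1 }' <<'EOF'
ID       TASK VERSION REGION DESIRED STATUS   HEALTH CHECKS      RESTARTS CREATED
f8e11492 app  100     atl    stop    failed   1 total            4        22h38m ago
23f21ddc app  100     maa    stop    complete 1 total, 1 passing 3        22h38m ago
74042cf7 app  100     maa    stop    complete 1 total, 1 passing 3        22h39m ago
e9b1d7a1 app  100     cdg    stop    failed   1 total            4        22h39m ago
EOF
)
echo "$failed"
```

Each ID it prints can then be fed to fly vm status <id> or fly logs -i <id> as described above.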

Thanks!
We fixed that issue earlier today, so we should be good now.

Nice, we just tweaked the fly status --all output to show any failed VM from the last 2 days. It was previously 12 hours so it vanished before you saw it. :slight_smile:
