Hi,
I noticed my app is no longer running in the gru location.
My app moved to dfw(B), which sucks for our use case.
Please let me know the status of gru.
We must have a stable location outside NA and EU to host our app.
Is DFW a backup region for your app, and GRU a primary region? Brazil is operational and not very crowded.
Region Pool:
atl
gru
iad
lhr
sjc
Backup Region:
dfw
and we used fly scale count 5 --max-per-region 1, so if gru is fine, why did my app move from gru to dfw?
Just ran flyctl regions backup atl iad sjc lhr gru
Region Pool:
atl
gru
iad
lhr
sjc
Backup Region:
We'll see if the app moves back to gru from dfw.
FYI, I used fly scale count 5 --max-per-region 1 after this conversation: Need help estimating costs for Volumes - #2 by kurt
I was looking at your app as it was being deployed. It does look like that fixed it.
We’ve realised backup regions were a mistake. We’re working on a new, better scheduler that will probably work a bit differently regarding regions (backups will be a thing of the past, or something like that). More details to come soon-ish as we home in on the details.
Yeah, it moved from dfw to gru so that is good.
I’d love to have this:
Thanks for the quick response!
That plan from Aaron would be neat.
Another thought on this subject of regions: instead of backup regions, perhaps have a larger geographic region like regions eu or regions eu, na, alongside the option of a granular region like “lhr” or “ams” if that’s needed by the user for latency.
Using “eu” would be simpler for a user as it would avoid the complexity of specifying e.g. “lhr” with a backup of “ams”. Handy for people who want a geographic region like Europe for compliance and are less concerned about per-ms latency. You could set e.g. 5 instances for “eu”. Then if e.g. “lhr” was down, that would be seamlessly handled by “ams” without the user needing to specifically set it as a backup region. And your routing would know the user wants e.g. “ams” and not another region in the pool, e.g. “iad”.
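The idea above can be sketched roughly as follows. This is purely illustrative: the group memberships and the round-robin fallback are my own assumptions, not anything Fly has implemented.

```python
# Hypothetical sketch of the geographic-region proposal: the user asks for
# "eu" and instances are spread across member regions, with failover kept
# inside the same group. The groupings below are assumptions for illustration.
GEO_GROUPS = {
    "eu": ["lhr", "ams", "cdg", "fra"],
    "na": ["iad", "atl", "sjc", "dfw"],
}

def place_in_group(group, count, down=()):
    """Spread `count` instances across a geographic group, skipping regions
    that are down; overflow wraps around within the same group."""
    healthy = [r for r in GEO_GROUPS[group] if r not in down]
    return [healthy[i % len(healthy)] for i in range(count)]

# 5 instances in "eu" with lhr down: traffic stays in Europe, never iad.
print(place_in_group("eu", 5, down={"lhr"}))
# → ['ams', 'cdg', 'fra', 'ams', 'cdg']
```

The key property is that a region outage degrades within the compliance boundary instead of spilling into another continent.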
Hey @jerome
My app is no longer running in 2 of the 10 locations: lhr and cdg are missing. Fly moved the app to atl and maa, it seems.
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
aedf42c6 app 100 gru run running 1 total, 1 passing 0 7h59m ago
41ed3a04 app 100 iad run running 1 total, 1 passing 0 7h59m ago
10dcb959 app 100 atl run running 1 total, 1 passing 0 9h17m ago
805c0b7a app 100 atl run running 1 total, 1 passing 0 9h17m ago
3fbe7646 app 100 hkg run running 1 total, 1 passing 1 19h49m ago
59440054 app 100 nrt run running 1 total, 1 passing 3 19h57m ago
23f21ddc app 100 maa run running 1 total, 1 passing 3 19h57m ago
74042cf7 app 100 maa run running 1 total, 1 passing 3 19h58m ago
33f8b70c app 100 syd run running 1 total, 1 passing 5 19h59m ago
89391d10 app 100 sjc run running 1 total, 1 passing 5 20h0m ago
I did not do a deploy in the past 24 hrs.
My regions list has not changed:
Region Pool:
atl
cdg
gru
hkg
iad
lhr
maa
nrt
sjc
syd
Backup Region:
My questions:
I think the only way your app runs in all these regions is to set your scale count to max-per-region * regions count. In this case, your count should be 10? Assuming you want 1 per region and your max-per-region is set to 1.
Our current scheduler doesn’t let us do all we’d like to do. This is sort of a hack to get it working. We’re working on our own scheduler that’ll let us do anything and everything (but that’s still a few months out, probably).
Correct!
I’m using fly scale count 10 --max-per-region 1 and have 10 regions set. Are you suggesting I use fly scale count 10 --max-per-region * regions count instead?
No, that sounds like the correct settings. max-per-region (1) * regions count (10) = 10
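The sizing rule being confirmed here is simple enough to check directly; a one-liner sketch of the arithmetic:

```python
# Sizing rule from the thread: to get one VM in every region,
# count = max_per_region * number of regions in the pool.
regions = ["atl", "cdg", "gru", "hkg", "iad", "lhr", "maa", "nrt", "sjc", "syd"]
max_per_region = 1
count = max_per_region * len(regions)
print(count)  # → 10, matching `fly scale count 10 --max-per-region 1`
```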
I’m not entirely sure what happened there. It does look like it just shifted for no reason. Perhaps one of the hosts became unhealthy momentarily and it rescheduled things around to honor the count. Then it didn’t reschedule back?
This doesn’t make sense to me. We’re very frustrated with the scheduling constraints of our current scheduler.
I think @kurt might be able to help more when he gets up. I see you’ve redeployed and your instances got placed in the right spots, for now.
Thanks.
I guess for now I need to keep tabs myself on what regions the app is running in and do a deploy whenever it’s not what I want (ugh).
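The manual workaround described here (watching which regions the app is actually in) could be scripted. A minimal sketch, assuming you feed it rows in the `fly status` format shown earlier in the thread (instance id, task, version, region, desired, status, ...); the wanted-regions set and the parsing are assumptions, not an official tool:

```python
# Hypothetical watchdog for the manual workaround: diff the regions that
# currently have a running instance against the regions you want covered.
WANTED = {"atl", "cdg", "gru", "hkg", "iad", "lhr", "maa", "nrt", "sjc", "syd"}

def missing_regions(status_lines, wanted=WANTED):
    """Return wanted regions with no running instance, given rows like
    'aedf42c6 app 100 gru run running ...' (the region is column 4)."""
    running = {line.split()[3] for line in status_lines
               if " running " in f" {line} "}
    return sorted(wanted - running)

rows = [
    "aedf42c6 app 100 gru run running 1 total, 1 passing 0 7h59m ago",
    "10dcb959 app 100 atl run running 1 total, 1 passing 0 9h17m ago",
]
print(missing_regions(rows))
# → ['cdg', 'hkg', 'iad', 'lhr', 'maa', 'nrt', 'sjc', 'syd']
```

A cron job could run this against `fly status` output and trigger a redeploy whenever the list is non-empty.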
I’m looking at this, I’m pretty sure there was something else going on. As much as we are frustrated by our scheduler it’s never doubled up in regions when it shouldn’t before.
@aaronpeters do you know why your VMs are restarting? If you run fly status --all you can see that two VMs “failed”, and then running fly vm status <id> shows that they exited with an error several times, then got rescheduled. fly logs -i <id> shows a node stacktrace.
We’re looking into why they got rescheduled in the wrong regions; that’s unexpected. But if you can debug those repeated crashes, I don’t think you’ll have this problem again.
Oh, thanks, looking into this now
This gives me
Instances
ID TASK VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
482125d0 app 101 ⇡ sjc run running 1 total, 1 passing 0 2h53m ago
34a25e85 app 101 ⇡ atl run running 1 total, 1 passing 0 2h54m ago
99d9a4fb app 101 ⇡ maa run running 1 total, 1 passing 0 2h55m ago
840be25a app 101 ⇡ gru run running 1 total, 1 passing 0 2h56m ago
e4f01116 app 101 ⇡ hkg run running 1 total, 1 passing 0 2h57m ago
019686ac app 101 ⇡ syd run running 1 total, 1 passing 0 2h58m ago
c217982b app 101 ⇡ nrt run running 1 total, 1 passing 0 2h59m ago
72197521 app 101 ⇡ iad run running 1 total, 1 passing 0 2h59m ago
cd301e99 app 101 ⇡ lhr run running 1 total, 1 passing 0 3h0m ago
1dd1f1b7 app 101 ⇡ cdg run running 1 total, 1 passing 0 3h1m ago
aedf42c6 app 100 gru stop complete 1 total, 1 passing 0 11h14m ago
41ed3a04 app 100 iad stop complete 1 total, 1 passing 0 11h14m ago
10dcb959 app 100 atl stop complete 1 total, 1 passing 0 12h32m ago
805c0b7a app 100 atl stop complete 1 total, 1 passing 0 12h32m ago
3fbe7646 app 100 hkg stop complete 1 total, 1 passing 1 23h4m ago
59440054 app 100 nrt stop complete 1 total, 1 passing 3 23h12m ago
23f21ddc app 100 maa stop complete 1 total, 1 passing 3 23h12m ago
74042cf7 app 100 maa stop complete 1 total, 1 passing 3 23h13m ago
de00a56b app 100 gru stop complete 1 total, 1 passing 4 23h14m ago
33f8b70c app 100 syd stop complete 1 total, 1 passing 5 23h14m ago
a33d8e80 app 100 gru stop complete 1 total, 1 passing 1 23h14m ago
89391d10 app 100 sjc stop complete 1 total, 1 passing 5 23h15m ago
No VMs there that “failed”.
From fly logs I don’t see anything weird between 2021-09-17T12:25:45.533855429Z and 2021-09-17T13:40:33.694493492Z. And fly logs does not go back before 2021-09-17T12:25:45.533855429Z, so I can’t see what happened before that time.
Oh wow, the records “expired” between when I said that and you tried it. I saw these when I checked:
f8e11492 app 100 atl stop failed 1 total 4 22h38m ago
23f21ddc app 100 maa stop complete 1 total, 1 passing 3 22h38m ago
74042cf7 app 100 maa stop complete 1 total, 1 passing 3 22h39m ago
e9b1d7a1 app 100 cdg stop failed 1 total 4 22h39m ago
This showed the stack traces: fly logs -i f8e11492
Thanks!
We fixed that issue earlier today, so we should be good now.
Nice, we just tweaked the fly status --all output to show any failed VMs from the last 2 days. It was previously 12 hours, so they vanished before you saw them.