I’m not sure what happened here, but I was experimenting with anchored scaling and my region list. I don’t know exactly which command did it, but all of a sudden there were no instances. I had 6, all in EWR and IAD, and then, poof, all gone. I thought maybe it was a CLI glitch, but metrics across the board instantly dropped to zero. I also pushed a new deployment of my app just to shake things loose; that did not work either. I updated my region list to add new regions; that did not work either. This is quite scary since it’s my live production app. Thankfully I hadn’t deleted my Heroku stack yet, and I’ve managed to cut traffic back over to Heroku.
I’d appreciate any help that anyone can offer. Here are some details and output:
flyctl scale show --app <name>
VM Resources for <name>
VM Size: shared-cpu-1x
VM Memory: 512 MB
Count: 6
Max Per Region: Not set
flyctl regions list --app <name>
Region Pool:
ewr
iad
ord
Backup Region:
vin
vin
vin
yyz
flyctl status --app <name>
Update available 0.0.269 -> v0.0.270.
Run "flyctl version update" to upgrade.
App
Name = <name>
Owner = personal
Version = 31
Status = pending
Hostname = <name>.fly.dev
Deployment Status
ID = 981639c2-7f66-b21a-2fcb-3fc68d8827c3
Version = v31
Status = running
Description = Deployment is running
Instances = 6 desired, 0 placed, 0 healthy, 0 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
Yikes, that sounds troubling. Which commands did you run while playing with scaling? Also, can you run fly status --all, fly autoscale show, and fly scale show?
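For reference, those diagnostics are the following (add --app if you run them outside the app directory):

flyctl status --all --app <name>
flyctl autoscale show --app <name>
flyctl scale show --app <name>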
I initially had my instances just doing count-based scaling, but I wanted a specific number of instances in fra and syd, so I created volumes for each of the 6 instances in the regions I wanted, added the mounts to fly.toml, and then deployed. That deployment worked, but it exposed some latency issues on my end, so I attempted to roll things back to count-based scaling, which is where things got weird. Here is what I believe I did before things disappeared (rough commands sketched below):
I deleted all the volumes. (I expected this to be blocked, on the assumption that I couldn’t delete volumes out from under a live instance. I also expected that this would automagically flip me back to count-based scaling. Instead, it just worked.)
After the instances disappeared I figured it might have been caused by deleting the volumes, so I removed the mounts from my fly.toml and re-deployed (I have done this a few times now), but that has not fixed things. Every deployment I do gets stuck in the pending state forever.
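Roughly, the sequence looked like this (the volume name, mount path, and sizes here are illustrative, not my exact values):

flyctl volumes create app_data --region fra --size 10   # one volume per desired instance
flyctl volumes create app_data --region syd --size 10   # repeated until there were 6

# fly.toml gained a mounts section pointing at those volumes
[mounts]
  source = "app_data"
  destination = "/data"

flyctl deploy                       # this deploy worked

# the rollback attempt: delete every volume by id
flyctl volumes list
flyctl volumes delete <volume-id>   # run once per volume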
My flyctl status --all
App
Name = <name>
Owner = personal
Version = 33
Status = pending
Hostname = pa-app-production.fly.dev
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
b87e5c94 app 26 ewr stop failed 0 9h31m ago
e012f35d app 26 lax stop failed 0 9h31m ago
fe176f52 app 26 ewr stop failed 0 9h31m ago
7b8f686d app 26 ewr stop failed 0 9h31m ago
1d810471 app 26 dfw stop failed 0 9h31m ago
d94205b0 app 26 iad stop failed 0 9h31m ago
bf0b51cf app 26 fra stop failed 0 9h37m ago
2a4f5406 app 26 iad stop complete 0 9h53m ago
a1cfc380 app 26 ewr stop complete 0 9h54m ago
bacf6b02 app 26 lax stop complete 0 9h57m ago
2e048fab app 26 dfw stop complete 0 10h0m ago
0d9136e9 app 25 syd stop failed 0 9h37m ago
ae172812 app 25 fra stop complete 0 9h55m ago
50d58b14 app 24 syd stop complete 0 9h58m ago
0a4057b1 app 23 maa stop complete 0 10h23m ago
560b9c7b app 23 mad stop complete 0 10h23m ago
545aa853 app 23 gru stop complete 0 10h23m ago
96a0d270 app 23 mia stop complete 0 10h23m ago
035b2621 app 23 cdg stop complete 0 10h23m ago
1a9cd11c app 23 mia stop complete 0 10h23m ago
f08914c9 app 21 ewr stop complete 0 10h55m ago
31276bc2 app 21 iad stop complete 0 10h56m ago
9b82f833 app 21 iad stop complete 0 10h57m ago
f9da9c06 app 21 vin(B) stop complete 0 10h58m ago
bd31ae3d app 21 vin(B) stop complete 0 10h59m ago
c234589a app 21 ewr stop complete 0 11h0m ago
ae36c7eb app 20 syd stop complete 0 11h0m ago
59cd17a1 app 20 fra stop complete 0 11h0m ago
78e8404a app 20 maa stop complete 0 11h0m ago
fc3e5cae app 20 ewr stop complete 0 11h0m ago
afc1eb96 app 20 fra stop complete 0 11h0m ago
f9e0400d app 20 syd stop complete 0 11h0m ago
cde72314 app 19 dfw stop complete 0 11h26m ago
b1667f67 app 19 gru stop complete 0 11h29m ago
7d6b3f5d app 19 maa stop complete 0 11h29m ago
51c271c1 app 19 mia stop complete 0 11h29m ago
b0b46478 app 19 maa stop complete 0 11h29m ago
c97666e9 app 19 gru stop complete 0 11h29m ago
8235baaf app 16 fra stop complete 0 11h39m ago
d9fe3662 app 16 syd stop complete 0 11h41m ago
dd63a3f6 app 16 syd stop complete 0 11h42m ago
2a7cd524 app 16 maa stop complete 0 11h44m ago
d9d9abb4 app 16 maa stop complete 0 11h45m ago
9cac4d9e app 15 maa stop complete 0 11h46m ago
67c9b4a7 app 14 yyz(B) stop complete 0 2021-12-22T03:29:44Z
2d9dabb6 app 14 yyz(B) stop complete 0 2021-12-22T03:29:11Z
e09c0c72 app 14 vin(B) stop complete 0 2021-12-22T03:27:48Z
0e8a1768 app 14 ewr stop complete 0 2021-12-22T03:26:54Z
5cf29cce app 14 ewr stop complete 0 2021-12-22T03:25:54Z
My fly.toml
[build]
  dockerfile = "Dockerfile.app"

  [build.args]
    ENV_PATH = "<path>"

[env]
  PORT = "8080"

[[services.ports]]
  handlers = ["tls", "http"]
  port = "443"

[services.concurrency]
  hard_limit = 100
  type = "requests"
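(For anyone comparing with their own app: these sections normally live under a [[services]] wrapper, something like the following, where internal_port would match the PORT env var above; the exact values here are a guess:)

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = "443"

  [services.concurrency]
    hard_limit = 100
    type = "requests"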
She’s alive!! Thanks @kurt and @jsierles for the quick turnaround. A bit of feedback (you’re likely on this already, but…): if deleting an attached volume will kill my instances, the CLI should print a warning. It seems like, even in the happy-path scenario, any instances with volumes attached would have been shut down (which would work fine) and then re-deployed without the volumes, but if the fly.toml still has the volume mounted, that part will always fail. So maybe just a stern warning if Fly has enough information to discern that the user is about to shoot themselves in the foot.
Also, as I think about this, if the re-deployment process failed for one instance, the rollout should have stopped rather than taking out my entire fleet; it should have affected just one instance. Anyway, I’m making lots of assumptions about how Fly works, but the cost of my mistake was extremely high, so if there is anything you can do on your end to make this less likely, that would be great.
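For anyone who hits the same thing, I assume the safe order would have been roughly: drop the mounts from fly.toml and deploy first, then delete the volumes, then go back to count-based scaling:

# 1. remove the [mounts] section from fly.toml, then redeploy without volumes
flyctl deploy

# 2. once the volume-less deployment is healthy, delete the volumes
flyctl volumes list
flyctl volumes delete <volume-id>   # once per volume

# 3. back to plain count-based scaling in the original regions
flyctl regions set ewr iad
flyctl scale count 6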
Thanks again, guys! Traffic is already headed back to Fly.