Help, all my instances disappeared

Hello!

I’m not sure what happened here, but I was experimenting with anchored scaling and my region list. I don’t know exactly which command did it, but all of a sudden: no instances. I had 6, all in EWR & IAD, then poof! All gone. I thought maybe it was a CLI glitch, but metrics across the board instantly went to zero. I also pushed a new deployment of my app just to shake things loose; that did not work either. I updated my region list to add new regions, which also did not work. This is quite scary since it’s my live production app. Thankfully I didn’t delete my Heroku stack yet and I’ve managed to cut traffic back over to Heroku.
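For reference, those last two steps were done with something like the following (I’m reconstructing from memory; ord is the region I believe I added):

flyctl deploy --app <name>
flyctl regions add ord --app <name>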

I’d appreciate any help anyone can offer. Here are some details and output:

flyctl scale show --app <name>
VM Resources for <name>
        VM Size: shared-cpu-1x
      VM Memory: 512 MB
          Count: 6
 Max Per Region: Not set
flyctl regions list --app <name>
Region Pool:
ewr
iad
ord
Backup Region:
vin
vin
vin
yyz
flyctl status --app <name>
Update available 0.0.269 -> v0.0.270.
Run "flyctl version update" to upgrade.
App
  Name     = <name>
  Owner    = personal
  Version  = 31
  Status   = pending
  Hostname = <name>.fly.dev

Deployment Status
  ID          = 981639c2-7f66-b21a-2fcb-3fc68d8827c3
  Version     = v31
  Status      = running
  Description = Deployment is running
  Instances   = 6 desired, 0 placed, 0 healthy, 0 unhealthy

Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED

:sob:

Metrics: (screenshot — every graph drops to zero at the same point)

I checked the app’s logs too; they show nothing but terminated requests from all the previous instances shutting down, and a few of these:

Main child exited with signal (with signal 'SIGINT', core dumped? false)
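For reference, I was pulling those with the standard log command:

flyctl logs --app <name>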

Yikes, that sounds troubling. Which commands did you run when you were playing with scaling? Also, can you run fly status --all, fly autoscale show, and fly scale show?
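That is, something along these lines, substituting your app name:

fly status --all --app <name>
fly autoscale show --app <name>
fly scale show --app <name>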

Will you share your fly.toml? Also, what does fly status --all show?

6 desired instances and 0 running makes it look like it might be expecting volumes. Are you using volumes on this app? And do you have 6 of them?
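A quick way to check is the volumes list command, e.g.:

fly volumes list --app <name>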

I initially had my instances just doing count-based scaling, but I wanted a specific number of instances in fra and syd, so I created volumes for each of the 6 instances in the regions I want, added the mounts to fly.toml, then deployed. That deployment worked, but it exposed some latency issues on my end, so I attempted to roll things back to count-based scaling, which is where things got weird. Here is what I believe I did before things disappeared (a rough sketch of the config and commands is below the list):

  • I deleted all the volumes. (I expected this to be blocked, since I assumed I couldn’t delete volumes out from under a live instance; I also expected it to automagically flip me back to count-based scaling. Instead, it just worked.)
  • After the instances disappeared, I figured it might have been caused by deleting the volumes, so I removed the mounts from my fly.toml and re-deployed (I have done this a few times), but that has not fixed things. Every deployment I do gets stuck in the pending state forever.
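Roughly, the changes for the anchored setup looked like this (the volume name "app_data", the mount path "/data", and the sizes are placeholders, not my exact values):

# one volume per desired instance, created in the anchor regions
fly volumes create app_data --region fra --size 1 --app <name>
fly volumes create app_data --region syd --size 1 --app <name>
# ...repeated until there were 6 volumes in total

# and the section added to fly.toml before the deploy that worked:
[[mounts]]
  source      = "app_data"
  destination = "/data"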

My flyctl status --all

App
  Name     = <name>
  Owner    = personal
  Version  = 33
  Status   = pending
  Hostname = pa-app-production.fly.dev

Instances
ID       PROCESS VERSION REGION DESIRED STATUS   HEALTH CHECKS RESTARTS CREATED
b87e5c94 app     26      ewr    stop    failed                 0        9h31m ago
e012f35d app     26      lax    stop    failed                 0        9h31m ago
fe176f52 app     26      ewr    stop    failed                 0        9h31m ago
7b8f686d app     26      ewr    stop    failed                 0        9h31m ago
1d810471 app     26      dfw    stop    failed                 0        9h31m ago
d94205b0 app     26      iad    stop    failed                 0        9h31m ago
bf0b51cf app     26      fra    stop    failed                 0        9h37m ago
2a4f5406 app     26      iad    stop    complete               0        9h53m ago
a1cfc380 app     26      ewr    stop    complete               0        9h54m ago
bacf6b02 app     26      lax    stop    complete               0        9h57m ago
2e048fab app     26      dfw    stop    complete               0        10h0m ago
0d9136e9 app     25      syd    stop    failed                 0        9h37m ago
ae172812 app     25      fra    stop    complete               0        9h55m ago
50d58b14 app     24      syd    stop    complete               0        9h58m ago
0a4057b1 app     23      maa    stop    complete               0        10h23m ago
560b9c7b app     23      mad    stop    complete               0        10h23m ago
545aa853 app     23      gru    stop    complete               0        10h23m ago
96a0d270 app     23      mia    stop    complete               0        10h23m ago
035b2621 app     23      cdg    stop    complete               0        10h23m ago
1a9cd11c app     23      mia    stop    complete               0        10h23m ago
f08914c9 app     21      ewr    stop    complete               0        10h55m ago
31276bc2 app     21      iad    stop    complete               0        10h56m ago
9b82f833 app     21      iad    stop    complete               0        10h57m ago
f9da9c06 app     21      vin(B) stop    complete               0        10h58m ago
bd31ae3d app     21      vin(B) stop    complete               0        10h59m ago
c234589a app     21      ewr    stop    complete               0        11h0m ago
ae36c7eb app     20      syd    stop    complete               0        11h0m ago
59cd17a1 app     20      fra    stop    complete               0        11h0m ago
78e8404a app     20      maa    stop    complete               0        11h0m ago
fc3e5cae app     20      ewr    stop    complete               0        11h0m ago
afc1eb96 app     20      fra    stop    complete               0        11h0m ago
f9e0400d app     20      syd    stop    complete               0        11h0m ago
cde72314 app     19      dfw    stop    complete               0        11h26m ago
b1667f67 app     19      gru    stop    complete               0        11h29m ago
7d6b3f5d app     19      maa    stop    complete               0        11h29m ago
51c271c1 app     19      mia    stop    complete               0        11h29m ago
b0b46478 app     19      maa    stop    complete               0        11h29m ago
c97666e9 app     19      gru    stop    complete               0        11h29m ago
8235baaf app     16      fra    stop    complete               0        11h39m ago
d9fe3662 app     16      syd    stop    complete               0        11h41m ago
dd63a3f6 app     16      syd    stop    complete               0        11h42m ago
2a7cd524 app     16      maa    stop    complete               0        11h44m ago
d9d9abb4 app     16      maa    stop    complete               0        11h45m ago
9cac4d9e app     15      maa    stop    complete               0        11h46m ago
67c9b4a7 app     14      yyz(B) stop    complete               0        2021-12-22T03:29:44Z
2d9dabb6 app     14      yyz(B) stop    complete               0        2021-12-22T03:29:11Z
e09c0c72 app     14      vin(B) stop    complete               0        2021-12-22T03:27:48Z
0e8a1768 app     14      ewr    stop    complete               0        2021-12-22T03:26:54Z
5cf29cce app     14      ewr    stop    complete               0        2021-12-22T03:25:54Z

My fly.toml

[build]
  dockerfile = "Dockerfile.app"

[build.args]
  ENV_PATH="<path>"

[env]
  PORT = "8080"

[[services.ports]]
  handlers = ["tls", "http"]
  port = "443"

[services.concurrency]
  hard_limit = 100
  type = "requests"

Thank you for the information! There might be a bug preventing the [[mounts]] from being cleared. Give us a bit to investigate.

Deleting a volume stops the associated VM and deletes the volume. Volumes + counts can be surprising.
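For anyone who lands here later, the rollback order that avoids this trap is roughly: remove the mounts from fly.toml and deploy first, then delete the volumes once nothing references them (the volume ID below is a placeholder):

# 1. delete the [[mounts]] section from fly.toml, then redeploy
fly deploy --app <name>

# 2. once the app is running again without volumes, remove them
fly volumes list --app <name>
fly volumes delete vol_xxxxxxxxxxxxx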

Yep, that was the bug. You should be all set now (and we’ll hopefully have the bug fixed today).


She’s alive!! Thanks @kurt and @jsierles for the quick turnaround. Bit of feedback (you’re likely on this already but…): if deleting an attached volume will kill my instances, the CLI should print a warning. It seems like even in the happy-path scenario, any instances with volumes attached would be shut down (fine) and then re-deployed without the volumes; but if the fly.toml still has the volume mounted, that part will always fail. So maybe just a stern warning whenever fly has enough information to tell that the user is about to shoot themselves in the foot.

Also, as I think about this: if the re-deployment failed for one instance, the rollout should have stopped there rather than taking out my entire fleet; at worst I should have lost one instance. Anyway, I’m making lots of assumptions about how Fly works, but the cost of my mistake was extremely high :sweat_smile: , so if there is anything you can do on your end to make this less likely, that would be :ok_hand:

Thanks again, guys! Traffic is already headed back to Fly :man_cartwheeling:
