App stuck in pending state after restarts, rescaling

My app is suddenly down today, and I’m not sure why. I didn’t initiate a new deploy or touch anything on the server in the last few days.

It looks like something told my server to shut down at 2022-07-12T12:42:20.824, and it seems to be stuck on the “Umounting /dev/vdc from /data” step.

2022-07-12T12:42:20.824 runner[bca5e320] iad [info] Shutting down virtual machine
2022-07-12T12:42:20.829 app[bca5e320] iad [info] Sending signal SIGINT to main child process w/ PID 524
2022-07-12T12:42:20.829 app[bca5e320] iad [info] signal received, litestream shutting down
2022-07-12T12:42:20.830 app[bca5e320] iad [info] sending signal to exec process
2022-07-12T12:42:20.830 app[bca5e320] iad [info] waiting for exec process to close
2022-07-12T12:42:20.831 app[bca5e320] iad [info] litestream shut down
2022-07-12T12:42:21.831 app[bca5e320] iad [info] Main child exited normally with code: 0
2022-07-12T12:42:21.831 app[bca5e320] iad [info] Starting clean up.
2022-07-12T12:42:21.844 app[bca5e320] iad [info] Umounting /dev/vdc from /data

$ fly status --all
App
  Name     = picoshare          
  Owner    = personal           
  Version  = 273                
  Status   = pending            
  Hostname = picoshare.fly.dev  

Deployment Status
  ID          = 569d557d-cb05-a5f0-1dae-c04189e3f4e2         
  Version     = v273                                         
  Status      = running                                      
  Description = Deployment is running                        
  Instances   = 1 desired, 0 placed, 0 healthy, 0 unhealthy  

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS          HEALTH CHECKS           RESTARTS        CREATED              
bca5e320        app     273     iad     evict   complete        1 total, 1 passing      20              2022-06-12T11:55:10Z
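
The telling line in the output above is "1 desired, 0 placed": the scheduler wants an instance but has nowhere to put it. Here is a rough sketch of how that check could be automated in Python. The function names parse_instance_counts and is_unplaced are made up for illustration, and the parsing assumes the status text keeps the format pasted above, which may differ between flyctl versions.

```python
import re


def parse_instance_counts(status_text):
    """Return the counts on the 'Instances = ...' line as a dict, or None.

    Assumes the line looks like the one pasted above, e.g.
    'Instances = 1 desired, 0 placed, 0 healthy, 0 unhealthy'.
    """
    match = re.search(r"Instances[ \t]*=[ \t]*(.+)", status_text)
    if match is None:
        return None
    counts = {}
    # Each entry is a number followed by a one-word label.
    for number, label in re.findall(r"(\d+)\s+(\w+)", match.group(1)):
        counts[label] = int(number)
    return counts


def is_unplaced(counts):
    """True when more instances are desired than have been placed."""
    return counts["desired"] > counts["placed"]


# The line copied from the status output in this post.
sample = "  Instances   = 1 desired, 0 placed, 0 healthy, 0 unhealthy  "
counts = parse_instance_counts(sample)
print(counts, is_unplaced(counts))
```

In practice the text would come from saving the output of fly status --all and passing it to parse_instance_counts.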

I tried restarting the app: no change.

I tried scaling the app (changing the RAM allocation): no change.

I can see the new releases in the Fly dashboard, but the logs don’t update at all, and fly status --all has the same output.

This seems to be something about my app in particular. I have a different version of the same app running in a different Fly account, and it’s stuck in the same state:

https://tinypilot-pico.fly.dev/

Both instances mount a 3 GB persistent volume in iad, so I’m wondering if there’s some issue with that DC.

Based on this comment, I was able to get up and running again by scaling to a dedicated CPU.

If I scale back down to a shared CPU, I get stuck in pending state again.

You should be able to scale back down now. We have freed some space on the server where your volume is.

Thanks, confirmed.

Is there anything I can do to avoid this in the future short of always running with a dedicated CPU?

This is a bug in our infrastructure. Your VM got stopped when a particular host had capacity issues. Since your volume was on that one exact host, you couldn’t boot a new VM.

Switching to a dedicated CPU actually evicts other VMs on that host that are running on shared CPUs. It seemed like a good idea when we initially built this, but now I believe we’re better off making new VMs fail. It’ll take us some time to redo this plumbing, but it’s a high priority.

The “real” answer in our infrastructure is to run >=2 VMs for max redundancy, which obviously doesn’t work (yet) with SQLite.

Gotcha, thanks, @kurt!

My apologies to whomever I evicted. :grimacing:

I knew this was Ben Johnson’s fault!


It is, litestream will solve every problem anyone’s ever had.
