Postgres stuck in pending state after running image update

Deployment Status
  ID          = 2381085a-bf46-b34b-c987-cb68104a6b7b         
  Version     = v4                                           
  Status      = running                                      
  Description = Deployment is running                        
  Instances   = 1 desired, 0 placed, 0 healthy, 0 unhealthy 

This is in the HKG region.

After it got stuck, I manually stopped the original VM because I thought it might be holding on to the volume, but that didn’t help.

Please tell me how to restore Postgres; my data has no external backup except volume snapshots.
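In case it’s relevant, the only restore path I can think of is recreating the volume from a snapshot, roughly like this (flyctl syntax from memory, so it may not be exact; the volume name pg_data and the IDs are placeholders):

flyctl volumes list -a geeknote-postgres
flyctl volumes snapshots list <volume id>
flyctl volumes create pg_data --snapshot-id <snapshot id> --region hkg -a geeknote-postgres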

Some background on why I tried to update the image in the first place: before this deployment got stuck, I couldn’t connect to the database from my web app; it said “host not found”. So I tried redeploying the db to solve that problem, and as a result things only became more complicated.
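(For context, the web app connects via the DATABASE_URL that fly postgres attach sets up, which, assuming the usual setup, points at the internal hostname, something like the line below; the user, password, and database name are placeholders. It is that geeknote-postgres.internal name that fails to resolve.)

postgres://appuser:password@geeknote-postgres.internal:5432/geeknote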

I tried scaling the VM, but it still didn’t work:

App
  Name     = geeknote-postgres          
  Owner    = geeknote-net               
  Version  = 6                          
  Status   = pending                    
  Hostname = geeknote-postgres.fly.dev  

Deployment Status
  ID          = 6b1cd140-70f2-e66d-0d66-91c2431a3a4f         
  Version     = v6                                           
  Status      = running                                      
  Description = Deployment is running                        
  Instances   = 1 desired, 0 placed, 0 healthy, 0 unhealthy  

Instances
ID	PROCESS	VERSION	REGION	DESIRED	STATUS	HEALTH CHECKS	RESTARTS	CREATED

When I changed the VM size, it successfully created an instance:

flyctl scale vm dedicated-cpu-1x -a geeknote-postgres

It seems like shared-cpu-1x hosts are running out of resources?
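For what it’s worth, the current VM size and count can be confirmed with something like:

flyctl scale show -a geeknote-postgres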

Now I’m back to the first problem: I can’t access the db:

$ flyctl ssh console -a geeknote-postgres
Error host unavailable: host was not found in DNS
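Since the error points at DNS, I assume the instance isn’t registering at all. A rough way to check (flyctl syntax from memory, so it may differ):

flyctl status -a geeknote-postgres
flyctl dig aaaa geeknote-postgres.internal -a geeknote-postgres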

After scaling back down to shared-cpu-1x, the instance is created but gets stuck in the pending state:

App
  Name     = geeknote-postgres          
  Owner    = geeknote-net               
  Version  = 12                         
  Status   = running                    
  Hostname = geeknote-postgres.fly.dev  

Deployment Status
  ID          = da7927a7-780d-f2af-2415-036a95bb0308         
  Version     = v12                                          
  Status      = running                                      
  Description = Deployment is running                        
  Instances   = 1 desired, 1 placed, 0 healthy, 0 unhealthy  

Instances
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS	RESTARTS	CREATED    
9f20a649	app    	12 ⇡   	hkg   	run    	pending	             	0       	2m11s ago 	
6a9a81a3	app    	11     	hkg   	stop   	pending	             	0       	10m20s ago	
377b3d9f	app    	10     	hkg   	stop   	pending	             	0       	12m4s ago 	
142e92c3	app    	9      	hkg   	stop   	pending	             	0       	17m32s ago

Hey there, can you share the output from flyctl logs so we can see whether there’s any valuable information about what’s going on, please?
We did bring up more server capacity in hkg, but that was about a day ago, and it seems you’re still experiencing issues with your app stuck in pending status, so we need to make sure there’s nothing else going on.
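That would be something like:

flyctl logs -a geeknote-postgres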

The log only shows that I manually stopped the instance, and there is no log output for the subsequent changes.

2022-04-25T03:31:50.733 app[a25370f1] hkg [info] exporter | signal: interrupt
2022-04-25T03:31:50.739 app[a25370f1] hkg [info] keeper   | 2022-04-25 03:31:50.734 UTC [686] LOG:  aborting any active transactions
2022-04-25T03:31:50.744 app[a25370f1] hkg [info] keeper   | 2022-04-25T03:31:50.732Z INFO postgresql/postgresql.go:384 stopping database
2022-04-25T03:31:50.744 app[a25370f1] hkg [info] keeper   | 2022-04-25 03:31:50.742 UTC [686] LOG:  background worker "logical replication launcher" (PID 693) exited with exit code 1
2022-04-25T03:31:50.744 app[a25370f1] hkg [info] keeper   | 2022-04-25 03:31:50.742 UTC [688] LOG:  shutting down
2022-04-25T03:31:50.748 app[a25370f1] hkg [info] sentinel | 2022-04-25T03:31:50.744Z INFO cmd/sentinel.go:1816 stopping stolon sentinel
2022-04-25T03:31:50.756 app[a25370f1] hkg [info] proxy    | exit status 130
2022-04-25T03:31:50.757 app[a25370f1] hkg [info] sentinel | Process exited 0
2022-04-25T03:31:50.769 app[a25370f1] hkg [info] keeper   | waiting for server to shut down....2022-04-25 03:31:50.768 UTC [686] LOG:  database system is shut down
2022-04-25T03:31:50.845 app[a25370f1] hkg [info] keeper   |  done
2022-04-25T03:31:50.845 app[a25370f1] hkg [info] keeper   | server stopped
2022-04-25T03:31:50.857 app[a25370f1] hkg [info] keeper   | Process exited 0
2022-04-25T03:31:51.734 app[a25370f1] hkg [info] Main child exited normally with code: 0
2022-04-25T03:31:51.735 app[a25370f1] hkg [info] Starting clean up.
2022-04-25T03:31:51.755 app[a25370f1] hkg [info] Umounting /dev/vdc from /data


Alright, thank you so much for that!
It turns out we’re actually experiencing some hardware issues in hkg; you can check the status page for updates as we work on the issue.

I do apologize for that inconvenience!

Problem solved, thanks a lot.

I have some suggestions:

  1. This is the third time I have run into resource exhaustion in hkg, and two of those incidents took my server down. Problems like this should be caught in advance through monitoring.

  2. Consider introducing a support ticket system. The advantage of the community forum is that you can see everyone’s problems and contact the developers, but when I hit an urgent problem, not knowing when it will be handled makes me panic.

My past experience does not reassure me about recommending fly.io for production applications. I hope fly.io improves its stability and customer support. Thanks again.

Glad you’re up and running again!
We do have monitoring systems in place; this was actually a new bug that our monitoring didn’t detect properly. We add additional checks whenever this type of situation comes up.

We do recommend running 2+ instances for any database that needs high availability. With that kind of setup, your app would have stayed online even with one instance failing; it gives you a fail-safe against these types of hardware failures.
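Roughly, that means creating another volume and scaling the count, something like the following (the volume name and size here are assumptions; they should match your existing cluster):

flyctl volumes create pg_data --region hkg --size 10 -a geeknote-postgres
flyctl scale count 2 -a geeknote-postgres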

We have both a “dev” and “prod” deployment of flyio/redis:6.2.6 in our environment.
They were also at the shared-cpu-1x VM size with 256MB memory.

And this morning, they were both stuck in “pending”; the app logs showed:

2022-07-12T12:42:30Z runner[f0d06629] iad [info]Shutting down virtual machine
2022-07-12T12:42:30Z app[f0d06629] iad [info]Sending signal SIGINT to main child process w/ PID 523
2022-07-12T12:42:30Z app[f0d06629] iad [info]redis   | Interrupting...
2022-07-12T12:42:30Z app[f0d06629] iad [info]metrics | Interrupting...
2022-07-12T12:42:30Z app[f0d06629] iad [info]redis   | 538:signal-handler (1657629750) Received SIGINT scheduling shutdown...
2022-07-12T12:42:30Z app[f0d06629] iad [info]metrics | signal: interrupt
2022-07-12T12:42:30Z app[f0d06629] iad [info]redis   | 538:M 12 Jul 2022 12:42:30.259 * DB saved on disk
2022-07-12T12:42:30Z app[f0d06629] iad [info]redis   | 538:M 12 Jul 2022 12:42:30.259 # Redis is now ready to exit, bye bye...
2022-07-12T12:42:30Z app[f0d06629] iad [info]redis   | signal: interrupt
2022-07-12T12:42:31Z app[f0d06629] iad [info]Main child exited normally with code: 0
2022-07-12T12:42:31Z app[f0d06629] iad [info]Starting clean up.
2022-07-12T12:42:31Z app[f0d06629] iad [info]Umounting /dev/vdc from /data

I couldn’t get them restarted; scaling the count to 0 and back to 1 and trying new deployments all failed…
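Those attempts were roughly the following (the <redis-app> placeholder stands in for our redis apps):

fly scale count 0 --app <redis-app>
fly scale count 1 --app <redis-app>
fly deploy --app <redis-app>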

To back up the above comments, doing a fly scale vm dedicated-cpu-1x --app got those redis instances back up and running.