Postgres app fails on restart/deployment

Hi, I have a Rails app and a Postgres app. Both were working fine at the end of January, and I don’t recall making any changes after about the 26th, but since the start of February the Postgres app has not been working.

The Postgres app will not boot and just gets stuck on “pending” during releases. I have tried scaling the pg app down and back up to restart it, and I have tried redeploying, but no luck.

Looking into it more deeply, I think the driver failure shown below is the cause:

flyctl status instance <Redacted> -a swedishbirds-db                                                                                               

Instance
  ID            = <Redacted>              
  Process       = app                   
  Version       = 8                     
  Region        = lhr                   
  Desired       = stop                  
  Status        = failed                
  Health Checks = 3 total, 3 passing    
  Restarts      = 5                     
  Created       = 2023-01-25T15:01:42Z  

Recent Events
TIMESTAMP           	TYPE          	MESSAGE                                                                                                       
2023-01-25T15:01:27Z	Received      	Task received by client                                                                                      	
2023-01-25T15:01:27Z	Task Setup    	Building Task Directory                                                                                      	
2023-01-25T15:02:09Z	Started       	Task started by client                                                                                       	
2023-01-26T19:55:47Z	Terminated    	Exit Code: 2                                                                                                 	
2023-01-26T19:55:48Z	Restarting    	Task restarting in 1.084839554s                                                                              	
2023-01-26T19:55:54Z	Started       	Task started by client                                                                                       	
2023-01-26T21:13:44Z	Terminated    	Exit Code: 2                                                                                                 	
2023-01-26T21:13:44Z	Restarting    	Task restarting in 1.206766527s                                                                              	
2023-01-26T21:13:50Z	Started       	Task started by client                                                                                       	
2023-02-03T02:54:20Z	Terminated    	Exit Code: 2                                                                                                 	
2023-02-03T02:54:20Z	Restarting    	Task restarting in 1.063800869s                                                                              	
2023-02-03T02:54:26Z	Started       	Task started by client                                                                                       	
2023-02-03T17:59:46Z	Terminated    	Exit Code: 2                                                                                                 	
2023-02-03T17:59:47Z	Restarting    	Task restarting in 1.203652444s                                                                              	
2023-02-03T17:59:53Z	Started       	Task started by client                                                                                       	
2023-02-03T18:12:48Z	Terminated    	Exit Code: 2                                                                                                 	
2023-02-03T18:12:48Z	Restarting    	Task restarting in 1.080633031s                                                                              	
2023-02-03T18:12:55Z	Driver Failure	rpc error: code = Unknown desc = unable to create microvm: could not find device for volume with name pg_data	
2023-02-03T18:12:55Z	Not Restarting	Error was unrecoverable                                                                                      	

Checks
ID  	SERVICE	STATE  	OUTPUT                                                                                                                         
pg  	app    	passing	HTTP GET http://172.19.64.154:5500/flycheck/pg: 200 OK Output: [✓] transactions: read/write (216.81µs)                        	
    	       	       	[✓] connections: 11 used, 3 reserved, 300 max (3.49ms)                                                                        	
vm  	app    	passing	HTTP GET http://172.19.64.154:5500/flycheck/vm: 200 OK Output: [✓] checkDisk: 799.05 MB (81.9%) free space on /data/ (32.41µs)	
    	       	       	[✓] checkLoad: load averages: 0.09 0.20 0.25 (47.19µs)                                                                        	
    	       	       	[✓] memory: system spent 0s of the last 60s waiting on memory (27.61µs)                                                       	
    	       	       	[✓] cpu: system spent 2.27s of the last 60s waiting on cpu (16.02µs)                                                          	
    	       	       	[✓] io: system spent 0s of the last 60s waiting on io (13.82µs)                                                               	
role	app    	passing	leader                                                              

Here’s my remote config for the postgres app:

{
    "checks": {
        "pg": {
            "grace_period": "30s",
            "headers": [],
            "interval": "15s",
            "method": "get",
            "path": "/flycheck/pg",
            "port": 5500,
            "protocol": "http",
            "restart_limit": 0,
            "timeout": "10s",
            "tls_skip_verify": false,
            "type": "http"
        },
        "role": {
            "grace_period": "30s",
            "headers": [],
            "interval": "15s",
            "method": "get",
            "path": "/flycheck/role",
            "port": 5500,
            "protocol": "http",
            "restart_limit": 0,
            "timeout": "10s",
            "tls_skip_verify": false,
            "type": "http"
        },
        "vm": {
            "grace_period": "1s",
            "headers": [],
            "interval": "1m",
            "method": "get",
            "path": "/flycheck/vm",
            "port": 5500,
            "protocol": "http",
            "restart_limit": 0,
            "timeout": "10s",
            "tls_skip_verify": false,
            "type": "http"
        }
    },
    "env": {
        "PRIMARY_REGION": "lhr"
    },
    "experimental": {
        "auto_rollback": false,
        "enable_consul": true,
        "private_network": true
    },
    "kill_signal": "SIGTERM",
    "kill_timeout": 300,
    "metrics": {
        "path": "/metrics",
        "port": 9187
    },
    "mounts": [
        {
            "destination": "/data",
            "encrypted": false,
            "source": "pg_data"
        }
    ],
    "processes": [],
    "services": []
}

I believe this is the unchanged default config that was generated.

I can see that I still have a volume called “pg_data”, so I am unsure why it could not be found, or whether this is even the cause of the never-ending pending deployment I am seeing.
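
One way to rule out a simple name mismatch is to compare the mount source in the config against the volume names on the app. This is a minimal Python sketch, assuming the config JSON above is available as a string and the volume names have been collected by hand. The function name `missing_mount_sources` is hypothetical and not part of flyctl.

```python
import json


def missing_mount_sources(config_json: str, volume_names: set[str]) -> list[str]:
    """Return mount sources named in the config that have no matching volume name."""
    config = json.loads(config_json)
    sources = [mount["source"] for mount in config.get("mounts", [])]
    return [source for source in sources if source not in volume_names]


# Abbreviated copy of the config above: only the "mounts" section is kept.
CONFIG = '{"mounts": [{"destination": "/data", "encrypted": false, "source": "pg_data"}]}'

# Volume names as read off the volume list for the app (entered by hand here).
print(missing_mount_sources(CONFIG, {"pg_data"}))  # prints []
```

An empty result only confirms that the names agree. It says nothing about whether the host can locate the underlying device, which is what the driver failure message appears to be about.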

Hi @RyanOfWoods, here are some possible workarounds to try:

Get the volume ID

fly volumes list -a <application-name>

Try to extend the volume

fly volumes extend <volume-id> -s <new-size>

Look for your volume snapshots

fly volumes snapshots list <volume-id>

And recreate the volume

fly volumes create <volume-name> --snapshot-id <snapshot-id> -s <volume-size> [-a <app-name>]
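
The steps above can be lined up as a dry run before touching anything. This is a minimal Python sketch that only prints the commands from this reply and does not execute them. The app name, volume ID, and function name are placeholders chosen for illustration, not real values.

```python
import shlex


def workaround_commands(app, volume_id, volume_name, size_gb, snapshot_id=None):
    """Build the fly commands listed above as argument lists, without running them."""
    commands = [
        ["fly", "volumes", "list", "-a", app],
        ["fly", "volumes", "extend", volume_id, "-s", str(size_gb)],
        ["fly", "volumes", "snapshots", "list", volume_id],
    ]
    # Recreating the volume only makes sense once a snapshot ID is known.
    if snapshot_id is not None:
        commands.append(
            ["fly", "volumes", "create", volume_name,
             "--snapshot-id", snapshot_id, "-s", str(size_gb), "-a", app]
        )
    return commands


# Placeholder values: substitute the real app name and the ID from the list step.
for command in workaround_commands("my-db-app", "vol_placeholder", "pg_data", 3):
    print(shlex.join(command))
```

Leaving `snapshot_id` unset skips the recreate step, which matches the situation where no snapshot exists.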

Thanks for the reply!

Unfortunately, extending does not work. I am guessing this is because the volume currently has no attached VM.

flyctl volumes list -a swedishbirds-db   
ID                  	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM	CREATED AT   
xxxxxxxxxxxxxxxxxxx 	created	pg_data	1GB 	lhr   	ad0e	false    	           	4 months ago
LOG_LEVEL=debug flyctl volumes extend xxxxxxxxxxxxxxxxxxx --size 3 -a swedishbirds-db  
? Extending this volume will result in a VM restart. Continue? Yes
Error failed to extend volume: You hit a Fly API error with request ID: 01GRR9QNE9137X7268J1NG5TDG-arn

Logs didn’t give any insight.
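
The empty ATTACHED VM column in the listing above can also be checked programmatically. This is a minimal Python sketch, assuming the listing is tab-separated exactly as pasted. The function name `unattached_volumes` and the sample string are illustrative and not part of flyctl.

```python
def unattached_volumes(listing: str) -> list[dict[str, str]]:
    """Parse a tab-separated volume listing and return rows with no attached VM."""
    lines = [line for line in listing.splitlines() if line.strip()]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("\t")]
    unattached = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("\t")]
        row = dict(zip(header, cells))
        # An empty ATTACHED VM cell means no VM currently has the volume mounted.
        if not row.get("ATTACHED VM"):
            unattached.append(row)
    return unattached


# Sample copied from the listing above (volume ID redacted as in the post).
SAMPLE = (
    "ID                  \tSTATE  \tNAME   \tSIZE\tREGION\tZONE\tENCRYPTED\tATTACHED VM\tCREATED AT   \n"
    "xxxxxxxxxxxxxxxxxxx \tcreated\tpg_data\t1GB \tlhr   \tad0e\tfalse    \t           \t4 months ago\n"
)

for volume in unattached_volumes(SAMPLE):
    print(volume["NAME"], volume["ID"], "has no attached VM")
```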

Unfortunately, I do not have any snapshots of the database because of how long the db has been down.

Hey Ryan,

This looks like it’s due to a Nomad failure. Would you be open to migrating this app to our Postgres on Machines platform?

Thanks for the reply @shaun.

When you say migrate the app, do you mean that I can upgrade the app to use apps v2 (machines)? Or do you mean making a new app with machines and getting my Rails app to point to that? I see no reference in the docs about the first option, only about making machines (which is quite a bit more complex than the nomad pg apps).

I am assuming I won’t be able to pull the data off it, as the volume is tied to the app and I can’t dump the database or anything else without a running VM instance.

Otherwise, do you have any suggestions on how I can kill the vm instance and start it again/restart? I have tried stopping and restarting it, but no luck. I am pretty sure this is blocking all my db app restarts.

flyctl vm stop 0a7e02c3
Error failed to stop allocation: You hit a Fly API error with request ID: 01GRTMSY9H7GQTBP79QYV85Q4W-arn

flyctl vm restart 0a7e02c3
Error failed to restart allocation: You hit a Fly API error with request ID: 01GRTMWEK97JTMVBZR1Y7P840K-arn

I just tried making a brand new pg Nomad app out of curiosity, and even that app got stuck at pending deployment for an hour (I have killed it now).

Hey Ryan,

We can upgrade your app to use v2 machines. I certainly wouldn’t create any new PG apps on top of Nomad if you can help it; we are actively moving stateful apps off it.

Understood @shaun. I would be happy to migrate to the v2 machines then. Would I keep the volume and the data on it?