DB server started but unreachable, Rails app down

Hello dear community,

My app went down and I went in to check why. It seems the DB server is unreachable, with the error: no active leader found. I did some investigation and found out that the “image” needs to be updated.

This is what I get when I check the status of the DB app:

Updates available:

Machine "e784e2e0" flyio/postgres-flex:15.3 (v0.0.42) -> flyio/postgres-flex:15.3 (v0.0.45)

Run `flyctl image update` to migrate to the latest image version.
ID        STATE    ROLE   REGION  CHECKS                          IMAGE                               CREATED               UPDATED
e784e2e0  started  error  sea     3 total, 1 passing, 2 critical  flyio/postgres-flex:15.3 (v0.0.42)  2023-07-02T06:18:23Z  2023-10-08T16:25:19Z

I tried running “image update” and got the following response:

The following changes will be applied to all Postgres machines.
Machines not running the official Postgres image will be skipped.

  	... // 85 identical lines
  	    }
  	  },
- 	  "image": "flyio/postgres-flex:15.3@sha256:c380a6108f9f49609d64e5e83a3117397ca3b5c3202d0bf0996883ec3d",
+ 	  "image": "registry-1.docker.io/flyio/postgres-flex:15.3@sha256:5e5fc53decb051f69b0850f0f5d137c92343fcd1131ec413015e526062",
  	  "restart": {
  	    "policy": "on-failure",
  	... // 8 identical lines
  	
? Apply changes? Yes
Identifying cluster role(s)
  Machine e784e2e0: error
Postgres cluster has been successfully updated!

But nothing works. Restarting, stopping, scaling, redeploying, connecting… nothing.

Health Checks for basira-website-db
  NAME | STATUS   | MACHINE        | LAST UPDATED | OUTPUT                                                                                                                                                                                                                     
-------*----------*----------------*--------------*----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  pg   | critical | e784e2e0c47198 | 48m49s ago   | 500 Internal Server Error                                                                                                                                                                                                  
       |          |                |              | failed to connect with local node: failed to connect to `host=fdaa:2:70e4:a7b:f9:54c9:92f8:2 user=flypgadmin database=postgres`: dial error (dial tcp [fdaa:2:70e4:a7b:f9:54c9:92f8:2]:5433: connect: connection refused)  
-------*----------*----------------*--------------*----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  role | critical | e784e2e0c47198 | 47m55s ago   | 500 Internal Server Error                                                                                                                                                                                                  
       |          |                |              | failed to connect to local node: failed to connect to `host=fdaa:2:70e4:a7b:f9:54c9:92f8:2 user=repmgr database=repmgr`: dial error (dial tcp [fdaa:2:70e4:a7b:f9:54c9:92f8:2]:5433: connect: connection refused)          
-------*----------*----------------*--------------*----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  vm   | passing  | e784e2e0c47198 | 47m47s ago   | [✓] checkDisk: 864.23 MB (87.7%) free space on /data/ (88.48µs)                                                                                                                                                            
       |          |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (82.78µs)                                                                                                                                                                     
       |          |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (66.44µs)                                                                                                                                                    
       |          |                |              | [✓] cpu: system spent 144ms of the last 60s waiting on cpu (51.75µs)                                                                                                                                                       
       |          |                |              | [✓] io: system spent 198ms of the last 60s waiting on io (29.96µs)                                                                                                                                                         
-------*----------*----------------*--------------*----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I’m a big dumdum and I haven’t backed up in over 2 months, and I need the data, so if possible, I’d like to keep it.

Edit: machine id is ‘e784e2e0c47198’.

Thank you.

Resources: thank you Eric!

Did Eric Workman’s tips resolve this for you, or are you still unable to connect?

(The scope of the “Edit:” above is a little ambiguous.)

If not, I’d recommend finding a volume snapshot that predates the failed upgrade and then copying it into a new, permanent volume. The idea is to always keep at least one pristine, unmodified copy of the old filesystem.

https://fly.io/docs/apps/volume-manage/#restore-a-volume-from-a-snapshot

That way, you’ll know that you have at least one fallback route to recovering your data eventually, in case tweaks to the more complicated clustered/proxied arrangement do not pan out.
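Roughly, that looks like the following (a sketch only; the volume ID, snapshot ID, and new volume name are placeholders, and flag details can vary between flyctl versions, so double-check against the page above):

fly volumes list -a basira-website-db                   # note the Postgres volume's ID
fly volumes snapshots list <volume-id>                  # pick a snapshot dated before the upgrade
fly volumes create pg_data_copy --snapshot-id <snapshot-id> -a basira-website-db -r sea

The new volume starts out unattached, which is fine; the point is simply to have an untouched copy that won’t expire the way the automatic snapshots do.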

Thank you for your reply. I really do appreciate it.

I was eventually able to connect to the db server using:

fly ssh console -a db

However, I could not find my data or run the server. It seems there is no “cluster”.

service postgresql start
No PostgreSQL clusters exist; see "man pg_createcluster" ... (warning).

The command “psql” returned:

psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
	Is the server running locally and accepting connections on that socket?

I didn’t attempt to create a new cluster, fearing I might permanently lose all my data (which I now think I already have lost).

What I meant in my “Edit” was to give whoever from the fly.io community is trying to support me the real ID of the machine with the issue; I’d modified it in the “code” I’d provided before then.

The most important question is: how can I find my data? The folder /var/lib/postgresql/data is empty.

Thank you!

The first step would be to ssh in to the server again and run lsblk. That will tell you where in the filesystem the volume (typically vdb) is actually mounted (MOUNTPOINT).

Then try…

find /that/directory -iname PG_VERSION

If that turns up nothing, then find /that/directory -iname 'pg_*'.

[And if all else fails, you probably do still have volume snapshots available (fly vol snapshots) from before the upgrade; Fly periodically takes such snapshots (automatically) and holds them for ~5 days.]
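Put together, the whole check looks roughly like this (a sketch; the app name comes from your health-check output, and you should swap /data for whatever MOUNTPOINT lsblk actually reports):

fly ssh console -a basira-website-db

# then, inside the machine:
lsblk                            # find where the volume (typically vdb) is mounted
find /data -iname PG_VERSION     # a Postgres data directory always contains this file
find /data -iname 'pg_*'         # fallback if PG_VERSION turns up nothing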

Oh you’re good. Thank you.

lsblk returned:

NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda  254:0    0    8G  0 disk /
vdb  254:16   0 1020M  0 disk /data

find /data -iname PG_VERSION returned:

/data/postgresql/base/16509/PG_VERSION
/data/postgresql/base/5/PG_VERSION
/data/postgresql/base/4/PG_VERSION
/data/postgresql/base/16386/PG_VERSION
/data/postgresql/base/1/PG_VERSION
/data/postgresql/PG_VERSION

Now what? I’m sorry if I’m asking for too much, you can guide me to a webpage if I am.

Again, thank you very much.

No worries! Part of the reason I’m on the forum is to practise writing…

The next step is to double-check that this is your data. Think of a string that you know is in your database—but would not be in an auto-created Postgres placeholder.

For example, if this is a cat-names database, then…

find /data -print0 | xargs -0 fgrep -i calico

You’ve found evidence if the output includes lines like “Binary file … matches”.

Assuming the green light is indeed given, we then want:

cat /data/postgresql/PG_VERSION

(The diff in your original post suggests that this is 15.3, but given the mismatches already, we need to be extra careful.)

At this point, there’s a short paragraph of broken record (sorry!) about the permanent snapshot copy… Volume snapshots gradually disappear, and we absolutely do want to have something to roll back to in case the (pending) experiments go awry. Hence, it would be prudent to get the contents of one of them into a place that is designated for storing things long-term.

(It would be best to copy from the oldest snapshot that is still available; it’s ok if it doesn’t date from before the upgrade.)

After that, the exact steps will depend on the version number you saw… The broad plan is to try fly postgres create with a snapshot off this volume as the initial basis for the new Machine’s storage. There’s a fair chance the mount point will be set correctly this time, since there is no prior configuration to muddle things.

(If not, then the world doesn’t end. We’ll just take the traditional pg_dump/pg_restore route instead.)
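To sketch both routes concretely (the names, hosts, and IDs below are placeholders, and flag details may differ across flyctl versions):

# Route A: seed a brand-new Postgres app from a snapshot of the old volume
fly volumes snapshots list <old-volume-id>
fly postgres create --name basira-website-db-new --region sea --snapshot-id <snapshot-id>

# Route B (fallback): logical dump and restore, once any server can read the old data
pg_dump -h <old-host> -p 5432 -U postgres -Fc <database> > backup.dump
pg_restore -h <new-host> -p 5432 -U postgres -d <database> --no-owner backup.dump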

Thank you for everything. I really do appreciate you taking the time to reply to my inquiries. I have learned a LOT as a result.

While I could not recover my data, I got the site back up and working again. I’ve asked the team to re-enter most of the data and we’re hoping for the best.

Thank you again for everything.
