Reuse old volume in new machine after "irrepairable hardware damage"

Thanks! The dates were exactly what I was fishing for, actually.

(I should have mentioned that.)

This is almost certainly the explanation, then… The Postgres container images (at least the ones that I’ve seen) are very enthusiastic about auto-creating a fresh database when they don’t find one in exactly the place they expected. Most likely, that place drifted from /data/postgresql/ to /data/postgres/ sometime after you created the database.
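If you want to double-check which directory the server that is currently running actually uses (assuming the auto-created instance starts and accepts connections), SHOW data_directory; from a psql session will tell you:

$ fly postgres connect -a shy-river-7114
postgres=# SHOW data_directory;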

Let me think a little about the safest next steps, though…

2 Likes

Thanks for the help @mayailurus!

I’m not a db expert, but I was thinking that /data/postgresql/ contains tools and processes and /data/postgres/ is for data. Nevertheless, I’m clueless about next steps, so I’m waiting and hoping that you can figure out something…

Not a horrible guess! Typically, Debian-based systems try to keep the former in /usr/bin/, /usr/lib/, and the like, but on Windows it would be pretty normal to have them closer together.

You can poke around a little in there yourself, if you like:

$ fly ssh console -a shy-river-7114
# cd /data/postgresql
# cat PG_VERSION
# ls -lF

This gives you a Unix-style shell, so executable files will show up with an asterisk (*) after their names.

(Be careful with other commands, though, since you do have full superuser powers.)

Another thing you can try in there, which is sometimes reassuring, is…

# find /data/postgresql -type f -print0 | xargs -0 fgrep -i invoice

This will search for all files containing the string “invoice”. (More generally, use something that would be characteristic of your own, particular dataset.)

On typical PG files, it will announce binary file matches on success.
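For instance, a hit inside one of Postgres’s table files gets reported roughly like this (the path here is only illustrative):

Binary file /data/postgresql/base/16384/1234 matches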


Edit: changed the example string from “receipt” → “invoice”, since actually “receipt” does have matches in the auto-created database. Oops!

1 Like

Using the find command, I can see matches that are specific to the dataset! It seems that the data is there :partying_face:

Could our issue be related to the fact that we used the wrong --image-ref? The first time I created a machine I tried without --image-ref and it would not start, but then later I had to guess the value…

2 Likes

Maybe… The cat PG_VERSION above should tell you the major version, at least.

Did you definitely create it with fly pg create originally, i.e., last summer?
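One way to see which image (and tag) the current machine is actually running is the IMAGE column of the machine list, e.g.:

$ fly machine list -a shy-river-7114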

Yes, it was exactly that day. If I try cat PG_VERSION I get 15, which does not seem to be one of the versions available at https://hub.docker.com/r/flyio/postgres/tags

2 Likes

Same on my side: the data is there. There is no version 15 on that repo, but I noticed that new instances of pg are using postgres-flex, which is found here

All versions are here; I will try to restore using this flex version.
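(Presumably something along these lines, with the snapshot ID below being a placeholder:)

fly postgres create --snapshot-id <your-snapshot-id> --image-ref flyio/postgres-flex:15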

1 Like

I had the same thought about this one, and I also faced connection issues without --image-ref.

But no luck with the flex version flyio/postgres-flex:15; I’m getting this for version 15:

WARNING: database "postgres" has a collation version mismatch
DETAIL: The database was created using collation version 2.31, but the operating system provides version 2.36.
HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE postgres REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.

That might actually be progress, though, since this new error is a known, fixable problem:
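The fix the HINT itself points at is to refresh the collation version on each affected database (template1 included), once you can get a psql session on the restored instance, roughly:

postgres=# ALTER DATABASE template1 REFRESH COLLATION VERSION;
postgres=# ALTER DATABASE postgres REFRESH COLLATION VERSION;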

1 Like

Yes, but the issue is that we have restored data, so the pg service is not starting at all on machine start, and I cannot connect to the database to run the pg commands described in the issue above.

FATAL: database "repmgr" does not exist
failed post-init: failed to enable repmgr: failed to create repmgr database: ERROR: template database "template1" has a collation version mismatch (SQLSTATE XX000)

Hm… Maybe temporarily try an earlier 15.x release image instead, just so you definitely have your data back.

It looks like another user might have had luck with flyio/postgres-flex:15.1

https://community.fly.io/t/cant-recreate-pg-from-snapshot-the-database-was-created-using-collation-version-2-31-but-the-operating-system-provides-version-2-36/19480

1 Like

THAT IS IT! @mayailurus, THANKS a lot for your assistance. It will probably work for you too, @Tommaso.

So, for everyone who lands here with a similar issue, this is what worked for me in the end:

fly postgres create --snapshot-id YOUR_SNAPSHOT_ID --image-ref flyio/postgres-flex:15.1
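(If you need to look up YOUR_SNAPSHOT_ID first, listing the snapshots of the old volume works; the volume ID below is a placeholder:)

fly volumes list
fly volumes snapshots list vol_xxxxxxxxxxxxx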
2 Likes

Resolved for me too! Thank you so much @mayailurus for the support!

1 Like

I don’t have a primary_region defined at all, but the server is spawned in AMS and the volume is in AMS, so they seem to end up in the same region anyway?

(screenshots of the machine and the volume, both showing region ams)

I should add that I don’t have a Postgres database associated. I just have a volume where I write to the filesystem from the application.

Odd… It might be getting confused by the unreachable volume that is also in Amsterdam.

It would help to have the outputs of fly m list and fly vol list, so we forum readers are not so much in the dark about minutiae. (The screenshots are useful, but a little short on details.)
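Something like the following, run from the app’s directory (or with -a and the app name added):

$ fly m list
$ fly vol list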

Also, are you using GPUs?

Added the postgres tag, for the other, earlier parts of the discussion.

No problems

➜ fly m list
1 machines have been retrieved from app simple-sgt.
View them in the UI here (https://fly.io/apps/simple-sgt/machines/)

simple-sgt
ID              NAME                 STATE    CHECKS  REGION  ROLE  IMAGE                                             ADDRESS                         VOLUME                CREATED               LAST UPDATED          PROCESS GROUP  SIZE
7815de5a9900d8  spring-firefly-8065  started  1/1     ams           simple-sgt:deployment-01J4YAWF695CD72Q7YX4R3Q1S1  fdaa:0:bb64:a7b:42:c66b:4feb:2  vol_4qym70x9qql5w36v  2024-08-10T14:52:58Z  2024-08-10T14:53:11Z  app            shared-cpu-1x:256MB

sig/simple-sgt on   (git)-[main|merge]- [⇕=✘!+?] is 📦 v1.0.0 via ⬢ v20.9.0 took 3s
➜ fly vol list
ID                   	STATE  	NAME        	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM   	CREATED AT
vol_n0l9vl2ddz8r635d*	created	sig_sgt_data	1GB 	ams   	f6b7	true     	              	1 year ago 	
vol_vppoe3mgn552qzwv 	created	sig_sgt_data	1GB 	ams   	7ef4	true     	              	1 month ago	
vol_4qym70x9qql5w36v 	created	sig_sgt_data	1GB 	ams   	759e	true     	7815de5a9900d8	3 days ago 	

* These volumes' hosts could not be reached.
sig/simple-sgt on   (git)-[main|merge]- [⇕=✘!+?] is 📦 v1.0.0 via ⬢ v20.9.0 took 6s
➜ fly volumes snapshots list vol_n0l9vl2ddz8r635d
Snapshots
ID                   	STATUS 	SIZE    	CREATED AT	RETENTION DAYS
vs_6PDn9jjaq9RGholPjO	created	39440578	5 days ago	60            	
vs_6PDn9jjaq9RGho1oDO	created	39440578	6 days ago	60            	
vs_VoRy4qqzv42btzg7G5	created	39440578	1 week ago	60            	
vs_XoazJxxb9J3At6zxN4	created	39440578	1 week ago	60            	
vs_M9ZkX33M5X6vtZXlb8	created	39440578	1 week ago	60            	
vs_8YznlbbGpl2XI0P7y 	created	39440578	1 week ago	60   

1 Like

Thanks… Is vol_vppoe3mgn552qzwv the one that you created from a snapshot? Its creation date is much older than I would have expected…
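If it is hard to tell from memory, I believe fly volumes show will list that volume’s details (creation time and so on):

$ fly volumes show vol_vppoe3mgn552qzwv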

Ok. So I do the following:

➜ fly volumes create sig_sgt_data --snapshot-id vs_M9ZkX33M5X6vtZXlb8 -s 1
Some regions require a Launch plan or higher (bom, fra).
See https://fly.io/plans to set up a plan.

? Select region: Amsterdam, Netherlands (ams)
                  ID: vol_4qynyk60n20w9o6v
                Name: sig_sgt_data
                 App: simple-sgt
              Region: ams
                Zone: 759e
             Size GB: 1
           Encrypted: true
          Created at: 13 Aug 24 19:59 UTC
  Snapshot retention: 5
 Scheduled snapshots: true
sig/simple-sgt on   (git)-[main|merge]- [⇕=✘!+?] is 📦 v1.0.0 via ⬢ v20.9.0 took 13s
➜ fly scale count 1
App 'simple-sgt' is going to be scaled according to this plan:
  +1 machines for group 'app' on region '' of size 'shared-cpu-1x'
  +1 volumes  for group 'app' in region ''
? Scale app simple-sgt? Yes
Executing scale plan
  Creating volume sig_sgt_data region: size:1GiB
  Created 185e64eb449148 group:app region: size:shared-cpu-1x volume:vol_vg7p7m8wye7kz3pv

I.e., I create a new volume from a snapshot, then run the scale command, and as you can see, the scale command even logs that it creates a brand-new volume, which it then starts using instead of the one I just created.
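(To make the mix-up concrete: running fly vol list again should show the new machine under the ATTACHED VM column of the freshly created vol_vg7p7m8wye7kz3pv, while the snapshot-restored vol_4qynyk60n20w9o6v stays unattached.)

➜ fly vol list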

1 Like