Reuse old volume in new machine after "irreparable hardware damage"

I’ve been running a machine for quite some time. Got an email the other day about “Some of your apps in AMS are on a host which has suffered a hardware failure and will be down for an extended period.”

Then a couple of days ago I got a new email saying it won’t recover, with a link explaining how to get back up again by reusing the volume, i.e. creating a new volume based on snapshots of the old one.

So I tried to follow “Apps with one Machine and an attached volume”, and it says to:

  1. fly volumes list
  2. fly volumes snapshots list
  3. fly volumes create <volume-name> --snapshot-id <snapshot-id> -s <size>
  4. fly scale count 1
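
For reference, filled in with placeholder values (my real ones got mangled when pasting), those look something like:

```text
$ fly volumes list -a <app-name>
$ fly volumes snapshots list <volume-id>
$ fly volumes create <volume-name> --snapshot-id <snapshot-id> -s <size-gb> -r <region> -a <app-name>
$ fly scale count 1 -a <app-name>
```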

So I did all that. At step 3 it creates a volume, and at step 4 it deploys alright, but when deploying it also creates a new volume rather than using the one I created in step 3.

How can I force my app to use the volume I created from a snapshot?

Some of the commands’ formatting got mangled when pasting, but you still get the idea (they were just copy-pasted from the linked page).

I have exactly the same problem. I followed all the steps and restored from the snapshot, but the issue is that the instance from the snapshot uses the new volume and not the previous one I was using, meaning that the database is empty in my case. Any insight on how to solve this issue?

I was assuming that if you restore from the snapshot it should be a replica of the original, and you could just detach the old instance and attach the new one?

Hi… Sorry you’re having so much trouble with this… Volumes are tied to specific regions, which may be the source of the problem.

Perhaps you could post your fly.toml, particularly the primary_region, as well as the outputs of the fly volumes list and fly m list commands?
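
(For background, and this is just my understanding of how attachment works: a Machines app selects volumes by name, via the [mounts] section of fly.toml, and fly scale count will reuse an unattached volume with a matching name in the target region, creating a brand-new one only if none is found. So a restored volume in the wrong region could be silently passed over. A sketch of the relevant section, with illustrative names:)

```text
# fly.toml (sketch)
[mounts]
  source = "pg_data"       # must match the *name* of the restored volume
  destination = "/data"
```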


Aside: You can use triple-backticks (```) to avoid formatting problems, :dragon:. E.g.,

```text
$ fly volumes list
$ fly m list
```

The above would come out as…

$ fly volumes list
$ fly m list

Added the volumes list below.

Basically, I’m unable to attach this volume to any pg app. If I create a new pg app from the snapshot of this volume, there is no previous data, just an empty database.

----

ID                   	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM	CREATED AT
vol_g67340kjpj2vydxw*	created	pg_data	20GB	ams   	f6b7	true     	           	1 year ago

* These volumes' hosts could not be reached.

Tried clone, fork, and attach, all the options, and I always got the same outcome. Btw, due to the irreparable hardware damage, all my machines were destroyed.

Yeah, these cases are pretty harsh. Sorry you ran into that!

I’m mainly asking about the new machines and new volumes that were created subsequently, though, :adhesive_bandage:. There are many small things that could have slipped out of alignment, at this point.

The first part is true. However, the second is not necessarily so simple, with Postgres.

Could you try creating a new PG machine from a snapshot, again?

We can start by verifying the machines and volumes lists and then looking inside with fly ssh console.
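
Something like the following, with your app name filled in:

```text
$ fly m list -a <app-name>
$ fly volumes list -a <app-name>
$ fly ssh console -a <app-name>
```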

(Also, it would help if you could report the exact, entire invocations that were used, since the details really do matter here.)

Sure, so it goes like this:

My machines for the pg app ht-db-cluster were destroyed. The volume is there but shows usage 0/20GB, so it’s empty. I want to restore it from a snapshot like this:

fly volumes list -a ht-db-cluster

ID                   	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM	CREATED AT
vol_g67340kjpj2vydxw*	created	pg_data	20GB	ams   	f6b7	true     	           	1 year ago
fly volumes snapshots list vol_g67340kjpj2vydxw

ID                   	STATUS 	SIZE     	CREATED AT	RETENTION DAYS
vs_4B2RqvpBKYVack7MVZ	created	611561223	3 days ago	60
vs_4B2RqvpBKYVacNa9Z 	created	611561222	4 days ago	60
vs_zGQ7mg3GL4KaHMDm62	created	611561222	5 days ago	60
vs_91XqnLQ1mzwlh74ngl	created	611561230	6 days ago	60
vs_RjVJ4y3jZGQPU1kynp	created	611561228	1 week ago	60
vs_P273e8M2Z5BycyJp0R	created	611561229	1 week ago	60
fly postgres create --snapshot-id vs_4B2RqvpBKYVack7MVZ -r ams --image-ref flyio/postgres:14.6

And I get the connection string for the new instance and connect from a client, but the database is empty even though it shows usage of 1+ GB.

fly m list -a shy-river-7114

ID            	NAME               	STATE  	CHECKS	REGION	ROLE  	IMAGE                        	IP ADDRESS                    	VOLUME              	CREATED             	LAST UPDATED        	APP PLATFORM	PROCESS GROUP	SIZE
185e649c492498	holy-waterfall-8460	started	3/3   	ams   	leader	flyio/postgres:14.6 (v0.0.41)	fdaa:0:4f49:a7b:3e:fce6:a664:2	vol_vlp976oe2395gzp4	2024-08-11T09:51:36Z	2024-08-11T09:57:02Z	v2          	             	shared-cpu-2x:4096MB
----
fly vol list -a shy-river-7114

ID                  	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM   	CREATED AT
vol_vlp976oe2395gzp4	created	pg_data	20GB	ams   	95a6	true     	185e649c492498	35 minutes ago

These are all the steps taken from my side, and whatever I do, the db is always empty on the new app even though the volume shows usage.

Thanks for the assistance; I’m in real need of the data that was in the db.


Hello, I received the “irreparable hardware damage” email too :cold_sweat: and I am in the same situation.

I have 60-day snapshot retention for my postgresql app:

fly volumes snapshot list vol_ez1nvxw19pzrmxl7

ID                   	STATUS 	SIZE        CREATED AT 	RETENTION DAYS
vs_KDn1PAKqB9vvcZB8GP	created	116154277	2 days ago 	60
vs_KDn1PAKqB9vvcyOkeP	created	116142405	3 days ago 	60
vs_A54JgYOkXQjjCgJanR	created	116137943	4 days ago 	60
vs_Zk9ZYNK3e2wwsN2YV 	created	116120776	5 days ago 	60
vs_M9ZwBRKqmlvvFL2k6j	created	116120776	6 days ago 	60
vs_8YzNKg5BqMXXTe3DGx	created	116120776	1 week ago 	60
...
vs_M9ZwBRKqmlvvFZmvb8	created	116120777	1 month ago	60
vs_qw1B5z3VAoeeilOzv 	created	116120777	1 month ago	60
vs_lA8p5BJMX9NNhMmmOk	created	116120777	1 month ago	60

but if I try to create a new postgres app from any of the snapshots

fly postgres create --snapshot-id vs_A54JgYOkXQjjCgJanR --image-ref flyio/postgres:13

the database is empty besides the default tables:

postgres=# \l
                                    List of databases
   Name    |   Owner    | Encoding |  Collate   |   Ctype    |     Access privileges
-----------+------------+----------+------------+------------+---------------------------
 postgres  | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 | =c/flypgadmin            +
           |            |          |            |            | flypgadmin=CTc/flypgadmin
 template1 | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 | =c/flypgadmin            +
           |            |          |            |            | flypgadmin=CTc/flypgadmin

Am I doing something wrong or do the snapshots really contain no updated data?


Thanks for all the details! It looks like this one might just be a mismatch with the connection string…

What do you see from the following?

$ fly pg db list -a shy-river-7114

(This is another way of doing @Tommaso’s \l.)
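
(If you’d rather poke at it directly with psql from your own machine, one option, assuming you have psql installed locally, is to proxy the Postgres port and connect with your credentials:)

```text
$ fly proxy 5432 -a shy-river-7114
$ psql "postgres://postgres:<password>@localhost:5432/postgres"
```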

This is the output

fly pg db list -a shy-river-7114
----

NAME    	USERS
postgres	flypgadmin, postgres, repluser

Hm… The 1GB usage strongly suggests that your data is there, somewhere, but it looks like we’re going to have to do more work to find it…

(A fresh PG volume is more like 100MB.)

$ fly ssh console -C 'df -h' -a shy-river-7114

(@Tommaso, you might want to try this on your end, as well.)

This will both confirm sizes and (possibly) suggest mountpoint discrepancies…

fly ssh console -C 'df -h' -a shy-river-7114

----

Connecting complete
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        2.0G     0  2.0G   0% /dev
none            7.8G  8.4M  7.4G   1% /
/dev/vdb        7.8G  8.4M  7.4G   1% /.fly-upper-layer
shm             2.0G   76K  2.0G   1% /dev/shm
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/vdc         20G  161M   19G   1% /data

This is what I get running that command

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         94M     0   94M   0% /dev
none            7.8G  8.4M  7.4G   1% /
/dev/vdb        7.8G  8.4M  7.4G   1% /.fly-upper-layer
shm             107M   44K  107M   1% /dev/shm
tmpfs           107M     0  107M   0% /sys/fs/cgroup
/dev/vdc        986M  152M  767M  17% /data

Thank you @mayailurus for the help


I’m just wondering, if the data is encrypted on the volume, can we even use it without the machine that was used to encrypt it, since that got destroyed? The data might be there but locked by another process or something.

I believe those keys are centrally managed, but that was good thinking, overall…


The discrepancy between the internally reported 161M usage and the 1,259 MB in the screenshot is giving me pause, here. That might just be the difference between the filesystem view and the block-device view, though.

Did this database have a lot of writes and deletes in the past?

Also, does 1GB sound roughly correct to you in terms of plausible, total data?
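
(If you want to cross-check, du from inside the machine reports the filesystem view, which you can compare against the dashboard’s number:)

```text
$ fly ssh console -C 'du -sh /data' -a shy-river-7114
```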


Multiples of things have definitely been a problem in the past… (We ruled out one classic situation already, with \l.)

Here’s another that is sometimes useful:

$ fly ssh console -C 'find / -name PG_VERSION' -a shy-river-7114

This will detect multiple PG clusters on the same machine.

(A normal PG volume will have multiple hits here, too, one per database under base/, so it takes some interpretation.)

Not sure; it might also be somewhere between 50 MB and 200 MB. This db stored some e-commerce orders and receipts, and I didn’t back it up manually, counting on the system’s backups. Also Strapi CMS installation and configuration records.

I’m a bit lost here now

/data/postgresql/PG_VERSION
/data/postgresql/base/1/PG_VERSION
/data/postgresql/base/16514/PG_VERSION
/data/postgresql/base/16513/PG_VERSION
/data/postgresql/base/5/PG_VERSION
/data/postgresql/base/4/PG_VERSION
/data/postgresql/base/16386/PG_VERSION
/data/postgres/PG_VERSION
/data/postgres/base/1/PG_VERSION
/data/postgres/base/13757/PG_VERSION
/data/postgres/base/13756/PG_VERSION

Ah! This is multiple something, right?

Both postgres and postgresql (with the ql suffix).

Odds are good that the larger one is your old data.

(I made a completely new Stolon PG cluster for comparison, and it only had /data/postgres/.)
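
A quick size comparison of the two trees would confirm which one holds the real data:

```text
$ fly ssh console -C 'du -sh /data/postgres /data/postgresql' -a shy-river-7114
```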

Try the following:

$ fly ssh console -C 'ls -l /data/postgresql/PG_VERSION /data/postgres/PG_VERSION' -a shy-river-7114

I tried your latest command, @mayailurus. The only thing I notice is that the second row has a date possibly related to when I initially created the database:

-rw------- 1 stolon stolon 3 Aug 11 11:01 /data/postgres/PG_VERSION
-rw------- 1 stolon stolon 3 Jul 12  2023 /data/postgresql/PG_VERSION

Same here; it might be the image that I’m using, it’s 14.6. Maybe pg 13 stores data in a different path.

----

Connecting complete
-rw------- 1 stolon stolon 3 Aug 11 09:57 /data/postgres/PG_VERSION
-rw------- 1 stolon stolon 3 Jun 13  2023 /data/postgresql/PG_VERSION