Where is our Data? (Postgres Database Volume 0MB)

Opening this question in case anyone else is in the same boat (having their Nomad-based Postgres instance killed …)

Context

We deployed our auth (Elixir/Phoenix) app to Fly.io in 2022. :rocket:
It was a good experience and worked flawlessly until they killed it. :neutral_face:

Fly.io App Disabled :skull:

We received the following email from Fly.io:

Full text of the email for Google indexing/findability :e-mail:

Hi. We’re in the process of removing Nomad from Fly.io. This means any apps that haven’t been upgraded to Apps V2 will no longer function.

Your app authprodb is one of those apps. Here’s what this means for you:

1. authprodb is currently disabled.
  - That means the Nomad instances (virtual machines) that ran your app have been shut down, but that the app’s configuration still exists.
2. Your app is technically now on the “machines” platform, rather than “nomad.”
  - That means that any subsequent fly deploy on your app will attempt to launch Fly Machines.
  - In many cases, a deploy will immediately work. In some cases, there may be some additional configuration required to get your app running on Machines. 
    The community forum is a great place to get help troubleshooting your deployment if you get stuck.
3. Nothing has been deleted.
  - Your app’s Docker image still exists, and any Volumes that were attached to your app still exist.

Thanks for bearing with us throughout this whole migration saga. It’s not been easy, but the results should soon speak for themselves.

If you have any questions or comments, we’ve posted a similar announcement on the community forum.

Thanks, Your Friends at Fly.io

The key point of the email:
they shut down our Postgres instance
and, according to point 3, “Nothing has been deleted”.

“Nothing has been deleted” … :thought_balloon:

So I dived into trying to recover the database.
Tried re-deploying the authprod App that was attached to the authprodb Postgres instance:

fly deploy --verbose

It errors because the Elixir App cannot run the mix release command:

Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1

Full log output in: Internal Server Error 😢 🔥 · Issue #325 · dwyl/auth · GitHub

After a couple of hours of searching and trying everything I could think of with zero progress, I’m opening this forum topic.

Question: Why is the Volume Used 0 MB?

Viewing the volumes for authprodb, they show Used 0 MB, i.e. no data:
https://fly.io/apps/authprodb/volumes

Luckily, we have a working “staging” version of the auth App also hosted on Fly.io: https://authdemo.fly.dev/
and can easily compare the volumes: on https://fly.io/apps/authdemo-db/volumes we see that authdemo-db is using 182 MB of the available volume.

This suggests that our data was in fact deleted when Fly.io disabled our App.

If anyone has seen this issue and can help us recover our Postgres data, please comment! :pray:

Hi there!

Can you check fly volumes list -a authprodb and see if volume information looks correct there? It might be a display issue with the web UI; perhaps some overzealous rounding-down. For what it’s worth, I checked in our admin panel and your volumes are there and do show some usage. As the email said, data was not deleted.

  • Daniel

Good idea to sense-check via the CLI. :ok_hand:

fly volumes list -a authprodb

output:

ID                  	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM	CREATED AT
vol_g2yxp4mllj6v63qd	created	pg_data	10GB	lhr   	34c3	false    	           	2 years ago
vol_1g67340ggznrydxw	created	pg_data	10GB	lhr   	805e	false    	           	2 years ago

Doesn’t actually tell us much as the USED column is not shown in the output. :man_shrugging:

Tried the flyctl volumes show command for each of the volumes:

fly volumes show vol_g2yxp4mllj6v63qd

but got the following error:

Error: failed retrieving volume: failed to get volume vol_g2yxp4mllj6v63qd: 
Volume not found (Request ID: 01HEB11ZCNK8NDBZ8QJAR4NXF4-mad)

Same for the other volume:

fly volumes show vol_1g67340ggznrydxw

Similar error:

Error: failed retrieving volume: failed to get volume vol_1g67340ggznrydxw: 
Volume not found (Request ID: 01HEB178VRQFVMK4YS1FXCZQXN-mad)

Really hope the data isn’t lost. But get a sinking feeling … :confused:

No reply from support@fly.io via email yet. :woman_shrugging:
We really just need to know if our data is lost or not so we can move on. :thought_balloon:

Hi there!

We added usage to volumes quite recently, and only for Machines; since these were Nomad volumes, we didn’t update their usage. Our volume list page should have been clearer, but the single-volume page does mention this.

Here’s our announcement:

@dwyl_auth support@fly.io has an autoresponder - if you didn’t see a reply please check your spam folder.

Can you try listing snapshots for a volume?

fly volume snapshots list vol_g2yxp4mllj6v63qd -a authprodb

If those are there, we can maybe try restoring from a snapshot.

  • Daniel

Thanks for your replies. :pray:

Output:

Snapshots
ID                 	SIZE    	CREATED AT
vs_gnkGR2jXV7yM3sqN	96479939	7 hours ago
vs_45J85kvMkzN12HkQ	96479939	1 day ago
vs_xRML07wyPwaRKtky	96479939	2 days ago
vs_zOJvpv3XlwLVjuM0	96479939	3 days ago
vs_nO36m9OV5L76qSR 	96479939	4 days ago
vs_QGPN6pQObpLQpUma	96479939	5 days ago
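
For a rough sense of scale, here is a minimal sketch that converts the SIZE value above into mebibytes. It assumes the SIZE column is a byte count, which the CLI output does not state, and bytes_to_mib is a hypothetical helper name, not a flyctl command.

```shell
# Assumption: the SIZE column of `fly volume snapshots list` is in bytes.
# bytes_to_mib is a hypothetical helper, not part of flyctl.
bytes_to_mib() {
  # Integer division, so the result is rounded down to whole MiB.
  echo $(( $1 / 1024 / 1024 ))
}

bytes_to_mib 96479939   # prints 92
```

If that assumption holds, each snapshot is roughly 92 MiB, which is at least the same order of magnitude as the 182 MB used by the staging database.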

Ran:

fly postgres create --fork-from authprodb:vol_g2yxp4mllj6v63qd

And that appeared to work.
But when attempting to view the actual Postgres data:

fly postgres connect -a authprodb2

List the databases:

\l
                                                List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
(4 rows)

I’d expect to see a DB named authprod in the list.
I’m confused. :man_shrugging:

For good measure I tried the same with the second volume:

fly postgres create --fork-from authprodb:vol_1g67340ggznrydxw

Output:

? Select VM size: shared-cpu-1x - CPU Kind: Shared, vCPUs: 1 Memory: 256MB
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app authprodb3...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6
Waiting for machine to start...
Machine e82d377a069728 is created
==> Monitoring health checks
  Waiting for e82d377a069728 to become healthy (started, 3/3)

Postgres cluster authprodb3 created
  Username:    postgres
  Password:    redacted
  Hostname:    authprodb3.internal
  Flycast:     fdaa:0:42c6:0:1::7
  Proxy port:  5432
  Postgres port:  5433
  Connection string: postgres://postgres:redacted@authprodb3.flycast:5432

Save your credentials in a secure place -- you won't be able to see them again!

Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string

Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/

Connect:

fly postgres connect -a authprodb3
\l
                                                List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
(4 rows)

These are not the tables we are looking for …

Nice meme!

So instead of forking the volume, can you try restoring from a snapshot?

It’ll be very similar to forking but… from a snapshot :slight_smile:

This is one command, in one line. It will create a new app - it won’t touch the existing volumes or snapshots.

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region CHOOSE --org your-org --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_yaddayaddayadda

One other question - this app does not appear to have been created with fly postgres, am I right?

Hi @roadmr thanks for your reply. :pray:
(apologies for delay in commenting; didn’t get notification for this …)

Thank you for your suggestion to create a new app:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region CHOOSE --org your-org --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_yaddayaddayadda

I re-ran the command to check the snapshots list:

fly volume snapshots list vol_g2yxp4mllj6v63qd

That got me the list:

Snapshots
ID                 	SIZE    	CREATED AT
vs_vYXP8wx6ezM53CR 	96479939	11 hours ago
vs_vnAvN76g36Apwt2A	96479939	1 day ago
vs_azyMXoAexXOG9tA7	96479939	2 days ago
vs_JjO3bbl7nbV2oSX5	96479939	3 days ago
vs_VD9Zm3LRVqQ6lfz7	96479939	4 days ago
vs_gnkGR2jXV7yM3sqN	96479939	5 days ago

Using the latest snapshot, vs_vYXP8wx6ezM53CR, I executed the suggested command:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine 9080577c1d2798 is created
==> Monitoring health checks
  Waiting for 9080577c1d2798 to become healthy (started, 0/3)
Error: context deadline exceeded

It errored twice. :cry:
But for some reason in the Web UI it still shows that it’s running:
https://fly.io/dashboard/dwyl-auth-546

This suggests that new-db was created (even though the CLI reported “Error: context deadline exceeded”…)

If I attempt to connect to new-db:

fly postgres connect -a new-db

I get the error:

Error: no active leader found

So guessing it’s not working. :man_shrugging:
Happy to run any other commands you suggest. :ok_hand:
Thanks again.

Still no progress. :hourglass_flowing_sand: :cry:
If anyone has experience with this please help. :pray:

If you’re willing to try more of a multi-step process, it should be possible to recover this via the old-school route: sftp → local pg → local pg_dump → pg_restore (into a fresh Fly cluster).

If you’ve never run Postgres locally before, though, there’s a bit of a learning curve…

@mayailurus Definitely willing to try anything as it will save us a huge headache. :cry:
If you can share a link to the steps very happy to try. :pray:

As far as I know, there isn’t a single list covering all these steps in one place, but I will try to rustle up subsets…

In the meantime, is your new-db machine still running? You can get a start on the first part by looking at volume configuration from the inside…

fly ssh console -a new-db
lsblk  # mount points
df -h  # sizes (GB used)

The idea will be to download the entire filesystem subtree corresponding to vdb into a single .zip file.

The next step, assuming that the mount point that you saw was /data and that the amount used was relatively small, would be…

fly sftp shell -a new-db
cd /
get data
^D

(That last line is Ctrl+D, the end-of-file/end-of-stream character.)

At this point you should have a file named data.zip on your local machine.

Warning: this .zip file may contain passwords and SSH private keys, so handle it with care!

Thus, you would be in the situation of the following Server Fault post:

https://serverfault.com/questions/336817/how-to-restore-a-file-system-level-copy-of-a-postgresql-database-not-dump-to-a

That’s for Windows instead of local Linux, and you would need to install your own Postgres binaries instead of having them in the .zip file, but the overall concepts would be the same.


(The gotcha with version mismatches that the answer mentions really is important, by the way. That’s one of the few things that I dislike about Postgres.)
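
As a rough outline of the "local pg → local pg_dump" part, here is a minimal sketch that only prints the commands so they can be reviewed first. It is untested against this data and rests on several assumptions: plan_local_dump is a hypothetical helper, the Postgres binaries on your PATH match the copied cluster's version, a postgres superuser exists in the copy, and authprod is only an example database name taken from earlier in this thread.

```shell
# Hypothetical helper: prints the commands for the "local pg -> pg_dump" route
# so they can be reviewed before running any of them by hand.
# Assumptions: matching Postgres binaries are on PATH, the data directory is a
# spare copy, and a "postgres" superuser exists in it.
plan_local_dump() {
  datadir="$1"   # extracted data directory, e.g. ./data/postgres
  port="$2"      # spare local port, so an existing local Postgres is not disturbed
  dbname="$3"    # database to dump, e.g. authprod
  # Postgres insists on restrictive permissions for its data directory.
  printf 'chmod 700 %s\n' "$datadir"
  # Start the server on the copied directory and wait until it is up.
  printf 'pg_ctl -D %s -o "-p %s" -w start\n' "$datadir" "$port"
  # Dump one database in custom format, suitable for pg_restore.
  printf 'pg_dump -p %s -U postgres -Fc -f %s.dump %s\n' "$port" "$dbname" "$dbname"
  # Stop the server again once the dump has been written.
  printf 'pg_ctl -D %s -w stop\n' "$datadir"
}

plan_local_dump ./data/postgres 5433 authprod
```

The copied postgresql.conf and pg_hba.conf may well need local adjustments before the server will start or accept the connection, so treat this as a starting point only.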

There may turn out to be some head-scratching over what exactly to use for -D… You may find that you have multiple Postgres clusters within that .zip file, the reason being that many Postgres containers have entrypoint magic that creates a completely new one if it doesn’t find exactly what it was looking for (like with obiwan, :sparkles:).

This would best be resolved by looking at the filesystem timestamps (back on new-db) and going with the oldest.

(The PG_VERSION files will tell you which exact version of Postgres you need to install, incidentally.)
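
A minimal sketch for locating those candidate clusters after unzipping, assuming the archive was extracted into a local directory such as ./data. list_pg_clusters is a hypothetical helper, and it treats any directory holding both a PG_VERSION file and a base subdirectory as a data directory.

```shell
# Hypothetical helper: list every Postgres data directory under a root,
# printing "<major version><TAB><path>" for each one found.
list_pg_clusters() {
  root="$1"
  find "$root" -type f -name PG_VERSION | while IFS= read -r f; do
    dir=$(dirname "$f")
    # Each database under base/<oid>/ has its own PG_VERSION file too;
    # only directories that contain "base" are whole clusters.
    if [ -d "$dir/base" ]; then
      printf '%s\t%s\n' "$(cat "$f")" "$dir"
    fi
  done
}
```

Running list_pg_clusters ./data prints one line per data directory, and each path it reports is a candidate for -D.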


Once you have things running in a local Postgres instance, you are mostly home free… Assuming, again, that this is a relatively small amount of data, it is said that it could be uploaded all in one fell swoop:

https://community.fly.io/t/how-copy-local-postgres-db-to-fly-io/13074/5

(If it was larger, then I would have doubts about how it would resume failed transfers, and the like.)

The -a target in this case would be a freshly created (by you) even-newer-db, or such, as alluded to above. You may want to attach even-newer-db to your application before beginning the import, to ensure that the expected roles are in place—and similar. (I admittedly haven’t investigated this aspect in any detail.)

Finally, a couple other wrinkles that you may or may not encounter:

https://community.fly.io/t/postgres-flex-database-postgres-has-a-collation-version-mismatch/14391

https://community.fly.io/t/how-to-retain-roles-owns-with-fly-pg-import/16156

Hope this helps!

Hi @mayailurus thanks very much for the helpful comments. :pray:

Given the error above, I had deleted new-db to avoid incurring unnecessary charges.
So I had to re-run the command to re-create new-db:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_jXlNXQLxgLDeXclV

When I attempt to run the suggested set of commands:

fly sftp shell -a new-db
cd /
get data

Got the following time-out error:

/data/postgres/backup_manifest (191856 bytes)
/data/postgres/pg_logical/replorigin_checkpoint (8 bytes)
/data/postgres/pg_stat/db_16386.stat (14990 bytes)
/data/postgres/pg_stat/db_0.stat (1670 bytes)
/data/postgres/pg_stat/db_13757.stat (8330 bytes)
/data/postgres/pg_stat/global.stat (1592 bytes)
/data/postgres/postgresql.conf (822 bytes)
/data/postgres/pg_wal/0000003A00000000000000B1 (16777216 bytes)
get //data -> data.zip: write /data/postgres/pg_wal/0000003B00000000000000B7: connection lost (wrote 3244032 bytes)
get //data -> data.zip: stat /data/postgres/pg_wal/0000003A00000000000000B0: connection lost
get //data -> data.zip: stat /data/postgres/pg_wal/0000003B00000000000000B3: connection lost
get //data -> data.zip: stat /data/postgres/pg_wal/0000003B.history: connection lost
get //data -> data.zip: stat /data/postgres/pg_wal/0000003A.history: connection lost
get //data -> data.zip: stat /data/postgres/pg_wal/0000003B00000000000000B6: connection lost
get //data -> data.zip: stat /data/postgres/pg_wal/archive_status: connection lost
get //data -> data.zip: walk: connection lost

We have a very good internet connection.
There’s no reason for the connection to be “lost” … :cry:

Will try again tomorrow morning. :crossed_fingers:

Same problem here: it seems all the volume snapshots are lost, as they all show the same size.

I tried many times to create a new volume and restore the snapshot, but the mount directory is empty.

This is such a disaster. I want to recover my data. Can any staff help us? :cry:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.