Point-in-time backups using Postgres and wal-g

Wal-g is a tool that lets you take ‘streaming backups’ of Postgres. This means that rather than having daily or hourly backups, where some data is lost if disaster ever strikes, you will lose nothing, or at most a few seconds of data.
Related to this, wal-g allows you to restore from such a backup using a ‘point-in-time restore’ (also known as ‘PITR’), allowing you to restore to the exact moment just before any database problems started.

Streaming backups and PITR can be a life-saver, especially for medium to large databases where taking logical backups with pg_dump/pg_restore starts to take so long that it interrupts the normal workings of the app, or where it is important that as little data as possible is lost in the case of a failure.

Recently a PR I made to add wal-g to Fly.io’s Postgres image was merged. This means that you can now make easy use of wal-g from within your Fly’s Postgres images! :partying_face:


WAL-based PITR DB backups on Fly.io

I do need to disclaim that I am still fairly new to streaming backups, wal-g and stolon (the tool Fly uses to run Postgres in a highly-available cluster), so while I have written up a little guide which seems to work in the testing I have done so far, there might be things that are suboptimal. Definitely thoroughly test this yourself if you use it!

Setup

1. Ensure you are running a Fly Postgres cluster

If you do not have a cluster yet, create one using fly pg create. Write down the connection settings, because you’ll need them immediately!

If you already were running a cluster, run fly image update --app=your-appname-db to update your DB cluster. (This will restart your DB). wal-g was added as part of image release v0.0.26.

I strongly recommend testing this out first on a new DB, and only once you understand how it works, test it on your ‘real’ DB.

1b. Pull down your DB-app’s fly.toml

Afterwards, make sure that you download the DB-app’s fly.toml because we need to make some changes. I recommend doing this in a separate git repository, or in a subfolder of your main app’s git repo. This folder will be called your ‘DB-app directory’ in the next steps. Run fly config save --app=your-appname-db.

1c. Scale down your cluster to a single running machine

fly scale count 1
This will make interacting with the cluster easier when we are making changes to the Stolon configuration later on.
Once we’re done with the setup we’ll scale it back up.

2. Setup app settings

From within the ‘DB-app’ directory, add the following ‘general’ env variables to the [env] section in the fly.toml:

ENABLE_WALG = "true"
PRIMARY_REGION = "ams" # The region you envision as the primary location for your DB
PGHOST = "localhost"
PGPORT = "5432"
PGUSER = "postgres"

Of course, keep passwords out of your fly.toml, and set them as a secret (as part of the next step).

3. Add secrets for Postgres, Stolon and Wal-g to be able to read/write to the database.

Generate secure passwords for these, and write them down in a secure place (e.g. a password manager)
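One quick way to generate such passwords (a sketch, assuming the openssl CLI is installed) is:

```shell
# Generate a 32-character, URL-safe random password.
# (assumes the openssl CLI is available)
openssl rand -base64 24 | tr '+/' '-_'
```

Run it once per password you need.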

OPERATOR_PASSWORD and PGPASSWORD need to be the same.

BEWARE: If you’re using an existing cluster, SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD are probably already set. If so, you don’t want to overwrite them.

fly secrets set SU_PASSWORD=correcthorse REPL_PASSWORD=batterystaple OPERATOR_PASSWORD=supersecret PGPASSWORD=supersecret

NOTE: Store all of these passwords in a safe place! You will need them again during a restore!

In this step, also set any backup storage-specific passwords.

4. Deploy your changes to the fly.toml and secrets.

fly deploy
fly status

5. Update Postgres configuration settings to ensure wal-g backups are enabled

Now we need to make a small change to the Postgres configuration that is managed by Stolon.
This configuration is stored on the database ‘volume’ and synced throughout your database cluster by Stolon, so you only need to make this change on one of your DB machines, and it will persist across restarts.

This change will tell Stolon to enable WAL archiving, calling wal-g’s ‘wal-push’ command for every completed WAL segment and forcing a segment switch at least once every 60 seconds while there is database activity.

fly ssh console # in your DB-app directory
export $(cat /data/.env | xargs)
stolonctl update --patch '{"pgParameters": { "archive_mode": "yes", "archive_timeout": "60", "archive_command": "wal-g wal-push %p"}}'

You’ll be able to see in the logs (fly logs) whether everything is working as intended or whether the configuration needs changing.

6. Scale up

If it’s a new cluster

Add a replica in the same region (the ‘highly-available’ replica)

fly volumes create pg_data --region ams --size 10
fly scale count 2
fly status

and then add a replica in another region (a ‘read-replica’)

fly volumes create pg_data --region fra --size 10
fly scale count 3
fly status

If it’s an existing cluster

Just a simple fly scale count 3 or whatever your original count was should do.

7. If it’s a new cluster: Fill the database with some data

You probably know how to do this already :stuck_out_tongue_winking_eye:.

8. Make your first ‘traditional’ backup

NOTE: The indicated path /data/postgres is the same for all VMs that use Fly’s Postgres image

fly ssh console
wal-g backup-push /data/postgres

WARNING: You probably want to set up a cronjob to perform ‘traditional’ backups regularly, so that point-in-time restores can be faster (starting from the latest ‘traditional’ backup rather than from this initial one). This is not covered in this guide.
(You can do so by re-running the same command in the future, either manually or by e.g. adding a cronjob. Adding a cronjob inside the Fly Postgres image requires changing its Dockerfile, unfortunately. Alternatively, you could of course run a separate Fly app containing such a cronjob, able to connect to this DB with the same wal-g configuration settings. There is definite room for improvement here.)
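For illustration, a crontab entry on such a separate machine (with wal-g installed and the same environment variables configured) might look like this; the schedule and log file path are just example values:

```shell
# Example crontab entry: take a fresh base backup every night at 03:00.
# /data/postgres is the data directory used by Fly's Postgres image.
0 3 * * * wal-g backup-push /data/postgres >> /var/log/walg-backup.log 2>&1
```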

9. Check whether backups are working correctly

fly ssh console
wal-g backup-list
wal-g wal-show
wal-g wal-verify integrity
wal-g wal-verify timeline

Besides this check, it is highly recommended to periodically test whether your ‘restoring from the backups’ flow is still working correctly.

Restoring from the backups

1. Preparation: scale cluster down to a single postgres instance.

Alternatively, create a new cluster to restore into, using steps 1-4 above.

fly scale count 1

2. Re-init Stolon in ‘pitr’ mode.

If you want to restore the ‘latest’ backup:

fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": {"dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

If you want to restore a particular point in time:

  • Replace the recoveryTargetTime with the desired time in the command below.
    Note that postgres only accepts this particular format of ISO8601 timestamps, and only with the timezone part set to UTC+00.
fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": { "recoveryTargetSettings": {"recoveryTargetTime": "2022-05-07 20:34:02.998+00"}, "dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'
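If you need a recoveryTargetTime in exactly this format for the current moment (and then adjust it by hand as needed), GNU date can produce one; note the %3N milliseconds specifier is GNU-specific:

```shell
# Print the current UTC time in the timestamp format accepted above,
# in the shape "2022-05-07 20:34:02.998+00" (requires GNU date for %3N).
date -u '+%F %H:%M:%S.%3N+00'
```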

3. Keep track of reinitialization

using fly logs and other metrics at your disposal.

If it seems that Stolon is stuck in a ‘boot loop’ after the call to stolonctl init, kill the ‘stolon-keeper’ OS process and try from step 2 once more.
(ps aux | grep stolon will tell you its PID, and then kill PID to kill it.)

4. Do not forget to scale up again if the restore is successful

fly scale count 3

And, if you created a new cluster, now enable backups for this new cluster (step 5 of cluster creation).


Congrats, you’re done!

Let me know how you fare.
Feedback on these steps is of course very welcome.


One suggested change: If you’re using an existing cluster, SU_PASSWORD REPL_PASSWORD OPERATOR_PASSWORD are already set, and you don’t want to overwrite them.

Otherwise, this is great and I’m thrilled at how easy it was. Thank you!


I’ve changed it in the guide. Thanks for the tip! :+1:

You’re very welcome. :blush:

This is great, but for some reason the initial backup with wal-g backup-push $PGDATA fails with:

Connect: postgres connection failed: unexpected message type

The keeper logs show the following:

FATAL: password authentication failed for user “postgres”

I tried export PGPASS=$SU_PASSWORD but that didn’t seem to help. Any ideas on how to take an initial snapshot?

Strangely enough the WAL backups are working just fine!

I tried export PGPASS=$SU_PASSWORD but that didn’t seem to help. Any ideas on how to take an initial snapshot?

Try prepending it to your command.

E.G.
PG_PASS=$SU_PASSWORD wal-g backup-push /data/postgres

No Joy, but it shouldn’t need the password as far as I can tell as it’s designed to operate on the $PGDATA directory directly, so not sure why it’s trying to connect to pg?

I figured it out, it’s the OPERATOR_PASSWORD that’s required, not the SU_PASSWORD


Thank you very much for this @qqwy !!

Newbie question; Does it only work for replicated clusters or can I use it with a single postgres node?

The backup process is separate from replication (both use the WAL, the write-ahead log, as their data source, but they do not depend on one another at all). You can use it in exactly the same way with a single Postgres node.


After the PITR has finished running, does the cluster need to be re-initialized from /fly/cluster-spec.json? It seems that after the recovery is done we get errors about flypgadmin not being able to auth, and any VM restart causes the PITR to be restarted.

2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.436 UTC [872] LOG:  selected new timeline ID: 3
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | ERROR: 2022/09/23 13:15:28.457109 Archive '0000000200000034000000AB' does not exist.
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   |
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.467 UTC [872] LOG:  archive recovery complete
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | ERROR: 2022/09/23 13:15:28.764649 Archive '00000002.history' does not exist.
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.886 UTC [871] LOG:  database system is ready to accept connections
2022-09-23T13:15:29Z app[94abeaf7] fra [info]checking stolon status
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.393 UTC [1685] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.393 UTC [1685] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.599 UTC [1686] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.599 UTC [1686] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.806 UTC [1687] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.806 UTC [1687] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.876 UTC [1688] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.876 UTC [1688] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"

Wondering what’s going on here, do some passwords need changing… psql doesn’t work and says the server closed the connection.

The error about flypgadmin repeats, a lot, pretty much as fast as my terminal can scroll…

Right, I’ve managed to connect to postgres directly via port 5433 (as haproxy still has the server as unhealthy, presumably because flypgadmin can’t connect) – which password do I need to set for flypgadmin (e.g. OPERATOR_PASSWORD, SU_PASSWORD or REPL_PASSWORD)?

Answer: It’s SU_PASSWORD


This is fantastic @qqwy! Thanks very much for the guide :slight_smile:

If, like me, you had created a cluster using fly pg create, you can deploy the new [env] values using fly deploy -i flyio/postgres:14.4. Otherwise I got Error failed to fetch an image or build from source: app does not have a Dockerfile or buildpacks configured. See https://fly.io/docs/reference/configuration/#the-build-section. I used the version specified in the postgres-ha Dockerfile.

I also had to use PGPASSWORD, NOT PG_PASS or PGPASS (maybe depends on PG version, I’m using 14). Edit: I missed in the OP that PGPASSWORD should have been set as a secret - I skipped this because the new cluster had the other secrets already set and it gets cut off by the scroll bar.


If it helps anyone, there are some additional steps required when restoring to a new cluster. After executing the stolonctl init command (also doing all the other previous setup on the new cluster) you will get constant errors in the logs:

FATAL: password authentication failed for user "flypgadmin"

To fix, you need to update the SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD secret values in the new cluster to match the old cluster. However it looks like the only way to get SU_PASSWORD and REPL_PASSWORD is to ssh into an instance of the old cluster and manually extract them (OPERATOR_PASSWORD is provided when first creating a cluster). So make sure to store them somewhere accessible!

After updating the secrets I can query the new cluster and see all the previous data!

Edit: Oops, this information is basically a duplicate of what @LeoAdamek has written above


There is a little trick you can use to sync your credentials when restoring into a new cluster. If you set the FLY_RESTORED_FROM environment value to a random value, it will boot Postgres as if it were performing a restore from snapshot. This process should automatically reconfigure the flypgadmin, repl_user and postgres user credentials to match the credentials defined within the environment.


I end up with the fly app dashboard showing Failed due to unhealthy allocations - no stable job version to auto revert to.

fly status shows the postgres app to be in an error role. But wal-g does seem to be functioning, and I can fly ssh console successfully. Are these errors normal for now?

I have tried replicating these steps on a new Postgres v2 (machine-powered) cluster, to be able to update the guide with the changes required when working with machines.

However, it currently does not work. When adding the PGUSER and PGPASSWORD environment variables that are necessary to work with wal-g, the machine-based cluster starts misbehaving. Specifically, adding new nodes to the cluster is no longer possible.
I have added an issue to the Fly postgres-ha repo for this.

After a hectic Christmas time, I finally had some time now to revisit this.
Here is a complete guide to backup-and-restore on V2 of Fly’s architecture.

Steps to set up wal-g based point-in-time backups for Fly’s V2 postgres architecture

As starting point, this guide assumes you have a Fly postgres cluster running.

Set up env vars

Add the following environment variables (as secrets; we do not have a separate fly.toml for the DB app after all).

First write them out in a separate temporary file, so you can apply them all at once later (see below).

Preparing general settings:

  • Make a note of the $REPL_PASSWORD value inside the cluster (fly ssh console --app=your-db-app-name and then printenv | sort)
  • Fill it in instead of "REPL_PASSWORD_HERE":
 ENABLE_WALG = "true"
 PGHOST = "localhost"
 PGPORT = "5432"
 PGUSER = "repluser"
 PGPASSWORD = "REPL_PASSWORD_HERE"

Preparing storage specific settings

Depending on what backup storage you use, you’ll need specific extra connection settings.
Details: wal-g/STORAGES.md at master · wal-g/wal-g · GitHub

As an example, for backing up to AWS S3 you’ll need:

AWS_ACCESS_KEY_ID="123451234512345"
AWS_SECRET_ACCESS_KEY="YOUR_SUPER_SECRET_TOKEN"
WALG_S3_PREFIX="s3://name-of-your-s3-bucket/some_folder"

Applying these settings

I recommend first making a file containing each FOO="bar" setting on its own line,
and, once you are happy with it, adding a \ at the end of every line (except the last).
You can then copy-paste this after fly secrets set --app=your-db-app-name, e.g.:

fly secrets set --app=your-db-app-name \
ENABLE_WALG="true" \
PGHOST="localhost" \
PGPORT="5432" \
...
WALG_S3_PREFIX="s3://foo/bar/baz"

and all secrets will be applied at once (and your DB cluster will only restart once, so downtime should be minimal).
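Alternatively, as long as none of your values contain spaces, you can let xargs splice the file into a single invocation instead of adding backslashes by hand (walg.env is a hypothetical filename here, and the echo acts as a dry run):

```shell
# walg.env holds one KEY="value" setting per line (hypothetical example file).
printf 'ENABLE_WALG="true"\nPGHOST="localhost"\n' > walg.env

# xargs splices all lines into one invocation, stripping the double quotes.
# Remove the `echo` to actually run `fly secrets set`.
xargs echo fly secrets set --app=your-db-app-name < walg.env
# prints: fly secrets set --app=your-db-app-name ENABLE_WALG=true PGHOST=localhost
```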

Be sure to run fly status --watch --app=your-db-app-name
in a separate terminal window to see if the deploy is going well.

And of course keep an eye on the logs.
If all went well, ssh to your database cluster
and look at the output of stolonctl status.
It prints a table of all machines in your DB cluster. Each row should be ‘PG healthy’.

I recommend you save the file with all these settings for later (potentially with all passwords/secrets anonymized) so you will know how to set up a new cluster with exactly the same settings. This is important for swiftly restoring from a backup.
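For the anonymizing part, a small sed substitution can blank out all values while keeping the keys (a sketch; it assumes every setting is on its own KEY="value" line):

```shell
# Replace every quoted value with "REDACTED", keeping keys and layout intact.
printf 'PGUSER="postgres"\nAWS_SECRET_ACCESS_KEY="abc123"\n' \
  | sed -E 's/=".*"/="REDACTED"/'
# prints:
# PGUSER="REDACTED"
# AWS_SECRET_ACCESS_KEY="REDACTED"
```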

Perform your first manual backup

Now to perform your first manual backup:

PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g backup-push /data/postgres

If there are any connection errors, (because e.g. you misspelled a password token in one of the secrets), you’ll find out now.

  • If all goes well, you’ll see something along the lines of ‘finished writing part X’.
  • You can double-check whether the backup went well by going to your backup storage location (e.g. the AWS S3 bucket) and looking whether it now contains a folder called basebackups_005 and a folder called wal_005.
  • As extra checks, run any of the following:
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g backup-list
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g wal-show
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g wal-verify integrity
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g wal-verify timeline
    

I recommend setting up a periodic task (‘cronjob’) from inside one of your app servers (or possibly a separate tiny Fly VM) to run this wal-g backup-push command every so often (e.g. every few days): the more recent your base backup snapshot, the faster the backup-restoration process will be.

Performing a restore

Create restore cluster

First, make a new cluster. Start with just a single node, as this way there will not be race conditions where some nodes still try to replicate the old DB while you reset the cluster.
The easiest way to do this is to:

  • fly pg create
  • fill in app name
  • select ‘custom configuration’
  • 1 node
  • other settings to whatever you fancy.
    I recommend using the same settings as your original cluster.
    Note that the shared-CPU/256MB configuration does not have enough RAM to perform the restore process, at least in my testing. (You can always change it back down to this setting later if you really want.)

Set up env vars

Now, add secrets to this new cluster. This is similar to the step above (and I recommend re-using the file you saved earlier for this), except:

  • Do not add the ENABLE_WALG line!
  • Add values for SU_PASSWORD, OPERATOR_PASSWORD and REPL_PASSWORD so they match the values in the original cluster.
  • Add FLY_RESTORED_FROM="thevaluedoesnotmatter". This will ensure that at the next restart, Fly will reconfigure the flypgadmin, repl_user and postgres user credentials inside the database to match the secrets you just set.

Check if this deploy goes well using fly status and fly logs.

Execute restore

Now, scale down the cluster so you only have 1 instance running.
If you are running a ‘dev mode’ cluster this is already the case.
Otherwise, run fly machine list --app=your-new-cluster-name
followed by a fly machine stop --app=your-new-cluster-name machine_id for all but one of the machines.

This will make sure that the replication process does not interfere with the restore (which otherwise can cause annoying race conditions).

Now in a fly ssh console --app=your-new-cluster-name terminal window, run:

If you want to restore to the ‘latest’ backup:

export $(cat /data/.env | xargs)
stolonctl init --yes '{ "initMode": "pitr", "pitrConfig": {"dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

If you instead want to restore a particular point in time:

  • Replace the recoveryTargetTime with the desired time in the command below.
    Note that postgres only accepts this particular format of ISO8601 timestamps, and only with the timezone part set to UTC+00.
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": { "recoveryTargetSettings": {"recoveryTargetTime": "2022-05-07 20:34:02.998+00"}, "dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

Check whether the restore goes according to plan by looking at fly logs.

Make restore cluster your new ‘real’ cluster

If you are happy with the restore, there are only a few small things left to do:

  • If your old cluster is still operational, stop its backup streaming by running fly secrets unset ENABLE_WALG --app=your-old-cluster-name. Or feel free to stop/remove the cluster altogether, of course.
  • fly secrets set ENABLE_WALG="true" --app=your-new-cluster-name. From this point onward, the new cluster will start writing to the backup store.
  • Scale up the cluster to more than 1 node for high-availability: fly machines clone --app=your-new-cluster-name machine_id_of_first_machine
  • Finally, change the DATABASE_URL setting that you use inside your app(s) to point to the new cluster.