Point-in-time backups using Postgres and wal-g

Wal-g is a tool that lets you take ‘streaming backups’ of Postgres. This means that rather than having daily or hourly backups, where some data is lost if disaster ever strikes, you will lose nothing, or at most a few seconds of data.
Related to this, wal-g allows you to restore from such a backup using a ‘point-in-time restore’ (also known as ‘PITR’), allowing you to restore to the exact moment just before any database problems started.

Streaming backups and PITR can be a life-saver, especially for medium to large databases where taking logical backups with pg_dump/pg_restore starts to take so long that it interrupts the normal workings of the app, or where it is important that as little data as possible is lost in the case of a failure.

Recently a PR I made to add wal-g to Fly.io’s Postgres image was merged. This means that you can now make easy use of wal-g from within your Fly’s Postgres images! :partying_face:


WAL-based PITR DB backups on Fly.io

I do need to disclaim that I am still fairly new to streaming backups, wal-g and stolon (the tool Fly uses to run Postgres in a highly-available cluster), so while I have written up a little guide which seems to work in the testing I have done so far, there might be things that are suboptimal. Definitely thoroughly test this yourself if you use it!

Setup

1. Ensure you are running a Fly Postgres cluster

If you do not have a cluster yet, create one using fly pg create. Write down the connection settings, because you’ll need them immediately!

If you already were running a cluster, run fly image update --app=your-appname-db to update your DB cluster. (This will restart your DB). wal-g was added as part of image release v0.0.26.

I strongly recommend testing this out first on a new DB, and only once you understand how it works, test it on your ‘real’ DB.

1b. Pull down your DB-app’s fly.toml

Afterwards, make sure that you download the DB-app’s fly.toml because we need to make some changes. I recommend doing this in a separate git repository, or in a subfolder of your main app’s git repo. This folder will be called your ‘DB-app directory’ in the next steps. Run fly config save --app=your-appname-db.

1c. Scale down your cluster to a single running machine

fly scale count 1
This will make interacting with the cluster easier when we are making changes to the Stolon configuration later on.
Once we’re done with the setup we’ll scale it back up.

2. Setup app settings

From within the ‘DB-app’ directory, add the following ‘general’ env variables to the [env] section in the fly.toml:

ENABLE_WALG = "true"
PRIMARY_REGION = "ams" # The region you envision as the primary location for your DB
PGHOST = "localhost"
PGPORT = "5432"
PGUSER = "postgres"

Of course, keep passwords out of your fly.toml, and set them as a secret (as part of the next step).

3. Add secrets for Postgres, Stolon and Wal-g to be able to read/write to the database.

Generate secure passwords for these, and write them down in a secure place (e.g. a password manager)
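One quick way to generate such passwords (a sketch, assuming the openssl CLI is installed) is:

```shell
# Generate a 32-character, URL-safe random password.
# (assumes the openssl CLI is available)
openssl rand -base64 24 | tr '+/' '-_'
```

Run it once per password you need.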

OPERATOR_PASSWORD and PGPASSWORD need to be the same.

BEWARE: If you’re using an existing cluster, SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD are probably already set. If so, you don’t want to overwrite them.

fly secrets set SU_PASSWORD=correcthorse REPL_PASSWORD=batterystaple OPERATOR_PASSWORD=supersecret PGPASSWORD=supersecret

NOTE: Store all of these passwords in a safe place! You will need them again during a restore!

In this step, also set any backup storage-specific passwords.

4. Deploy your changes to the fly.toml and secrets.

fly deploy
fly status

5. Update Postgres configuration settings to ensure wal-g backups are enabled

Now we need to make a small change to the Postgres configuration that is managed by Stolon.
This configuration is stored on the database ‘volume’ and synced throughout your database cluster by Stolon, so you only need to make this change on one of your DB machines, and it will persist across restarts.

This change will tell Stolon to enable WAL archiving, calling wal-g’s ‘wal-push’ command for every completed WAL segment and forcing a segment switch at least once every 60 seconds while there is database activity.

fly ssh console # in your DB-app directory
export $(cat /data/.env | xargs)
stolonctl update --patch '{"pgParameters": { "archive_mode": "yes", "archive_timeout": "60", "archive_command": "wal-g wal-push %p"}}'

You’ll be able to see in the logs (fly logs) whether everything is working as intended or whether the configuration needs changing.

6. Scale up

If it’s a new cluster

Add a replica in the same region (the ‘highly-available’ replica)

fly volumes create pg_data --region ams --size 10
fly scale count 2
fly status

and then add a replica in another region (a ‘read-replica’)

fly volumes create pg_data --region fra --size 10
fly scale count 3
fly status

If it’s an existing cluster

Just a simple fly scale count 3 or whatever your original count was should do.

7. If it’s a new cluster: Fill the database with some data

You probably know how to do this already :stuck_out_tongue_winking_eye:.

8. Make your first ‘traditional’ backup

NOTE: The indicated path /data/postgres is the same for all VMs that use Fly’s Postgres image

fly ssh console
wal-g backup-push /data/postgres

WARNING: You probably want to set up a cronjob to perform ‘traditional’ backups regularly, so that point-in-time restores can be faster (starting from the latest ‘traditional’ backup rather than from this initial one). This is not covered in this guide.
(You can do so by re-running the same command in the future, either manually or by e.g. adding a cronjob. Adding a cronjob inside the Fly Postgres image requires changing its Dockerfile, unfortunately. Alternatively, you could of course run a separate Fly app containing such a cronjob, able to connect to this DB with the same wal-g configuration settings. There is definite room for improvement here.)
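For illustration, a crontab entry on such a separate machine (with wal-g installed and the same environment variables configured) might look like this; the schedule and log file path are just example values:

```shell
# Example crontab entry: take a fresh base backup every night at 03:00.
# /data/postgres is the data directory used by Fly's Postgres image.
0 3 * * * wal-g backup-push /data/postgres >> /var/log/walg-backup.log 2>&1
```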

9. Check whether backups are working correctly

fly ssh console
wal-g backup-list
wal-g wal-show
wal-g wal-verify integrity
wal-g wal-verify timeline

Besides this check, it is highly recommended to periodically test whether your ‘restoring from the backups’ flow is still working correctly.

Restoring from the backups

1. Preparation: scale cluster down to a single postgres instance.

Alternatively, create a new cluster to restore into, using steps 1-4 above.

fly scale count 1

2. Re-init Stolon in ‘pitr’ mode.

If you want to restore the ‘latest’ backup:

fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": {"dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

If you want to restore a particular point in time:

  • Replace the recoveryTargetTime with the desired time in the command below.
    Note that postgres only accepts this particular format of ISO8601 timestamps, and only with the timezone part set to UTC+00.
fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": { "recoveryTargetSettings": {"recoveryTargetTime": "2022-05-07 20:34:02.998+00"}, "dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'
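If you need a recoveryTargetTime in exactly this format for the current moment (and then adjust it by hand as needed), GNU date can produce one; note the %3N milliseconds specifier is GNU-specific:

```shell
# Print the current UTC time in the timestamp format accepted above,
# in the shape "2022-05-07 20:34:02.998+00" (requires GNU date for %3N).
date -u '+%F %H:%M:%S.%3N+00'
```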

3. Keep track of reinitialization

using fly logs and other metrics at your disposal.

If it seems that Stolon is stuck in a ‘boot loop’ after the call to stolonctl init, kill the ‘stolon-keeper’ OS process and try from step 2 once more.
(ps aux | grep stolon will tell you its PID, and then kill PID to kill it.)

4. Do not forget to scale up again if the restore is successful

fly scale count 3

And, if you created a new cluster, now enable backups for this new cluster (step 5 of cluster creation).


Congrats, you’re done!

Let me know how you fare.
Feedback on these steps is of course very welcome.


One suggested change: If you’re using an existing cluster, SU_PASSWORD REPL_PASSWORD OPERATOR_PASSWORD are already set, and you don’t want to overwrite them.

Otherwise, this is great and I’m thrilled at how easy it was. Thank you!


I’ve changed it in the guide. Thanks for the tip! :+1:

You’re very welcome. :blush:

This is great, but for some reason the initial backup with wal-g backup-push $PGDATA fails with:

Connect: postgres connection failed: unexpected message type

The keeper logs show the following:

FATAL: password authentication failed for user “postgres”

I tried export PGPASS=$SU_PASSWORD but that didn’t seem to help. Any ideas on how to take an initial snapshot?

Strangely enough the WAL backups are working just fine!

I tried export PGPASS=$SU_PASSWORD but that didn’t seem to help. Any ideas on how to take an initial snapshot?

Try prepending it to your command.

E.G.
PG_PASS=$SU_PASSWORD wal-g backup-push /data/postgres

No Joy, but it shouldn’t need the password as far as I can tell as it’s designed to operate on the $PGDATA directory directly, so not sure why it’s trying to connect to pg?

I figured it out, it’s the OPERATOR_PASSWORD that’s required, not the SU_PASSWORD


Thank you very much for this @qqwy !!

Newbie question; Does it only work for replicated clusters or can I use it with a single postgres node?

The backup process is separate from replication (both use the WAL, the write-ahead log, as their data source, but they do not depend on one another at all). You can use it in exactly the same way with a single Postgres node.


After the PITR has finished running, does the cluster need to be re-initialized from /fly/cluster-spec.json? It seems that after the recovery is done we get errors about flypgadmin not being able to auth, and any VM restart causes the PITR to be restarted.

2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.436 UTC [872] LOG:  selected new timeline ID: 3
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | ERROR: 2022/09/23 13:15:28.457109 Archive '0000000200000034000000AB' does not exist.
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   |
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.467 UTC [872] LOG:  archive recovery complete
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | ERROR: 2022/09/23 13:15:28.764649 Archive '00000002.history' does not exist.
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.886 UTC [871] LOG:  database system is ready to accept connections
2022-09-23T13:15:29Z app[94abeaf7] fra [info]checking stolon status
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.393 UTC [1685] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.393 UTC [1685] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.599 UTC [1686] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.599 UTC [1686] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.806 UTC [1687] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.806 UTC [1687] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.876 UTC [1688] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.876 UTC [1688] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"

Wondering what’s going on here, do some passwords need changing… psql doesn’t work and says the server closed the connection.

The error about flypgadmin repeats, a lot, pretty much as fast as my terminal can scroll…

Right, I’ve managed to connect to postgres directly via port 5433 (as haproxy still has the server as unhealthy, presumably because flypgadmin can’t connect) – which password do I need to set for flypgadmin (e.g. OPERATOR_PASSWORD, SU_PASSWORD or REPL_PASSWORD)?

Answer: It’s SU_PASSWORD


This is fantastic @qqwy! Thanks very much for the guide :slight_smile:

If, like me, you had created a cluster using fly pg create, you can deploy the new [env] values using fly deploy -i flyio/postgres:14.4. Otherwise I got Error failed to fetch an image or build from source: app does not have a Dockerfile or buildpacks configured. See https://fly.io/docs/reference/configuration/#the-build-section. I used the version specified in the postgres-ha Dockerfile.

I also had to use PGPASSWORD, NOT PG_PASS or PGPASS (maybe depends on PG version, I’m using 14). Edit: I missed in the OP that PGPASSWORD should have been set as a secret - I skipped this because the new cluster had the other secrets already set and it gets cut off by the scroll bar.


If it helps anyone, there are some additional steps required when restoring to a new cluster. After executing the stolonctl init command (also doing all the other previous setup on the new cluster) you will get constant errors in the logs:

FATAL: password authentication failed for user "flypgadmin"

To fix, you need to update the SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD secret values in the new cluster to match the old cluster. However it looks like the only way to get SU_PASSWORD and REPL_PASSWORD is to ssh into an instance of the old cluster and manually extract them (OPERATOR_PASSWORD is provided when first creating a cluster). So make sure to store them somewhere accessible!

After updating the secrets I can query the new cluster and see all the previous data!

Edit: Oops, this information is basically a duplicate of what @LeoAdamek has written above


There is a little trick you can use to sync your credentials when restoring into a new cluster. If you set the FLY_RESTORED_FROM environment value to a random value, it will boot Postgres as if it were performing a restore from snapshot. This process should automatically reconfigure the flypgadmin, repl_user and postgres user credentials to match the credentials defined within the environment.


I end up with the fly app dashboard showing Failed due to unhealthy allocations - no stable job version to auto revert to.

fly status shows the postgres app to be in an error role. But wal-g does seem to be functioning, and I can fly ssh console successfully. Are these errors normal for now?

I have tried replicating these steps on a new Postgres v2 (machine-powered) cluster, to be able to update the guide with the changes required when working with machines.

However, it currently does not work. When adding the PGUSER and PGPASSWORD environment variables that are necessary to work with wal-g, the machine-based cluster starts misbehaving. Specifically, adding new nodes to the cluster is no longer possible.
I have added an issue to the Fly postgres-ha repo for this.

After a hectic Christmas time, I finally had some time now to revisit this.
Here is a complete guide to backup-and-restore on V2 of Fly’s architecture.

Steps to set up wal-g based point-in-time backups for Fly’s V2 postgres architecture

As starting point, this guide assumes you have a Fly postgres cluster running.

Set up env vars

Add the following environment variables (as secrets; we do not have a separate fly.toml for the DB app after all).

First write them out in a separate temporary file, so you can apply them all at once later (see below).

Preparing general settings:

  • Make a note of the $REPL_PASSWORD value inside the cluster (fly ssh console --app=your-db-app-name and then printenv | sort)
  • Fill it in instead of "REPL_PASSWORD_HERE":
 ENABLE_WALG = "true"
 PGHOST = "localhost"
 PGPORT = "5432"
 PGUSER = "repluser"
 PGPASSWORD = "REPL_PASSWORD_HERE"

Preparing storage specific settings

Depending on what backup storage you use, you’ll need specific extra connection settings.
Details: wal-g/STORAGES.md at master · wal-g/wal-g · GitHub

As an example, for backing up to AWS S3 you’ll need:

AWS_ACCESS_KEY_ID="123451234512345"
AWS_SECRET_ACCESS_KEY="YOUR_SUPER_SECRET_TOKEN"
WALG_S3_PREFIX="s3://name-of-your-s3-bucket/some_folder"

Applying these settings

I recommend first making a file containing each FOO="bar" setting on its own line,
and, once you are happy with it, adding a \ at the end of every line (except the last).
You can then copy-paste this after fly secrets set --app=your-db-app-name, e.g.:

fly secrets set --app=your-db-app-name \
ENABLE_WALG="true" \
PGHOST="localhost" \
PGPORT="5432" \
...
WALG_S3_PREFIX="s3://foo/bar/baz"

and all secrets will be applied at once (and your DB cluster will only restart once, so downtime should be minimal).
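Alternatively, as long as none of your values contain spaces, you can let xargs splice the file into a single invocation instead of adding backslashes by hand (walg.env is a hypothetical filename here, and the echo acts as a dry run):

```shell
# walg.env holds one KEY="value" setting per line (hypothetical example file).
printf 'ENABLE_WALG="true"\nPGHOST="localhost"\n' > walg.env

# xargs splices all lines into one invocation, stripping the double quotes.
# Remove the `echo` to actually run `fly secrets set`.
xargs echo fly secrets set --app=your-db-app-name < walg.env
# prints: fly secrets set --app=your-db-app-name ENABLE_WALG=true PGHOST=localhost
```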

Be sure to run fly status --watch --app=your-db-app-name
in a separate terminal window to see if the deploy is going well.

And of course keep an eye on the logs.
If all went well, ssh to your database cluster
and look at the output of stolonctl status.
It prints a table of all machines in your DB cluster. Each row should be ‘PG healthy’.

I recommend you save the file with all these settings for later (potentially with all passwords/secrets anonymized) so you will know how to set up a new cluster with exactly the same settings. This is important for swiftly restoring from a backup.
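For the anonymizing part, a small sed substitution can blank out all values while keeping the keys (a sketch; it assumes every setting is on its own KEY="value" line):

```shell
# Replace every quoted value with "REDACTED", keeping keys and layout intact.
printf 'PGUSER="postgres"\nAWS_SECRET_ACCESS_KEY="abc123"\n' \
  | sed -E 's/=".*"/="REDACTED"/'
# prints:
# PGUSER="REDACTED"
# AWS_SECRET_ACCESS_KEY="REDACTED"
```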

Perform your first manual backup

Now to perform your first manual backup:

PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g backup-push /data/postgres

If there are any connection errors, (because e.g. you misspelled a password token in one of the secrets), you’ll find out now.

  • If all goes well, you’ll see something along the lines of ‘finished writing part X’.
  • You can double-check whether the backup went well by going to your backup storage location (e.g. the AWS S3 bucket) and looking whether it now contains a folder called basebackups_005 and a folder called wal_005.
  • As extra checks, run any of the following:
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g backup-list
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g wal-show
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g wal-verify integrity
    PGUSER=postgres PGPASSWORD=$OPERATOR_PASSWORD wal-g wal-verify timeline
    

I recommend setting up a periodic task (‘cronjob’) from inside one of your app servers (or possibly a separate tiny Fly VM) to run this wal-g backup-push command every so often (e.g. every few days): the more recent your base backup snapshot, the faster the backup-restoration process will be.

Performing a restore

Create restore cluster

First, make a new cluster. Start with just a single node, as this way there will not be race conditions where some nodes still try to replicate the old DB while you reset the cluster.
The easiest way to do this is to:

  • fly pg create
  • fill in app name
  • select ‘custom configuration’
  • 1 node
  • other settings to whatever you fancy.
    I recommend using the same settings as your original cluster.
    Note that the shared-CPU/256MB configuration does not have enough RAM to perform the restore process, at least in my testing. (You can always change it back down to this setting later if you really want.)

Set up env vars

Now, add secrets to this new cluster. This is similar to the step above (and I recommend re-using the file you saved earlier for this), except:

  • Do not add the ENABLE_WALG line!
  • Add values for SU_PASSWORD, OPERATOR_PASSWORD and REPL_PASSWORD so they match the values in the original cluster.
  • Add FLY_RESTORED_FROM="thevaluedoesnotmatter". This will ensure that at the next restart, Fly will reconfigure the flypgadmin, repl_user and postgres user credentials inside the database to match the secrets you just set.

Check if this deploy goes well using fly status and fly logs.

Execute restore

Now, scale down the cluster so you only have 1 instance running.
If you are running a ‘dev mode’ cluster this is already the case.
Otherwise, run fly machine list --app=your-new-cluster-name
followed by a fly machine stop --app=your-new-cluster-name machine_id for all but one of the machines.

This will make sure that the replication process does not interfere with the restore (which otherwise can cause annoying race conditions).

Now in a fly ssh console --app=your-new-cluster-name terminal window, run:

If you want to restore to the ‘latest’ backup:

export $(cat /data/.env | xargs)
stolonctl init --yes '{ "initMode": "pitr", "pitrConfig": {"dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

If you instead want to restore a particular point in time:

  • Replace the recoveryTargetTime with the desired time in the command below.
    Note that postgres only accepts this particular format of ISO8601 timestamps, and only with the timezone part set to UTC+00.
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": { "recoveryTargetSettings": {"recoveryTargetTime": "2022-05-07 20:34:02.998+00"}, "dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

Check whether the restore goes according to plan by looking at fly logs.

Make restore cluster your new ‘real’ cluster

If you are happy with the restore, there are only a few small things left to do:

  • If your old cluster is still operational, stop its backup streaming by running fly secrets unset ENABLE_WALG --app=your-old-cluster-name. Or feel free to stop/remove the cluster altogether, of course.
  • fly secrets set ENABLE_WALG="true" --app=your-new-cluster-name. From this point onward, the new cluster will start writing to the backup store.
  • Scale up the cluster to more than 1 node for high-availability: fly machines clone --app=your-new-cluster-name machine_id_of_first_machine
  • Finally, change the DATABASE_URL setting that you use inside your app(s) to point to the new cluster.