Point-in-time backups using Postgres and wal-g

wal-g is a tool that lets you take ‘streaming backups’ with Postgres. This means that rather than relying on daily or hourly backups, where some data is lost if disaster ever strikes, you lose nothing, or at most a few seconds of data.
Related to this, wal-g lets you restore from such a backup using a ‘point-in-time restore’ (also known as ‘PITR’), so you can roll back to the exact moment just before any database problems started.

Streaming backups and PITR can be a life-saver, especially for medium to large databases where taking logical backups with pg_dump/pg_restore starts to take so long that it interrupts the normal workings of the app, or where it is important that as little data as possible is lost in the case of a failure.

Recently, a PR I made to add wal-g to Fly.io’s Postgres image was merged. This means that you can now easily use wal-g from within Fly’s Postgres images! :partying_face:


WAL-based PITR DB backups on Fly.io

I do need to disclaim that I am still fairly new to streaming backups, wal-g and stolon (the tool Fly uses to run Postgres as a highly-available cluster). While the little guide I have written up seems to work in the testing I have done so far, there might be things that are suboptimal. Definitely test this thoroughly yourself if you use it!

Setup

1. Ensure you are running a Fly Postgres cluster

If you do not have a cluster yet, create one using fly pg create. Write down the connection settings, because you’ll need them right away!
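A minimal sketch (app name, region and volume size are placeholders, and the exact flags may vary between flyctl versions):

fly pg create --name your-appname-db --region ams --volume-size 10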

If you were already running a cluster, run fly image update --app=your-appname-db to update your DB cluster (this will restart your DB). wal-g was added as part of image release v0.0.26.

I strongly recommend testing this out first on a new DB, and only once you understand how it works, test it on your ‘real’ DB.

1b. Pull down your DB-app’s fly.toml

Afterwards, download the DB-app’s fly.toml by running fly config save --app=your-appname-db, because we need to make some changes to it. I recommend doing this in a separate git repository, or in a subfolder of your main app’s git repo. This folder will be called your ‘DB-app directory’ in the next steps.

1c. Scale down your cluster to a single running machine

fly scale count 1
This will make interacting with the cluster easier when we are making changes to the Stolon configuration later on.
Once we’re done with the setup we’ll scale it back up.

2. Setup app settings

From within the ‘DB-app’ directory, add the following ‘general’ env variables to the [env] section in the fly.toml

ENABLE_WALG = "true"
PRIMARY_REGION = "ams" # Choose the region you envision as the primary location for your DB
PGHOST = "localhost"
PGPORT = "5432"
PGUSER = "postgres"

Of course, keep passwords out of your fly.toml, and set them as a secret (as part of the next step).
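As a sketch, the resulting [env] section could look something like this (the WALG_S3_PREFIX, AWS_ENDPOINT and AWS_REGION lines assume an S3-compatible bucket as the backup target; bucket name and URLs are placeholders):

[env]
  ENABLE_WALG = "true"
  PRIMARY_REGION = "ams"
  PGHOST = "localhost"
  PGPORT = "5432"
  PGUSER = "postgres"
  # Where wal-g should store its backups (assumption: an S3-compatible bucket)
  WALG_S3_PREFIX = "s3://your-backup-bucket/your-appname-db"
  AWS_ENDPOINT = "https://s3.your-provider.example"
  AWS_REGION = "eu-central-1"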

3. Add secrets so that Postgres, Stolon and wal-g are able to read/write to the database.

Generate secure passwords for these, and write them down in a secure place (e.g. a password manager).

OPERATOR_PASSWORD and PGPASSWORD need to be the same.

BEWARE: If you’re using an existing cluster, SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD are probably already set. If so, you don’t want to overwrite them.

fly secrets set SU_PASSWORD=correcthorse REPL_PASSWORD=batterystaple OPERATOR_PASSWORD=supersecret PGPASSWORD=supersecret

NOTE: Store all of these passwords in a safe place! You will need them again during a restore!

In this step, also set any backup storage-specific passwords.
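For example, when using an S3-compatible bucket as the backup target, these would presumably be the AWS-style credentials that wal-g reads (values are placeholders):

fly secrets set AWS_ACCESS_KEY_ID=your-access-key-id AWS_SECRET_ACCESS_KEY=your-secret-access-key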

4. Deploy your changes to the fly.toml and secrets.

fly deploy
fly status

5. Update the Postgres configuration settings to ensure wal-g backups are enabled

Now we need to make a small change to the Postgres configuration that is managed by Stolon.
This configuration is stored on the database ‘volume’ and synced throughout your database cluster by Stolon, so you only need to make this change on one of your DB machines, and it will persist across restarts.

This change tells Stolon to enable WAL archiving, calling wal-g’s ‘wal-push’ command for every completed WAL segment and switching to a new segment after at most 60 seconds (the archive_timeout).

fly ssh console # in your DB-app directory
export $(cat /data/.env | xargs)
stolonctl update --patch '{"pgParameters": { "archive_mode": "yes", "archive_timeout": "60", "archive_command": "wal-g wal-push %p"}}'

You’ll be able to see in the logs (fly logs) whether everything is working as intended or whether the configuration needs changing.
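To double-check that the new parameters were picked up, a quick sketch (run inside the same SSH session, with the stolon environment still loaded; the PGHOST/PGUSER/PGPASSWORD values from the earlier steps let psql connect locally):

stolonctl spec | grep -A5 pgParameters   # the patched pgParameters should show up here
psql -c 'SHOW archive_mode;'             # should report: on
psql -c 'SHOW archive_command;'          # should report: wal-g wal-push %p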

6. Scale up

If it’s a new cluster

Add a replica in the same region (the ‘highly-available’ replica)

fly volumes create pg_data --region ams --size 10
fly scale count 2
fly status

and then add a replica in another region (a ‘read-replica’)

fly volumes create pg_data --region fra --size 10
fly scale count 3
fly status

If it’s an existing cluster

Just a simple fly scale count 3 or whatever your original count was should do.

7. If it’s a new cluster: Fill the database with some data

You probably know how to do this already :stuck_out_tongue_winking_eye:.
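For a throwaway test cluster, something along these lines will do (a sketch; fly pg connect opens a psql session against the cluster):

fly pg connect -a your-appname-db
-- then, inside the psql prompt:
CREATE TABLE backup_smoke_test (id serial PRIMARY KEY, note text, inserted_at timestamptz DEFAULT now());
INSERT INTO backup_smoke_test (note) VALUES ('hello from before the first backup');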

8. Make your first ‘traditional’ backup

NOTE: The indicated path /data/postgres is the same for all VMs that use Fly’s Postgres image

fly ssh console
wal-g backup-push /data/postgres

WARNING: You probably want to set up a cronjob to perform ‘traditional’ backups regularly, so that point-in-time restores are faster (starting from the latest ‘traditional’ backup rather than from this initial one). This is not covered in this guide.
(You can do so by re-running the same command in the future, either manually or via e.g. a cronjob. Adding a cronjob inside the Fly Postgres image unfortunately requires changing its Dockerfile. Alternatively, you could of course run a separate Fly app that contains such a cronjob, is able to connect to this DB, and uses the same wal-g configuration settings. There is definite room for improvement here.)
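For illustration, a hypothetical crontab entry for nightly base backups could look like this (wherever it runs, it needs the same wal-g environment variables as above; the retention step is optional):

# Take a fresh base backup every night at 03:00 and keep the last 7 full backups
0 3 * * * wal-g backup-push /data/postgres && wal-g delete retain FULL 7 --confirm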

9. Check whether backups are working correctly

fly ssh console
wal-g backup-list
wal-g wal-show
wal-g wal-verify integrity
wal-g wal-verify timeline

Besides this check, it is highly recommended to periodically test whether your ‘restoring from the backups’ flow is still working correctly.

Restoring from the backups

1. Preparation: scale cluster down to a single postgres instance.

Alternatively, create a new cluster to restore into, using steps 1-4 above.

fly scale count 1

2. Re-init Stolon in ‘pitr’ mode.

If you want to restore the ‘latest’ backup:

fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": {"dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

If you want to restore a particular point in time:

  • Replace the recoveryTargetTime with the desired time in the command below.
    Note that postgres only accepts this particular format of ISO8601 timestamps, and only for UTC+00.
fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": { "recoveryTargetSettings": {"recoveryTargetTime": "2022-05-07 20:34:02.998+00"}, "dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'

3. Keep track of reinitialization

Follow the reinitialization using fly logs and any other metrics at your disposal.

If it seems that Stolon is stuck in a ‘boot loop’ after the call to stolonctl init, kill the ‘stolon-keeper’ OS process and try from step 2 once more.
(ps aux | grep stolon will tell you its PID, and then kill PID to kill it.)
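As a hedged one-liner, assuming pgrep is available inside the VM:

kill $(pgrep -f stolon-keeper)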

4. Do not forget to scale up again once the restore is successful

fly scale count 3

And, if you created a new cluster, now enable backups for this new cluster (step 5 of cluster creation).


Congrats, you’re done!

Let me know how you fare.
Feedback on these steps is of course very welcome.


One suggested change: If you’re using an existing cluster, SU_PASSWORD REPL_PASSWORD OPERATOR_PASSWORD are already set, and you don’t want to overwrite them.

Otherwise, this is great and I’m thrilled at how easy it was. Thank you!


I’ve changed it in the guide. Thanks for the tip! :+1:

You’re very welcome. :blush:

This is great, but for some reason the initial backup with wal-g backup-push $PGDATA fails with:

Connect: postgres connection failed: unexpected message type

The keeper logs show the following:

FATAL: password authentication failed for user “postgres”

I tried export PGPASS=$SU_PASSWORD but that didn’t seem to help. Any ideas on how to take an initial snapshot?

Strangely enough the WAL backups are working just fine!

I tried export PGPASS=$SU_PASSWORD but that didn’t seem to help. Any ideas on how to take an initial snapshot?

Try prepending it to your command.

E.G.
PG_PASS=$SU_PASSWORD wal-g backup-push /data/postgres

No Joy, but it shouldn’t need the password as far as I can tell as it’s designed to operate on the $PGDATA directory directly, so not sure why it’s trying to connect to pg?

I figured it out, it’s the OPERATOR_PASSWORD that’s required, not the SU_PASSWORD
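In other words, something like this should take the initial snapshot (a sketch; it assumes OPERATOR_PASSWORD is available in your shell, e.g. pasted from your password manager):

fly ssh console
PGPASSWORD=$OPERATOR_PASSWORD wal-g backup-push /data/postgres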


Thank you very much for this @qqwy !!

Newbie question: does it only work for replicated clusters, or can I use it with a single Postgres node?

The backup process is separate from replication (both use the WAL, the write-ahead log, as their data source, but they do not depend on one another at all). You can use it in exactly the same way with a single Postgres node.


After the PITR has finished running, does the cluster need to be re-initialized from /fly/cluster-spec.json? It seems that after the recovery is done we get errors about flypgadmin not being able to auth, and any VM restart causes the PITR to be restarted.

2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.436 UTC [872] LOG:  selected new timeline ID: 3
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | ERROR: 2022/09/23 13:15:28.457109 Archive '0000000200000034000000AB' does not exist.
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   |
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.467 UTC [872] LOG:  archive recovery complete
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | ERROR: 2022/09/23 13:15:28.764649 Archive '00000002.history' does not exist.
2022-09-23T13:15:28Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:28.886 UTC [871] LOG:  database system is ready to accept connections
2022-09-23T13:15:29Z app[94abeaf7] fra [info]checking stolon status
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.393 UTC [1685] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.393 UTC [1685] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.599 UTC [1686] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.599 UTC [1686] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.806 UTC [1687] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.806 UTC [1687] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.876 UTC [1688] FATAL:  password authentication failed for user "flypgadmin"
2022-09-23T13:15:29Z app[94abeaf7] fra [info]keeper   | 2022-09-23 13:15:29.876 UTC [1688] DETAIL:  Connection matched pg_hba.conf line 1: "local postgres flypgadmin md5"

Wondering what’s going on here, do some passwords need changing… psql doesn’t work and says the server closed the connection.

The error about flypgadmin repeats, a lot, pretty much as fast as my terminal can scroll…

Right, I’ve managed to connect to Postgres directly via port 5433 (as haproxy still has the server as unhealthy, presumably because flypgadmin can’t connect). Which password do I need to set for flypgadmin (e.g. OPERATOR_PASSWORD, SU_PASSWORD or REPL_PASSWORD)?

Answer: It’s SU_PASSWORD


This is fantastic @qqwy! Thanks very much for the guide :slight_smile:

If, like me, you had created a cluster using fly pg create, you can deploy the new [env] values using fly deploy -i flyio/postgres:14.4. Otherwise I got Error failed to fetch an image or build from source: app does not have a Dockerfile or buildpacks configured. See https://fly.io/docs/reference/configuration/#the-build-section. I used the version specified in the postgres-ha Dockerfile.

I also had to use PGPASSWORD, NOT PG_PASS or PGPASS (maybe depends on PG version, I’m using 14). Edit: I missed in the OP that PGPASSWORD should have been set as a secret - I skipped this because the new cluster had the other secrets already set and it gets cut off by the scroll bar.

If it helps anyone, there are some additional steps required when restoring to a new cluster. After executing the stolonctl init command (also doing all the other previous setup on the new cluster) you will get constant errors in the logs:

FATAL: password authentication failed for user "flypgadmin"

To fix, you need to update the SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD secret values in the new cluster to match the old cluster. However it looks like the only way to get SU_PASSWORD and REPL_PASSWORD is to ssh into an instance of the old cluster and manually extract them (OPERATOR_PASSWORD is provided when first creating a cluster). So make sure to store them somewhere accessible!
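For what it’s worth, a sketch of extracting them, assuming the secrets show up in the environment of an SSH session on the old cluster:

fly ssh console -a your-old-appname-db
env | grep -E 'SU_PASSWORD|REPL_PASSWORD|OPERATOR_PASSWORD'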

After updating the secrets I can query the new cluster and see all the previous data!

Edit: Oops, this information is basically a duplicate of what @LeoAdamek has written above


There is a little trick you can use to sync your credentials when restoring into a new cluster. If you set the FLY_RESTORED_FROM environment value to a random value, it will boot Postgres as if it were performing a restore from snapshot. This process should automatically reconfigure the flypgadmin, repl_user and postgres user credentials to match the credentials defined within the environment.
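For instance, a sketch (the value just needs to be something new so that the restore logic kicks in; it could equally be set in the [env] section of the fly.toml):

fly secrets set FLY_RESTORED_FROM=manual-restore-$(date +%s) -a your-new-appname-db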