wal-g is a tool which allows you to make ‘streaming backups’ of a Postgres database. This means that rather than having daily or hourly backups, where some data is lost whenever disaster strikes, you will lose at most a few seconds’ worth of data.
Related to this, wal-g lets you restore such a backup using a ‘point-in-time restore’ (also known as ‘PITR’): you can restore the database to the exact moment just before any problems started.
Streaming backups and PITR can be a life-saver, especially for medium to large databases where taking logical backups by running pg_dump/pg_restore starts to take so long that it interrupts the normal workings of the app, or where it is important that as little data as possible is lost in the case of a failure.
Recently a PR I made to add wal-g to Fly.io’s Postgres image was merged. This means that you can now make easy use of wal-g from within your Fly Postgres instances!
WAL-based PITR DB backups on Fly.io
I do need to disclaim that I am still fairly new to streaming backups, wal-g and stolon (the tool Fly uses to run Postgres in a highly-available cluster), so while I have written up a little guide which seems to work in the testing I have done so far, there might be things that are suboptimal. Definitely thoroughly test this yourself if you use it!
Setup
1. Ensure you are running a Fly Postgres cluster
If you do not have a cluster yet, create one using fly pg create. Write down the connection settings, because you’ll immediately need them!
If you were already running a cluster, run fly image update --app=your-appname-db to update your DB cluster. (This will restart your DB.) wal-g was added as part of image release v0.0.26.
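If you want to double-check which image release your cluster ended up on, flyctl can show it:
fly image show --app=your-appname-db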
I strongly recommend testing this out first on a new DB, and only once you understand how it works, test it on your ‘real’ DB.
1b. Pull down your DB-app’s fly.toml
Afterwards, make sure that you download the DB-app’s fly.toml, because we need to make some changes to it. I recommend doing this in a separate git repository, or in a subfolder of your main app’s git repo. This folder will be called your ‘DB-app directory’ in the next steps.
fly config save --app=your-appname-db
1c. Scale down your cluster to a single running machine
fly scale count 1
This will make interacting with the cluster easier when we are making changes to the Stolon configuration later on.
Once we’re done with the setup we’ll scale it back up.
2. Set up app settings
From within the ‘DB-app’ directory, add the following ‘general’ env variables to the [env] section in the fly.toml:
ENABLE_WALG = "true"
PRIMARY_REGION = "ams" # Choose the region you envision as the primary location for your DB
PGHOST = "localhost"
PGPORT = "5432"
PGUSER = "postgres"
- As well, add the ‘specific’ settings for your desired backup storage option; see https://github.com/wal-g/wal-g/blob/master/docs/STORAGES.md for the variable names each storage provider requires.
Of course, keep passwords out of your fly.toml, and set them as secrets instead (as part of the next step).
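For illustration, with S3(-compatible) storage the storage-specific part of the [env] section might look like the sketch below. The bucket, path, region and endpoint are placeholders, and AWS_ENDPOINT is only needed for non-AWS S3-compatible providers; double-check the STORAGES.md page above for the exact variables your provider needs.
WALG_S3_PREFIX = "s3://my-backup-bucket/my-appname-db" # placeholder bucket and path
AWS_REGION = "eu-central-1" # placeholder region
AWS_ENDPOINT = "https://s3.example.com" # placeholder; only for S3-compatible, non-AWS storage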
3. Add secrets so that Postgres, Stolon and wal-g are able to read/write to the database.
Generate secure passwords for these, and write them down in a secure place (e.g. a password manager).
OPERATOR_PASSWORD and PGPASSWORD need to be the same.
BEWARE: If you’re using an existing cluster, SU_PASSWORD, REPL_PASSWORD and OPERATOR_PASSWORD are probably already set. If so, you don’t want to overwrite them.
fly secrets set SU_PASSWORD=correcthorse REPL_PASSWORD=batterystaple OPERATOR_PASSWORD=supersecret PGPASSWORD=supersecret
NOTE: Store all of these passwords in a safe place! You will need them again during a restore!
In this step, also set any backup storage-specific passwords.
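For example, for S3(-compatible) storage this would be the access key pair (the variable names below come from wal-g’s storage documentation; the values are placeholders):
fly secrets set AWS_ACCESS_KEY_ID=AKIAEXAMPLE AWS_SECRET_ACCESS_KEY=your-secret-key-here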
4. Deploy your changes to the fly.toml and secrets.
fly deploy
fly status
5. Update Postgres configuration settings to ensure wal-g backups are enabled
Now we need to make a small change to the Postgres configuration that is managed by Stolon.
This configuration is stored on the database ‘volume’, and synced throughout your database cluster by Stolon, so you only need to make this change on one of your DB machines, and it will persist across restarts.
This change tells Stolon to enable WAL archiving in Postgres: wal-g’s wal-push command is called for every completed WAL segment, and a segment switch is forced at least every 60 seconds, so recent changes reach the backup storage within about a minute.
fly ssh console # in your DB-app directory
export $(cat /data/.env | xargs)
stolonctl update --patch '{"pgParameters": { "archive_mode": "yes", "archive_timeout": "60", "archive_command": "wal-g wal-push %p"}}'
You’ll be able to see in the logs (fly logs) whether everything is working as intended or whether the configuration needs changing.
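To double-check that the patch was applied, you can print the current cluster specification. (This assumes stolonctl spec behaves like the other stolonctl commands above; it prints the specification as JSON.)
fly ssh console # in your DB-app directory
export $(cat /data/.env | xargs)
stolonctl spec | grep archive # should show the three archive_* parameters we just set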
6. Scale up
If it’s a new cluster
Add a replica in the same region (the ‘highly-available’ replica)
fly volumes create pg_data --region ams --size 10
fly scale count 2
fly status
and then add a replica in another region (a ‘read-replica’)
fly volumes create pg_data --region fra --size 10
fly scale count 3
fly status
If it’s an existing cluster
Just a simple fly scale count 3 (or whatever your original count was) should do.
7. If it’s a new cluster: Fill the database with some data
You probably know how to do this already.
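(For completeness, a minimal sketch: the PG* variables from step 2 and the PGPASSWORD secret are set inside the VM, so psql should connect directly from an SSH console. The table and its contents are of course arbitrary.)
fly ssh console
psql -c "CREATE TABLE backup_test (id serial PRIMARY KEY, note text);" # hypothetical test table
psql -c "INSERT INTO backup_test (note) VALUES ('hello from before the first backup');"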
8. Make your first ‘traditional’ backup
NOTE: The indicated path /data/postgres is the same for all VMs that use Fly’s Postgres image.
fly ssh console
wal-g backup-push /data/postgres
WARNING: You probably want to set up a cronjob to perform ‘traditional’ backups regularly, to allow point-in-time restores to be faster (restoring from the latest ‘traditional’ backup rather than from this initial one). This is not covered in this guide.
(You can do so by re-running the same command in the future, either manually or by e.g. adding a cronjob. Adding a cronjob inside the Fly Postgres image requires changing its Dockerfile, unfortunately. Alternatively, you could of course run a separate Fly app containing such a cronjob, one that is able to connect to this DB and uses the same wal-g configuration settings. There is definite room for improvement here.)
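As a sketch, assuming you have added a cron daemon to the image and that the wal-g environment variables are available to cron, the crontab entry itself could be as simple as:
0 3 * * * wal-g backup-push /data/postgres # example schedule: a full backup every night at 03:00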
9. Check whether backups are working correctly
fly ssh console
wal-g backup-list # lists the ‘traditional’ backups present in the backup storage
wal-g wal-show # shows the WAL segments and timelines that have been archived
wal-g wal-verify integrity # checks that the archived WAL forms an unbroken chain
wal-g wal-verify timeline # checks that the cluster is on the highest timeline in storage
Besides this check, it is highly recommended to periodically test whether your ‘restoring from the backups’ flow is still working correctly.
Restoring from the backups
1. Preparation: scale cluster down to a single postgres instance.
Alternatively, create a new cluster to restore into, using the steps above (steps 1-4).
fly scale count 1
2. Re-init Stolon in ‘pitr’ mode.
If you want to restore the ‘latest’ backup:
fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": {"dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'
If you want to restore a particular point in time:
- Replace the recoveryTargetTime with the desired time in the command below.
Note that Postgres only accepts this particular format of ISO 8601 timestamps here, and only for UTC+00.
fly ssh console
export $(cat /data/.env | xargs)
stolonctl init '{ "initMode": "pitr", "pitrConfig": { "recoveryTargetSettings": {"recoveryTargetTime": "2022-05-07 20:34:02.998+00"}, "dataRestoreCommand": "wal-g backup-fetch %d LATEST" , "archiveRecoverySettings": { "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" } } }'
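(If you are unsure how to produce a timestamp in this exact format, GNU date can generate one; for example, for ‘five minutes ago’:)
date -u -d '5 minutes ago' +'%Y-%m-%d %H:%M:%S.%3N+00'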
3. Keep track of the reinitialization, using fly logs and other metrics at your disposal.
If it seems that Stolon is stuck in a ‘boot loop’ after the call to stolonctl init, kill the ‘stolon-keeper’ OS process and try from step 2 once more.
(ps aux | grep stolon will tell you its PID, and then kill PID will kill it.)
4. Do not forget to scale up again if the restore is successful
fly scale count 3
And, if you created a new cluster, now enable backups for this new cluster (step 5 of cluster creation).
Congrats, you’re done!
Let me know how you fare.
Feedback on these steps is of course very welcome.