We are glad to announce that our Postgres Flex image now comes with built-in support for Barman for Point-in-Time Recovery (PITR), and flyctl now ships with helper commands to make setting it up and using it easier.
tl;dr:
- This is a customer-managed barman setup.
- fly pg barman create will create a machine in your Postgres cluster with barman ready to use.
- You can use fly pg barman commands to back up and restore your cluster.
- Make sure your postgres-flex cluster has the latest image.
- The barman machine's volume will live on a host that isn't shared with your cluster, for safety.
Why does it matter to you?
Fly.io comes with daily backups for volumes, and that already helps with some disaster recovery scenarios when things go bad. But you probably don't want to lose any data at all: most likely you want your Recovery Point Objective to be zero, and Barman offers that with minimal headache.
Why Barman?
Barman has great support for streaming replication to store WAL files and also works well with repmgr, which we use for Postgres Flex, so it felt like a natural fit. In our testing against a Fly-provisioned cluster it worked flawlessly.
How does this work?
When you run fly pg barman create -a APP_NAME
we will create a utility machine specifically for barman that connects to your cluster using the streaming protocol. The Barman docs have a diagram showing how this works.
All connections will work out of the box. Just like your read replicas use streaming replication, barman will create a replication slot for itself so it can receive WAL files from your primary node.
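If you're curious, you can see that slot from a psql session on your primary; a quick sketch, assuming the default slot name barman (which matches the generated barman.conf shown later in this post):
$ fly pg connect -a APP_NAME
postgres=# select slot_name, slot_type, active from pg_replication_slots;
-- once the barman machine is up you should see a physical slot named 'barman' here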
Make sure your cluster image is up to date
You can use fly status to check if your cluster needs an image update to support barman. Support starts at v0.0.42.
Whenever you see APP_NAME in a command, replace it with your database app name.
$ fly status -a APP_NAME
Machine "9185ee7a236283" flyio/postgres-flex:15.3 (v0.0.41) -> flyio/postgres-flex:15.3 (v0.0.42)
Run `flyctl image update` to migrate to the latest image version.
ID STATE ROLE REGION CHECKS IMAGE CREATED UPDATED
9185ee7a236283 started primary gru 3 total, 3 passing flyio/postgres-flex:15.3 (v0.0.41) 2023-05-25T13:59:50Z 2023-05-25T14:00:05Z
$ fly image update -a APP_NAME
The following changes will be applied to all Postgres machines.
Machines not running the official Postgres image will be skipped.
... // 84 identical lines
}
},
- "image": "flyio/postgres-flex:15.3@sha256:814448113018b101e900c1fb875a75f98178f0978ac0715a7638bfb8a11bb59d",
+ "image": "registry-1.docker.io/flyio/postgres-flex:15.3@sha256:c380a6108f9f49609d64e5e83a3117397ca3b5c3202d0bf0996883ec3dbb80c8",
"restart": {
"policy": "always"
... // 7 identical lines
? Apply changes? Yes
Identifying cluster role(s)
Machine 9185ee7a236283: primary
Updating machine 9185ee7a236283
Waiting for 9185ee7a236283 to become healthy (started, 3/3)
Machine 9185ee7a236283 updated successfully!
Postgres cluster has been successfully updated!
Creating your barman machine
All you need to do is run a single command. Two things to note here: first, it's recommended to create the barman machine in the same region as your primary node (in my case, gru). Second, the volume needs to hold your Postgres base backups, so it's recommended that it is at least the same size as your primary's volume.
$ fly pg barman create -a APP_NAME
? Select a region. Prefer closer to the primary Sao Paulo, Brazil (gru)
? Select VM size: performance-1x - CPU Kind: Performance, vCPUs: 1, Memory: 2GB
? Volume size (should be at least the size of the other volumes) 40
Provisioning volume with 40GB
Provisioning barman machine with image registry-1.docker.io/lubien/postgres-flex-barman:0.0.21
Waiting for machine to start...
Machine 148ed461ce4d89 is created
Fly will also make sure not to create the barman volume on the same host as your existing nodes: we pass a flag when creating the volume for you that requires a unique zone for this specific volume.
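For reference, this is roughly what flyctl does on your behalf; a sketch using fly volumes create directly (the volume name barman_data is just an illustration, flyctl picks the real name for you):
$ fly volumes create barman_data --region gru --size 40 --require-unique-zone -a APP_NAME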
Checking if your setup is working
You can use fly pg barman check -a APP_NAME to get stats from your barman machine. If everything is fine you're going to see all checks as OK.
$ fly pg barman check -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Server pg:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (no last_backup_maximum_age provided)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 0 backups, expected at least 0)
pg_basebackup: OK
pg_basebackup compatible: OK
pg_basebackup supports tablespaces mapping: OK
systemid coherence: OK (no system Id stored on disk)
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
One common issue: if your app is not write-heavy, the first WAL file might take a while to ship, since Postgres waits until it fills an entire WAL segment (16MB by default) before shipping it to barman. If that happens, you're going to see a check for WAL archive: FAILED like this:
$ fly pg barman check -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Server pg:
WAL archive: FAILED (please make sure WAL shipping is setup)
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (no last_backup_maximum_age provided)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 0 backups, expected at least 0)
pg_basebackup: OK
pg_basebackup compatible: OK
systemid coherence: OK (no system Id stored on disk)
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
Error: ssh shell: Process exited with status 1
In case you want to ensure barman gets synced immediately, we prepared a command for you: fly pg barman switch-wal -a APP_NAME. That makes Postgres close the current WAL segment and send it to replicas and barman; then feel free to run the check command again.
$ fly pg barman switch-wal -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
The WAL file 000000010000000000000008 has been closed on server 'pg'
Waiting for the WAL file 000000010000000000000008 from server 'pg' (max: 30 seconds)
Processing xlog segments from streaming for pg
000000010000000000000008
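Under the hood this boils down to Postgres's pg_switch_wal() function, so if you'd rather trigger it from a psql session on your primary, something like this should have the same effect:
$ fly pg connect -a APP_NAME
postgres=# select pg_switch_wal();
-- closes the current WAL segment so it can be shipped to replicas and barman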
Preparing your first backup
For barman to be able to restore your WAL files up to the date you want, you first need to take a base backup. Base backups are as-is copies of your PostgreSQL data directory. They are needed for PITR because whenever you recover your database you choose a base backup as the starting point of the recovery, and barman then ships the WAL files from that point forward. Let's start simple: to take your first base backup, run this command.
$ fly pg barman backup -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Starting backup using postgres method for server pg in /data/barman.d/pg/base/20230524T173406
Backup start at LSN: 0/9000148 (000000010000000000000009, 00000148)
Starting backup copy via pg_basebackup for 20230524T173406
Copy done (time: less than one second)
Finalising the backup.
This is the first backup for server pg
WAL segments preceding the current backup have been found:
000000010000000000000006 from server pg has been removed
000000010000000000000007 from server pg has been removed
000000010000000000000008 from server pg has been removed
Backup size: 29.6 MiB
Backup end at LSN: 0/B000000 (00000001000000000000000A, 00000000)
Backup completed (start time: 2023-05-24 17:34:06.678348, elapsed time: less than one second)
Processing xlog segments from streaming for pg
000000010000000000000009
00000001000000000000000A
You can verify that everything is okay using list-backup and show-backup:
$ fly pg barman list-backup -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
pg 20230524T173406 - Wed May 24 17:34:07 2023 - Size: 45.6 MiB - WAL Size: 0 B
$ fly pg barman show-backup 20230524T173406 -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Backup 20230524T173406:
Server Name : pg
System Id : 7236768638320013868
Status : DONE
PostgreSQL Version : 150002
PGDATA directory : /data/postgresql
Base backup information:
Disk usage : 29.6 MiB (45.6 MiB with WALs)
Incremental size : 29.6 MiB (-0.00%)
Timeline : 1
Begin WAL : 00000001000000000000000A
End WAL : 00000001000000000000000A
WAL number : 1
Begin time : 2023-05-24 17:34:06.676647+00:00
End time : 2023-05-24 17:34:07.448928+00:00
Copy time : less than one second
Estimated throughput : 41.3 MiB/s
Begin Offset : 40
End Offset : 0
Begin LSN : 0/A000028
End LSN : 0/B000000
WAL information:
No of files : 0
Disk usage : 0 B
Last available : 00000001000000000000000A
Catalog information:
Retention Policy : VALID
Previous Backup : - (this is the oldest base backup)
Next Backup : - (this is the latest base backup)
If list-backup says WAITING_FOR_WALS, it's fine; it's the same situation described above when you first set up barman: your database might not have finished a WAL segment yet, and barman will receive it as soon as it does. The important bit about show-backup is that you can now perform PITR for any date starting from your first backup's End time. In my case, anything from 2023-05-24 17:34:07.448928+00:00 onwards.
Performing a disaster recovery
Everything went wrong and now your database is busted. Let's see how we can recover. For this example we are going to create a test table, insert some test data, wait a few seconds, and then insert more test data. Do note I'm inserting 1000 rows each run so the WAL segment is filled and sent to barman.
$ fly pg connect -a APP_NAME
postgres=# CREATE TABLE IF NOT EXISTS testtbl (time TIMESTAMP NULL);
CREATE TABLE
postgres=# INSERT INTO testtbl SELECT NOW() FROM generate_series(1, 1000);
INSERT 0 1000
# wait and repeat the command above
postgres=# select distinct * from testtbl order by time asc;
time
----------------------------
2023-05-24 18:20:58.035307
2023-05-24 18:21:43.332269
2023-05-24 18:22:52.374381
2023-05-24 18:25:09.704642
postgres=# drop table testtbl;
DROP TABLE
Oh no, my table is gone. Let's check backups. You can use fly pg barman show-backup latest; here's the simplified output:
$ fly pg barman show-backup latest -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Backup 20230524T173406:
Server Name : pg
System Id : 7236768638320013868
Status : DONE
Base backup information:
# more data
Begin time : 2023-05-24 17:34:06.676647+00:00
End time : 2023-05-24 17:34:07.448928+00:00
# more data
You just happen to know the table was dropped after 2023-05-24 18:29:09.704642, and your base backup ended before that, so you can recover to a time you know is safe, such as 2023-05-24 18:28:09.704642.
First, you want to destroy your replicas. We recommend using the monitoring page on the Fly.io dashboard to see which machines are replicas, based on their health checks. The URL should be fly.io/apps/APP_NAME/monitoring. Make sure only the primary and barman machines are left.
From the monitoring page I can see I need to keep machines 5683d946c4018e (primary) and 148ed461ce4d89 (barman). The reason is that we will need to restart the primary node to restore it, and if repmgr notices the primary is offline it will elect a new primary, which could lead to a split-brain situation. You don't want that, and it's quite easy to scale your cluster back up afterwards.
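If you'd rather stay in the terminal instead of the dashboard, fly status also shows each machine's role, so you can double-check which IDs to destroy:
$ fly status -a APP_NAME
# check the ROLE column: keep the primary and the barman machine, destroy the replicas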
$ fly machine destroy --force 148e22db120789
machine 148e22db120789 was found and is currently in started state, attempting to destroy...
unregistering postgres member 'fdaa:0:3335:a7b:1f60:bb88:35f6:2' from the cluster... WARN The running flyctl agent (v0.0.0-1684941583+dev) is older than the current flyctl (v0.1.10).
148e22db120789 has been destroyed
$ fly machine destroy --force e2865ed3c32e86
machine e2865ed3c32e86 was found and is currently in started state, attempting to destroy...
unregistering postgres member 'fdaa:0:3335:a7b:136:fa81:fb4d:2' from the cluster... (success)
e2865ed3c32e86 has been destroyed
Now it’s time to recover. We shipped a command just for that:
$ flyctl postgres barman recover --help
Recover primary database with a barman backup
Usage:
flyctl postgres barman recover <primary machine ID> [flags]
Flags:
-a, --app string Application name
-b, --backup-id string choose one backup ID. Default: latest (default "latest")
-h, --help help for recover
-T, --target-time string choose a target time for PITR. Example: "2023-05-16 20:55:05.958774+00:00". Default: last WAL file on barman
Global Flags:
-t, --access-token string Fly API Access Token
--verbose Verbose output
The three things you care about here are the primary machine ID, the backup ID, and the target time. You know from the previous steps that the primary machine ID is 5683d946c4018e, you can use latest for the backup ID (you can omit the flag if you want), and the target time we decided on was 2023-05-24 18:28:09.704642. Let's put that all together and see what happens.
IMPORTANT: keep your logs visible at all times. There are a million things that can happen while doing this, and you want to know when Postgres says something about your recovery state. I recommend opening the monitoring page in Focus Mode for your primary VM. The URL looks like this: https://fly.io/apps/APP_NAME/monitoring?instance=MACHINE_ID; just click on any machine ID to get there.
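If you prefer tailing logs from the terminal, fly logs can follow the same output; a sketch, assuming your flyctl version supports the -i/--instance filter:
$ fly logs -a APP_NAME -i MACHINE_ID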
$ fly pg barman recover 5683d946c4018e -b latest --target-time "2023-05-24 18:28:09.704642" -a APP_NAME
2023/05/24 15:55:07 Barman server will run the following command: barman recover pg --get-wal latest /data/postgresql --remote-ssh-command "ssh root@5683d946c4018e.vm.APP_NAME.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --target-time "2023-05-24 18:28:09.704642"
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Starting remote restore for server pg using backup 20230524T173406
Destination directory: /data/postgresql
Remote command: ssh root@5683d946c4018e.vm.APP_NAME.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
Doing PITR. Recovery target time: '2023-05-24 18:28:09.704642+00:00'
Copying the base backup.
Generating recovery configuration
Identify dangerous settings in destination directory.
WARNING
You are required to review the following options as potentially dangerous
postgresql.conf line 815: include = 'postgresql.internal.conf'
postgresql.conf line 816: include = 'postgresql.user.conf'
WARNING: 'get-wal' is in the specified 'recovery_options'.
Before you start up the PostgreSQL server, please review the postgresql.auto.conf file
inside the target directory. Make sure that 'restore_command' can be executed by the PostgreSQL user.
Recovery completed (start time: 2023-05-24 18:55:14.768077+00:00, elapsed time: 1 second)
Your PostgreSQL server has been successfully prepared for recovery!
At this point your primary node should have two failing health checks because Postgres has stopped; that's fine. Let's take a peek at the VM to see what happened!
# select your primary
$ fly ssh console -s --pty -C "/bin/bash" -a APP_NAME
? Select VM: gru: 5683d946c4018e fdaa:0:3335:a7b:145:c39f:6620:2 withered-resonance-4917
Connecting to fdaa:0:3335:a7b:145:c39f:6620:2... complete
root@5683d946c4018e:/# cd data/postgresql/
root@5683d946c4018e:/data/postgresql# cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
# The 'barman-wal-restore' command is provided in the 'barman-cli' package
restore_command = 'barman-wal-restore -P -U root 148ed461ce4d89 pg %f %p'
recovery_target_time = '2023-05-24 18:28:09.704642'
root@5683d946c4018e:/data/postgresql# ls -la barman_wal/
total 12
drwxr-xr-x 3 root root 4096 May 24 18:55 .
drwx------ 20 postgres postgres 4096 May 24 18:55 ..
drwxr-xr-x 2 root root 4096 May 24 18:55 archive_status
This postgresql.auto.conf file is temporary and will be removed as soon as the recovery is finished. You can also verify there's a recovery.signal file there. We need to restart this VM to start the recovery process. Exit the console and let's do it. Keep watching your primary node's logs; they will show some errors at first but will eventually reach a stable recovery. You want to use fly machine restart to restart only this single machine instead of fly pg restart, which would also restart barman. You also want to pass --skip-health-checks, since your DB is likely to have failing checks until the recovery is done.
$ fly m restart 5683d946c4018e --skip-health-checks -a APP_NAME
Restarting machine 5683d946c4018e
Machine 5683d946c4018e restarted successfully!
Now, if you're paying attention to your logs, you will notice that Postgres is asking you to run pg_wal_replay_resume() and that the primary node became a zombie; that's fine.
[info] postgres | 2023-05-24 19:02:51.942 UTC [594] LOG: consistent recovery state reached at 0/A000100
[info] postgres | 2023-05-24 19:02:51.942 UTC [554] LOG: database system is ready to accept read-only connections
[info] postgres | 2023-05-24 19:02:51.948 UTC [594] LOG: recovery stopping before commit of transaction 787, time 2023-05-24 18:40:11.704292+00
[info] postgres | 2023-05-24 19:02:51.948 UTC [594] LOG: pausing at the end of recovery
[info] postgres | 2023-05-24 19:02:51.948 UTC [594] HINT: Execute pg_wal_replay_resume() to promote.
Let's fix this. We first need to SSH into the primary node once more and go to /data/postgresql. You can see that the recovery files and folders are still there. Also, if you ls /data you're going to see the zombie.lock file.
# select your primary
$ fly ssh console -s --pty -C "/bin/bash" -a APP_NAME
? Select VM: gru: 5683d946c4018e fdaa:0:3335:a7b:145:c39f:6620:2 withered-resonance-4917
Connecting to fdaa:0:3335:a7b:145:c39f:6620:2... complete
root@5683d946c4018e:/# cd data/postgresql/
root@5683d946c4018e:/data/postgresql# ls -lah
total 364K
drwx------ 20 postgres postgres 4.0K May 24 19:18 .
drwxr-xr-x 5 postgres postgres 4.0K May 24 19:18 ..
drwxr-xr-x 3 postgres postgres 4.0K May 24 19:17 barman_wal
-rw------- 1 postgres postgres 290 May 24 19:17 postgresql.auto.conf
-rw-r--r-- 1 postgres postgres 0 May 24 19:17 recovery.signal
# more files omitted
root@5683d946c4018e:/data/postgresql# ls /data
flypg.internal.conf lost+found readonly.lock repmgr.internal.conf zombie.lock
flypg.user.conf postgresql repmgr.conf repmgr.user.conf
Now we need to resume the database system. Usually you could connect to it on port 5432, but since it's in a zombie state this node is treated as a replica, so we need to connect through port 5433 and run pg_wal_replay_resume().
root@5683d946c4018e:/data/postgresql# psql "postgres://repmgr:$REPL_PASSWORD@$FLY_APP_NAME.internal:5433/postgres"
psql (15.2 (Debian 15.2-1.pgdg110+1))
Type "help" for help.
postgres=# select pg_wal_replay_resume();
pg_wal_replay_resume
----------------------
(1 row)
Look at the logs on your primary node right away: you may see a failing barman get-wal, but that's fine; the first runs usually error out and then start working. The important bit is archive recovery complete. Your recovery.signal file should have been deleted.
[info] postgres | 2023-05-25 11:52:24.992 UTC [591] LOG: redo done at 0/B04BE58 system usage: CPU: user: 1.10 s, system: 1.23 s, elapsed: 59661.32 s
[info] postgres | 2023-05-25 11:52:24.992 UTC [591] LOG: last completed transaction was at log time 2023-05-24 18:22:52.856586+00
[info] postgres | ERROR: Remote 'barman get-wal' command has failed!
[info] postgres | 2023-05-25 11:52:25.254 UTC [591] LOG: selected new timeline ID: 3
[info] repmgrd | [2023-05-25 11:52:25] [NOTICE] node is now a standby, switching to standby monitoring
[info] postgres | 2023-05-25 11:52:25.542 UTC [591] LOG: restored log file "00000002.history" from archive
[info] postgres | 2023-05-25 11:52:25.542 UTC [591] LOG: archive recovery complete
[info] postgres | 2023-05-25 11:52:25.543 UTC [589] LOG: checkpoint starting: end-of-recovery immediate wait
[info] postgres | 2023-05-25 11:52:25.546 UTC [589] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.003 s; sync files=2, longest=0.001 s, average=0.001 s; distance=2 kB, estimate=15016 kB
[info] postgres | 2023-05-25 11:52:25.547 UTC [565] LOG: database system is ready to accept connections
Now we need to fix your primary being a zombie. The reason is that you removed two nodes from your cluster, but repmgr might still be waiting for them to come back. Exit your Postgres shell and let's use repmgr! You want to list your cluster and unregister any standby nodes that are gone. You need to do that as an unprivileged user such as postgres.
root@5683d946c4018e:/# su postgres
postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------------+----------------------------------+---------+---------------+-----------------------------------+----------+----------+----------+---------------------------------------------------------------------------------------------
471003240 | fdaa:0:3335:a7b:1f60:bb88:35f6:2 | standby | ? unreachable | ? fdaa:0:3335:a7b:145:c39f:6620:2 | gru | 100 | | host=fdaa:0:3335:a7b:1f60:bb88:35f6:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5
919100672 | fdaa:0:3335:a7b:136:fa81:fb4d:2 | standby | ? unreachable | ? fdaa:0:3335:a7b:145:c39f:6620:2 | gru | 100 | | host=fdaa:0:3335:a7b:136:fa81:fb4d:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5
1508227546 | fdaa:0:3335:a7b:145:c39f:6620:2 | primary | * running | | gru | 100 | 3 | host=fdaa:0:3335:a7b:145:c39f:6620:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5
WARNING: following issues were detected
- unable to connect to node "fdaa:0:3335:a7b:1f60:bb88:35f6:2" (ID: 471003240)
- node "fdaa:0:3335:a7b:1f60:bb88:35f6:2" (ID: 471003240) is registered as an active standby but is unreachable
- unable to connect to node "fdaa:0:3335:a7b:136:fa81:fb4d:2" (ID: 919100672)
- node "fdaa:0:3335:a7b:136:fa81:fb4d:2" (ID: 919100672) is registered as an active standby but is unreachable
HINT: execute with --verbose option to see connection error messages
# You can see the deleted node IDs are 471003240 and 919100672
postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf standby unregister --node-id 471003240
INFO: connecting to local standby
INFO: connecting to primary database
NOTICE: unregistering node 471003240
INFO: standby unregistration complete
postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf standby unregister --node-id 919100672
INFO: connecting to local standby
INFO: connecting to primary database
NOTICE: unregistering node 919100672
INFO: standby unregistration complete
postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------------+---------------------------------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------------
1508227546 | fdaa:0:3335:a7b:145:c39f:6620:2 | primary | * running | | gru | 100 | 3 | host=fdaa:0:3335:a7b:145:c39f:6620:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5
It should take a minute, but your primary will be back and you can now start horizontally scaling your DB again. Now let's check the DB.
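A quick way to check is to re-run the earlier query; the rows inserted before the recovery target time should be back (your timestamps will differ, of course):
$ fly pg connect -a APP_NAME
postgres=# select distinct * from testtbl order by time asc;
-- you should see the rows inserted before 2023-05-24 18:28:09.704642 again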
The database is back to normal. Let’s check barman!
$ fly pg barman check -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Server pg:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (no last_backup_maximum_age provided)
backup minimum size: OK (29.6 MiB)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 1 backups, expected at least 0)
pg_basebackup: OK
pg_basebackup compatible: OK
pg_basebackup supports tablespaces mapping: OK
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
Notes about the barman setup
We shipped a few helper commands to flyctl under fly pg barman, but you can always SSH into the barman machine and make full use of it. Just remember to add your server name, which is pg, so a command like fly pg barman check becomes barman check pg inside the VM. Here are the docs for all available commands.
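For example, here's what that looks like when you SSH into the barman machine and run barman directly (a sketch; pick your barman machine when prompted):
# select your barman machine
$ fly ssh console -s --pty -C "/bin/bash" -a APP_NAME
root@148ed461ce4d89:/# barman check pg
root@148ed461ce4d89:/# barman list-backup pg
root@148ed461ce4d89:/# barman show-backup pg latest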
Two other important files you can play with are barman.conf and barman.cron, both located in the /data directory.
root@148ed461ce4d89:/# cat /data/barman.cron
* * * * * /usr/bin/barman cron
root@148ed461ce4d89:/# cat /data/barman.conf
[barman]
barman_user = root
barman_home = /data/barman.d
log_level = info
log_file = /data/barman.log
[pg]
description = "Fly.io Postgres Cluster"
conninfo = host=APP_NAME.internal user=repmgr dbname=postgres
streaming_conninfo = host=barman-8th.internal user=repmgr dbname=postgres
backup_method = postgres
streaming_archiver = on
slot_name = barman
create_slot = auto
retention_policy_mode = auto
retention_policy = RECOVERY WINDOW OF 7 days
wal_retention_policy = main
Feel free to edit them as you please; your node will not touch them again unless you delete these files and restart the VM. If you change barman.cron, don't forget to run crontab /data/barman.cron to update your crontab. You can see all available options for your barman.conf file in the official GitHub repo. You'll want to take a special look at how retention policies work in barman.
More base backups: to do or not to do
It's your job to ensure barman is running correctly. Having a really old base backup will make restoring the WAL files take longer in a recovery scenario, so it's not a bad idea to take more frequent base backups; you can use barman.cron for that. That will of course require more volume storage, so make sure your barman volume is up for the job. You can always extend it if needed.
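For example, a nightly base backup at 03:00 UTC could look like this; a sketch you'd apply from inside the barman machine, keeping the existing barman cron line and reloading the crontab afterwards:
root@148ed461ce4d89:/# cat /data/barman.cron
* * * * * /usr/bin/barman cron
0 3 * * * /usr/bin/barman backup pg
root@148ed461ce4d89:/# crontab /data/barman.cron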
What about Recovery Point Objective?
It's all about the WALs. As mentioned previously, a DB that isn't write-heavy can take a long time to ship WAL files, which could mean losing that data in a disaster recovery scenario. The Postgres docs cover this in more detail; check out Continuous Archiving and Point-in-Time Recovery (PITR). Fortunately, with WAL streaming (which we set up for you), barman also receives partial WAL files. If you're interested in this, we recommend taking a look at the barman docs and this post from 2ndQuadrant (they are the barman maintainers).
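If you want to see those partial files, they live on the barman machine under the server's streaming directory; with the barman_home from the barman.conf above, the path should look like this (barman's standard layout, so treat the exact path as an assumption):
root@148ed461ce4d89:/# ls /data/barman.d/pg/streaming/
# the WAL segment currently being streamed shows up here with a .partial suffix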
See y’all next time!