Point in Time Recovery for Postgres Flex using Barman

We are glad to announce that our Postgres Flex image now comes with built-in support for Barman for Point in Time Recovery (PITR), and flyctl now ships with helper commands to make setting it up and using it easier.

tl;dr:

  • This is a customer-managed barman setup.
  • fly pg barman create will create a machine in your Postgres cluster with barman ready to use.
  • You can use fly pg barman commands to backup and restore your cluster.
  • Make sure your postgres-flex cluster has the latest image.
  • For safety, the barman volume will live on a host that's different from the hosts running your cluster.

Why does this matter to you?

Fly.io comes with daily backups for volumes, and that can already help with some disaster recovery scenarios when things go bad. But you probably don't want to lose any data at all; most likely you want your Recovery Point Objective (RPO) to be zero, and Barman offers that with minimal headache.

Why Barman?

Barman has great support for streaming replication to store WAL files and also works well with repmgr, which we use for Postgres Flex, so it felt like a natural fit. In our testing with a Fly-provisioned cluster it could not have worked better.

How does this work?

When you run fly pg barman create -a APP_NAME we will create a utility machine specifically for barman that connects to your cluster using the streaming protocol. Here's an image from the Barman docs showing how this works.

All connections work out of the box. Just like your read replicas use streaming replication, barman will create a replication slot for itself so it can receive WAL files from your primary node.
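
If you're curious, you can see that slot from a psql session on your primary once the barman machine exists (the slot name should be barman, per the default barman.conf shown later in this post):

$ fly pg connect -a APP_NAME
postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots;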

Make sure your cluster image is up to date

You can use fly status to check whether your cluster needs an image update to support barman. Support started at v0.0.42.

Whenever you see APP_NAME in commands, replace it with your database app name.

$ fly status -a APP_NAME

Machine "9185ee7a236283" flyio/postgres-flex:15.3 (v0.0.41) -> flyio/postgres-flex:15.3 (v0.0.42)

Run `flyctl image update` to migrate to the latest image version.
ID            	STATE  	ROLE   	REGION	CHECKS            	IMAGE                             	CREATED             	UPDATED
9185ee7a236283	started	primary	gru   	3 total, 3 passing	flyio/postgres-flex:15.3 (v0.0.41)	2023-05-25T13:59:50Z	2023-05-25T14:00:05Z

$ fly image update -a APP_NAME
The following changes will be applied to all Postgres machines.
Machines not running the official Postgres image will be skipped.

  	... // 84 identical lines
  	    }
  	  },
- 	  "image": "flyio/postgres-flex:15.3@sha256:814448113018b101e900c1fb875a75f98178f0978ac0715a7638bfb8a11bb59d",
+ 	  "image": "registry-1.docker.io/flyio/postgres-flex:15.3@sha256:c380a6108f9f49609d64e5e83a3117397ca3b5c3202d0bf0996883ec3dbb80c8",
  	  "restart": {
  	    "policy": "always"
  	... // 7 identical lines

? Apply changes? Yes
Identifying cluster role(s)
  Machine 9185ee7a236283: primary
Updating machine 9185ee7a236283
  Waiting for 9185ee7a236283 to become healthy (started, 3/3)
Machine 9185ee7a236283 updated successfully!
Postgres cluster has been successfully updated!

Creating your barman machine

All you need to do is run a single command. Two things to note here: first, it's recommended to create the machine in the same region as your primary node (in my case that was gru). Second, the volume needs to be able to hold your Postgres base backups, so it's recommended to make it at least the same size as your primary's volume.

$ fly pg barman create -a APP_NAME
? Select a region. Prefer closer to the primary Sao Paulo, Brazil (gru)
? Select VM size: performance-1x - CPU Kind: Performance, vCPUs: 1, Memory: 2GB
? Volume size (should be at least the size of the other volumes) 40
Provisioning volume with 40GB
Provisioning barman machine with image registry-1.docker.io/lubien/postgres-flex-barman:0.0.21
Waiting for machine to start...
Machine 148ed461ce4d89 is created

Fly will also make sure not to create the barman volume on the same host as your existing nodes; we pass a flag when creating the volume for you that requires a unique zone for this specific volume.
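
You don't need to do this yourself, but for reference the equivalent manual step would look roughly like this (the volume name is just a placeholder for illustration):

$ fly volumes create barman_data --region gru --size 40 --require-unique-zone -a APP_NAME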

Checking if your setup is working

You can use fly pg barman check -a APP_NAME to get stats from your barman machine. If everything is fine you're going to see all checks as OK.

$ fly pg barman check -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Server pg:
	PostgreSQL: OK
	superuser or standard user with backup privileges: OK
	PostgreSQL streaming: OK
	wal_level: OK
	replication slot: OK
	directories: OK
	retention policy settings: OK
	backup maximum age: OK (no last_backup_maximum_age provided)
	backup minimum size: OK (0 B)
	wal maximum age: OK (no last_wal_maximum_age provided)
	wal size: OK (0 B)
	compression settings: OK
	failed backups: OK (there are 0 failed backups)
	minimum redundancy requirements: OK (have 0 backups, expected at least 0)
	pg_basebackup: OK
	pg_basebackup compatible: OK
	pg_basebackup supports tablespaces mapping: OK
	systemid coherence: OK (no system Id stored on disk)
	pg_receivexlog: OK
	pg_receivexlog compatible: OK
	receive-wal running: OK
	archiver errors: OK

One common issue: if your app is not write-heavy, the first finished WAL file might take a while to ship, since Postgres waits until it fills an entire WAL segment (by default 16 MB) before shipping it to barman. If that happens you're going to see a WAL archive: FAILED check like this:

$ fly pg barman check -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Server pg:
	WAL archive: FAILED (please make sure WAL shipping is setup)
	PostgreSQL: OK
	superuser or standard user with backup privileges: OK
	PostgreSQL streaming: OK
	directories: OK
	retention policy settings: OK
	backup maximum age: OK (no last_backup_maximum_age provided)
	backup minimum size: OK (0 B)
	wal maximum age: OK (no last_wal_maximum_age provided)
	wal size: OK (0 B)
	compression settings: OK
	failed backups: OK (there are 0 failed backups)
	minimum redundancy requirements: OK (have 0 backups, expected at least 0)
	pg_basebackup: OK
	pg_basebackup compatible: OK
	systemid coherence: OK (no system Id stored on disk)
	pg_receivexlog: OK
	pg_receivexlog compatible: OK
	receive-wal running: OK
	archiver errors: OK
Error: ssh shell: Process exited with status 1

In case you want to ensure barman gets synced immediately, we prepared a command for you: fly pg barman switch-wal -a APP_NAME. That will make Postgres close the current WAL segment and send it to your replicas and barman; then feel free to run the check command again.

$ fly pg barman switch-wal -a APP_NAME 
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
The WAL file 000000010000000000000008 has been closed on server 'pg'
Waiting for the WAL file 000000010000000000000008 from server 'pg' (max: 30 seconds)
Processing xlog segments from streaming for pg
	000000010000000000000008

Preparing your first backup

For barman to be able to restore your WAL files up to a date you want, you first need to take a base backup. Base backups are as-is copies of your PostgreSQL data. They are needed for PITR because whenever you recover your database you choose a base backup as the starting point of the recovery, and barman then ships the missing WAL files from that point forward. Let's start simple: to take your first base backup, run this command.

$ fly pg barman backup -a APP_NAME 
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Starting backup using postgres method for server pg in /data/barman.d/pg/base/20230524T173406
Backup start at LSN: 0/9000148 (000000010000000000000009, 00000148)
Starting backup copy via pg_basebackup for 20230524T173406
Copy done (time: less than one second)
Finalising the backup.
This is the first backup for server pg
WAL segments preceding the current backup have been found:
	000000010000000000000006 from server pg has been removed
	000000010000000000000007 from server pg has been removed
	000000010000000000000008 from server pg has been removed
Backup size: 29.6 MiB
Backup end at LSN: 0/B000000 (00000001000000000000000A, 00000000)
Backup completed (start time: 2023-05-24 17:34:06.678348, elapsed time: less than one second)
Processing xlog segments from streaming for pg
	000000010000000000000009
	00000001000000000000000A

You can verify that everything is okay using list-backup and show-backup:

$ fly pg barman list-backup -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
pg 20230524T173406 - Wed May 24 17:34:07 2023 - Size: 45.6 MiB - WAL Size: 0 B

$ fly pg barman show-backup 20230524T173406 -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Backup 20230524T173406:
  Server Name            : pg
  System Id              : 7236768638320013868
  Status                 : DONE
  PostgreSQL Version     : 150002
  PGDATA directory       : /data/postgresql

  Base backup information:
    Disk usage           : 29.6 MiB (45.6 MiB with WALs)
    Incremental size     : 29.6 MiB (-0.00%)
    Timeline             : 1
    Begin WAL            : 00000001000000000000000A
    End WAL              : 00000001000000000000000A
    WAL number           : 1
    Begin time           : 2023-05-24 17:34:06.676647+00:00
    End time             : 2023-05-24 17:34:07.448928+00:00
    Copy time            : less than one second
    Estimated throughput : 41.3 MiB/s
    Begin Offset         : 40
    End Offset           : 0
    Begin LSN            : 0/A000028
    End LSN              : 0/B000000

  WAL information:
    No of files          : 0
    Disk usage           : 0 B
    Last available       : 00000001000000000000000A

  Catalog information:
    Retention Policy     : VALID
    Previous Backup      : - (this is the oldest base backup)
    Next Backup          : - (this is the latest base backup)

If your list-backup says WAITING_FOR_WALS, that's fine; it's the same situation described above when you first set up barman: your database might not have finished a WAL segment yet, and barman will receive it as soon as possible. The important bit about show-backup is that you can now perform PITR to any point starting from your first backup's End time. In my case, anything from 2023-05-24 17:34:07.448928+00:00 onwards.
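
For example, a quick way to grab that End time is:

$ fly pg barman show-backup latest -a APP_NAME | grep "End time"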

Performing a disaster recovery

Everything went wrong and now your database is busted. Let's see how we can recover. For this example we are going to create a test table, insert some test data, wait a few seconds, and then insert more test data. Do note I'm inserting 1000 entries per run so my WAL segment gets filled and sent to barman.

$ fly pg connect -a APP_NAME

postgres=# CREATE TABLE IF NOT EXISTS testtbl (time TIMESTAMP NULL);
CREATE TABLE

postgres=# INSERT INTO testtbl SELECT NOW() FROM generate_series(1, 1000);
INSERT 0 1000

# wait and repeat the command above

postgres=# select distinct * from testtbl order by time asc;
            time
----------------------------
 2023-05-24 18:20:58.035307
 2023-05-24 18:21:43.332269
 2023-05-24 18:22:52.374381
 2023-05-24 18:25:09.704642

postgres=# drop table testtbl; 
DROP TABLE

Oh no, my table is gone. Let's check backups. You can use fly pg barman show-backup latest; here's the simplified output:

$ fly pg barman show-backup latest -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Backup 20230524T173406:
  Server Name            : pg
  System Id              : 7236768638320013868
  Status                 : DONE

  Base backup information:
    # more data
    Begin time           : 2023-05-24 17:34:06.676647+00:00
    End time             : 2023-05-24 17:34:07.448928+00:00
    # more data

You just happen to know the table was dropped after 2023-05-24 18:29:09.704642, and your base backup ended before that, so you can recover to a time you know is safe, for example 2023-05-24 18:28:09.704642.

First, you want to destroy your replicas. We recommend using the monitoring page on the fly.io dashboard to tell which machines are replicas based on their health checks. The URL should be fly.io/apps/APP_NAME/monitoring. Make sure only the primary and barman nodes are left.
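
If you prefer the CLI over the dashboard, fly status also lists each machine's role, so you can spot the replicas there:

$ fly status -a APP_NAME
# the ROLE column shows which machines are replicas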

From the monitoring page I can see I need to keep machines 5683d946c4018e (primary) and 148ed461ce4d89 (barman). The reason is that we will need to restart the primary node to restore it, and if repmgr notices the primary is offline it will elect a new primary, which could lead to a split-brain situation. You don't want that, and it's quite easy to scale your cluster back up afterwards.

$ fly machine destroy --force 148e22db120789
machine 148e22db120789 was found and is currently in started state, attempting to destroy...
unregistering postgres member 'fdaa:0:3335:a7b:1f60:bb88:35f6:2' from the cluster... WARN The running flyctl agent (v0.0.0-1684941583+dev) is older than the current flyctl (v0.1.10).
148e22db120789 has been destroyed

$ fly machine destroy --force e2865ed3c32e86
machine e2865ed3c32e86 was found and is currently in started state, attempting to destroy...
unregistering postgres member 'fdaa:0:3335:a7b:136:fa81:fb4d:2' from the cluster... (success)
e2865ed3c32e86 has been destroyed

Now it’s time to recover. We shipped a command just for that:

$ flyctl postgres barman recover --help

Recover primary database with a barman backup

Usage:
  flyctl postgres barman recover <primary machine ID> [flags]

Flags:
  -a, --app string           Application name
  -b, --backup-id string     choose one backup ID. Default: latest (default "latest")
  -h, --help                 help for recover
  -T, --target-time string   choose a target time for PITR. Example: "2023-05-16 20:55:05.958774+00:00". Default: last WAL file on barman

Global Flags:
  -t, --access-token string   Fly API Access Token
      --verbose               Verbose output

The 3 things you care about here are the primary machine ID, the backup ID, and the target time. You know from previous steps that the primary machine ID is 5683d946c4018e, you can use latest for the backup ID (or omit the flag entirely), and the target time we decided on was 2023-05-24 18:28:09.704642. Let's put that all together and see what happens.

IMPORTANT: at all times keep your logs visible. There are a million things that can happen while doing this, and you want to know whenever Postgres says something about your recovery state. I recommend opening the monitoring page in Focus Mode for your primary VM. The URL should look like this: https://fly.io/apps/APP_NAME/monitoring?instance=MACHINE_ID; just click on any machine ID to open it.

$ fly pg barman recover 5683d946c4018e -a APP_NAME -b latest --target-time "2023-05-24 18:28:09.704642"
2023/05/24 15:55:07 Barman server will run the following command: barman recover pg --get-wal latest /data/postgresql --remote-ssh-command "ssh root@5683d946c4018e.vm.APP_NAME.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --target-time "2023-05-24 18:28:09.704642"
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Starting remote restore for server pg using backup 20230524T173406
Destination directory: /data/postgresql
Remote command: ssh root@5683d946c4018e.vm.APP_NAME.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
Doing PITR. Recovery target time: '2023-05-24 18:28:09.704642+00:00'
Copying the base backup.
Generating recovery configuration
Identify dangerous settings in destination directory.

WARNING
You are required to review the following options as potentially dangerous

postgresql.conf line 815: include = 'postgresql.internal.conf'
postgresql.conf line 816: include = 'postgresql.user.conf'

WARNING: 'get-wal' is in the specified 'recovery_options'.
Before you start up the PostgreSQL server, please review the postgresql.auto.conf file
inside the target directory. Make sure that 'restore_command' can be executed by the PostgreSQL user.

Recovery completed (start time: 2023-05-24 18:55:14.768077+00:00, elapsed time: 1 second)
Your PostgreSQL server has been successfully prepared for recovery!

At this point your primary node should have 2 failing health checks because Postgres has stopped; that's fine. Let's take a peek at the VM to see what happened!

# select your primary
$ fly ssh console -s --pty -C "/bin/bash" -a APP_NAME
? Select VM: gru: 5683d946c4018e fdaa:0:3335:a7b:145:c39f:6620:2 withered-resonance-4917
Connecting to fdaa:0:3335:a7b:145:c39f:6620:2... complete

root@5683d946c4018e:/# cd data/postgresql/

root@5683d946c4018e:/data/postgresql# cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.

# The 'barman-wal-restore' command is provided in the 'barman-cli' package
restore_command = 'barman-wal-restore -P -U root 148ed461ce4d89 pg %f %p'
recovery_target_time = '2023-05-24 18:28:09.704642'

root@5683d946c4018e:/data/postgresql# ls -la barman_wal/
total 12
drwxr-xr-x  3 root     root     4096 May 24 18:55 .
drwx------ 20 postgres postgres 4096 May 24 18:55 ..
drwxr-xr-x  2 root     root     4096 May 24 18:55 archive_status

This postgresql.auto.conf file is temporary and will be removed as soon as the recovery is finished. You can also verify there's a recovery.signal file there. We need to restart this VM to start the recovery process. Exit the console and let's do this. Keep watching your primary node's logs; they will show some errors at first but will eventually reach a stable recovery. You want to use fly machine restart to restart only a single machine, instead of fly pg restart which would also restart barman. You also want --skip-health-checks since your DB will likely have failing checks until the recovery is done.

$ fly m restart 5683d946c4018e --skip-health-checks -a APP_NAME
Restarting machine 5683d946c4018e
Machine 5683d946c4018e restarted successfully!

Now, if you're paying attention to your logs, you will notice that Postgres is asking you to run pg_wal_replay_resume() and the primary node has become a zombie; that's fine.

[info] postgres | 2023-05-24 19:02:51.942 UTC [594] LOG: consistent recovery state reached at 0/A000100
[info] postgres | 2023-05-24 19:02:51.942 UTC [554] LOG: database system is ready to accept read-only connections
[info] postgres | 2023-05-24 19:02:51.948 UTC [594] LOG: recovery stopping before commit of transaction 787, time 2023-05-24 18:40:11.704292+00
[info] postgres | 2023-05-24 19:02:51.948 UTC [594] LOG: pausing at the end of recovery
[info] postgres | 2023-05-24 19:02:51.948 UTC [594] HINT: Execute pg_wal_replay_resume() to promote.

Let's fix this. We first need to SSH into the primary node once more and go to /data/postgresql. You can see that the recovery files and folders are still there. Also, if you ls /data you're going to see the zombie.lock file.

# select your primary
$ fly ssh console -s --pty -C "/bin/bash" -a APP_NAME
? Select VM: gru: 5683d946c4018e fdaa:0:3335:a7b:145:c39f:6620:2 withered-resonance-4917
Connecting to fdaa:0:3335:a7b:145:c39f:6620:2... complete

root@5683d946c4018e:/# cd data/postgresql/

root@5683d946c4018e:/data/postgresql# ls -lah
total 364K
drwx------ 20 postgres postgres 4.0K May 24 19:18 .
drwxr-xr-x  5 postgres postgres 4.0K May 24 19:18 ..
drwxr-xr-x  3 postgres postgres 4.0K May 24 19:17 barman_wal
-rw-------  1 postgres postgres  290 May 24 19:17 postgresql.auto.conf
-rw-r--r--  1 postgres postgres    0 May 24 19:17 recovery.signal
# more files omitted

root@5683d946c4018e:/data/postgresql# ls /data
flypg.internal.conf  lost+found  readonly.lock	repmgr.internal.conf  zombie.lock
flypg.user.conf      postgresql  repmgr.conf	repmgr.user.conf

Now we need to resume our database system. Usually you would connect to it on port 5432, but since it's in a zombie state this node is treated as a replica, so we need to connect through port 5433 and run pg_wal_replay_resume().

root@5683d946c4018e:/data/postgresql# psql "postgres://repmgr:$REPL_PASSWORD@$FLY_APP_NAME.internal:5433/postgres"
psql (15.2 (Debian 15.2-1.pgdg110+1))
Type "help" for help.

postgres=# select pg_wal_replay_resume();
 pg_wal_replay_resume
----------------------

(1 row)

If you look at your logs on the primary node right away you may see a failing barman get-wal, but that's fine; the first runs usually error and then start working. The important bit is: archive recovery complete. Your recovery.signal file should have been deleted.

[info] postgres | 2023-05-25 11:52:24.992 UTC [591] LOG: redo done at 0/B04BE58 system usage: CPU: user: 1.10 s, system: 1.23 s, elapsed: 59661.32 s
[info] postgres | 2023-05-25 11:52:24.992 UTC [591] LOG: last completed transaction was at log time 2023-05-24 18:22:52.856586+00
[info] postgres | ERROR: Remote 'barman get-wal' command has failed!
[info] postgres | 2023-05-25 11:52:25.254 UTC [591] LOG: selected new timeline ID: 3
[info] repmgrd | [2023-05-25 11:52:25] [NOTICE] node is now a standby, switching to standby monitoring
[info] postgres | 2023-05-25 11:52:25.542 UTC [591] LOG: restored log file "00000002.history" from archive
[info] postgres | 2023-05-25 11:52:25.542 UTC [591] LOG: archive recovery complete
[info] postgres | 2023-05-25 11:52:25.543 UTC [589] LOG: checkpoint starting: end-of-recovery immediate wait
[info] postgres | 2023-05-25 11:52:25.546 UTC [589] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.003 s; sync files=2, longest=0.001 s, average=0.001 s; distance=2 kB, estimate=15016 kB
[info] postgres | 2023-05-25 11:52:25.547 UTC [565] LOG: database system is ready to accept connections
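
If you want to double-check from the primary's shell, you can confirm the recovery.signal file is gone (this is just a sanity check):

root@5683d946c4018e:/# ls /data/postgresql/ | grep recovery.signal   # should print nothing now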

Now we need to fix your primary being a zombie. The reason is that you removed two nodes from your cluster, but repmgr might still be waiting for them to come back. Exit your Postgres shell and let's start using repmgr! You want to list your cluster and unregister any standby nodes that are gone. You need to do that as an unprivileged user such as postgres.

root@5683d946c4018e:/# su postgres

postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf cluster show
 ID         | Name                             | Role    | Status        | Upstream                          | Location | Priority | Timeline | Connection string
------------+----------------------------------+---------+---------------+-----------------------------------+----------+----------+----------+---------------------------------------------------------------------------------------------
 471003240  | fdaa:0:3335:a7b:1f60:bb88:35f6:2 | standby | ? unreachable | ? fdaa:0:3335:a7b:145:c39f:6620:2 | gru      | 100      |          | host=fdaa:0:3335:a7b:1f60:bb88:35f6:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5
 919100672  | fdaa:0:3335:a7b:136:fa81:fb4d:2  | standby | ? unreachable | ? fdaa:0:3335:a7b:145:c39f:6620:2 | gru      | 100      |          | host=fdaa:0:3335:a7b:136:fa81:fb4d:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5
 1508227546 | fdaa:0:3335:a7b:145:c39f:6620:2  | primary | * running     |                                   | gru      | 100      | 3        | host=fdaa:0:3335:a7b:145:c39f:6620:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5

WARNING: following issues were detected
  - unable to connect to node "fdaa:0:3335:a7b:1f60:bb88:35f6:2" (ID: 471003240)
  - node "fdaa:0:3335:a7b:1f60:bb88:35f6:2" (ID: 471003240) is registered as an active standby but is unreachable
  - unable to connect to node "fdaa:0:3335:a7b:136:fa81:fb4d:2" (ID: 919100672)
  - node "fdaa:0:3335:a7b:136:fa81:fb4d:2" (ID: 919100672) is registered as an active standby but is unreachable

HINT: execute with --verbose option to see connection error messages

# You can see the deleted node IDs are 471003240 and 919100672

postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf standby unregister --node-id 471003240
INFO: connecting to local standby
INFO: connecting to primary database
NOTICE: unregistering node 471003240
INFO: standby unregistration complete

postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf standby unregister --node-id 919100672
INFO: connecting to local standby
INFO: connecting to primary database
NOTICE: unregistering node 919100672
INFO: standby unregistration complete

postgres@5683d946c4018e:/$ repmgr -f /data/repmgr.conf cluster show
 ID         | Name                            | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
------------+---------------------------------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------------
 1508227546 | fdaa:0:3335:a7b:145:c39f:6620:2 | primary | * running |          | gru      | 100      | 3        | host=fdaa:0:3335:a7b:145:c39f:6620:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5

It should take a minute, but your primary will be back and you can now start horizontally scaling your DB again. Now let's check your DB.
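
Reusing the test table from earlier, you can verify that the rows inserted before the target time are back:

$ fly pg connect -a APP_NAME
postgres=# select distinct * from testtbl order by time asc;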

The database is back to normal. Let’s check barman!

$ fly pg barman check -a APP_NAME
Connecting to fdaa:0:3335:a7b:145:4fcf:c8f:2... complete
Server pg:
	PostgreSQL: OK
	superuser or standard user with backup privileges: OK
	PostgreSQL streaming: OK
	wal_level: OK
	replication slot: OK
	directories: OK
	retention policy settings: OK
	backup maximum age: OK (no last_backup_maximum_age provided)
	backup minimum size: OK (29.6 MiB)
	wal maximum age: OK (no last_wal_maximum_age provided)
	wal size: OK (0 B)
	compression settings: OK
	failed backups: OK (there are 0 failed backups)
	minimum redundancy requirements: OK (have 1 backups, expected at least 0)
	pg_basebackup: OK
	pg_basebackup compatible: OK
	pg_basebackup supports tablespaces mapping: OK
	systemid coherence: OK
	pg_receivexlog: OK
	pg_receivexlog compatible: OK
	receive-wal running: OK
	archiver errors: OK

Notes about the barman setup

We shipped a few helper commands to flyctl under fly pg barman, but you can always SSH into the barman machine and make full use of barman itself; just remember to add your server name, which is pg, so commands like fly pg barman check become barman check pg inside your VM. Here are the docs for all available commands.
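
For example, from inside the barman machine (pick it from the VM list):

$ fly ssh console -s -a APP_NAME
root@148ed461ce4d89:/# barman list-backup pg
root@148ed461ce4d89:/# barman check pg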

Also two important files you can play with are barman.conf and barman.cron, both located at the /data directory.

root@148ed461ce4d89:/# cat /data/barman.cron
* * * * * /usr/bin/barman cron

root@148ed461ce4d89:/# cat /data/barman.conf
[barman]
barman_user = root
barman_home = /data/barman.d
log_level = info
log_file = /data/barman.log

[pg]
description =  "Fly.io Postgres Cluster"
conninfo = host=APP_NAME.internal user=repmgr dbname=postgres
streaming_conninfo = host=barman-8th.internal user=repmgr dbname=postgres
backup_method = postgres
streaming_archiver = on
slot_name = barman
create_slot = auto
retention_policy_mode = auto
retention_policy = RECOVERY WINDOW OF 7 days
wal_retention_policy = main

Feel free to edit them as you please; your node will not touch them again unless you delete these files and restart the VM. If you change barman.cron, don't forget to run crontab /data/barman.cron to update your crontab. You can see all available options for your barman.conf file in the official GitHub repo. You'll want to take a special look at how Retention Policies work in barman.

More base backups: to do or not to do

It's your job to ensure barman is running correctly. Having a really old base backup will make restoring the WAL files take longer in a recovery scenario, so it's not a bad idea to take more frequent base backups; you can use barman.cron for that. This will of course require more volume storage, so make sure your barman volume is up to the job; you can always extend it if needed.
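
As a sketch (the weekly schedule here is just an illustration, adjust it to your needs), you could add a scheduled base backup to /data/barman.cron and reload it:

root@148ed461ce4d89:/# cat /data/barman.cron
* * * * * /usr/bin/barman cron
# example: take a weekly base backup on Sundays at 03:00
0 3 * * 0 /usr/bin/barman backup pg
root@148ed461ce4d89:/# crontab /data/barman.cron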

What about Recovery Point Objective?

It's all about the WALs. As mentioned previously, a DB that isn't write-heavy can take a long time to ship WAL files, which could mean losing that data in a disaster recovery scenario. The Postgres docs talk about this in more detail; check out Continuous Archiving and Point-in-Time Recovery (PITR). Fortunately, with WAL streaming (which we set up for you) barman will also receive partial WAL files. If you're interested in this, we recommend taking a look at the barman docs and this post from 2ndQuadrant (they are the barman maintainers).
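
If you want to see the partial WAL streaming in action, peek at the streaming directory on the barman machine (the path assumes the default layout under the barman_home shown above); the segment currently being streamed shows up with a .partial suffix:

root@148ed461ce4d89:/# ls /data/barman.d/pg/streaming/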

See y’all next time!


This is brilliant, works seamlessly too. The only thing missing is a built-in method to export backups (and possibly WAL archives) to external storage such as an S3-compatible object store (ideally archived and compressed) – though I guess that's something that could be done by extending the barman image.


Barman comes with barman-cloud-backup (see BARMAN-CLOUD-BACKUP(1) in the Barman User Manual, version 2.12).

One possible solution is setting up a script under the /data folder so it persists, and adding it to the cron file. I'd recommend using Fly secrets for your S3 credentials.

Here’s an article from 2ndQuadrant: Barman Cloud - Part 2: Cloud Backup - 2ndQuadrant | PostgreSQL


That would be perfect, but it appears not to be installed on the image. Additionally, running it returns a few errors:

Firstly without specifying the pguser or pghost:

$ barman-cloud-backup -e aws:kms --immediate-checkpoint -z s3://$s3_bucket/barman/ pg

2023-07-12 12:22:16,053 [1232] ERROR: Cannot connect to postgres: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory

After specifying user and host:

$ barman-cloud-backup -U repmgr -h $FLY_APP.internal -e aws:kms --immediate-checkpoint -z s3://$s3_bucket/barman/ pg
2023-07-12 12:23:11,045 [1251] ERROR: Backup failed uploading data ([Errno 2] No such file or directory: '/data/postgresql/global/pg_control')
Exception ignored in: <function _Stream.__del__ at 0x7f2cc920d4c0>
Traceback (most recent call last):
  File "/usr/lib/python3.9/tarfile.py", line 410, in __del__
    self.close()
  File "/usr/lib/python3.9/tarfile.py", line 460, in close
    self.fileobj.write(self.buf)
  File "/usr/lib/python3/dist-packages/barman/cloud.py", line 212, in write
    self.buffer = self._buffer()
  File "/usr/lib/python3.9/tempfile.py", line 680, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.9/tempfile.py", line 390, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/barman-cloud-backup-ie8o5tv3/barman-upload-6kusrzce.part'

Sounds like it might be failing because it expects barman to be on one of the PG cluster hosts?

Reading the docs further seems to indicate this is the case:

This script can be used to perform a backup of a local PostgreSQL instance and ship the resulting tarball(s) to the Cloud.

Those are interesting points, especially since this expects the barman config file to be on one of the PG nodes, which is a challenge since we don't sync barman.conf to all machines, just to the barman machine itself.

One workaround I could think of is copying barman.conf to a replica and running the backup there, but you probably wouldn't have an easy way to set up cron there and ensure it's kept after restarts. Fortunately, I've been hearing there's some work on platform features that could help with this in the future. Stay tuned.


Indeed, this was something I noticed when I was looking at rolling my own backup solution when pg-flex was introduced. Pretty much all pg backup solutions need to be directly attached to the postgres tablespace, so they would likely need to be available in the pg-flex image.


This is awesome! Fantastic, really :purple_heart:


Great guide! Was able to get it up and running following these steps.

I’m trying to enable compression so that the WAL files take up less space on the Barman volume.

I set the following in my barman.conf:

[barman]
barman_user = root
barman_home = /data/barman.d
log_level = info
log_file = /data/barman.log
compression = gzip

But not sure if it did the trick. Anyone else set this up yet and have some guidance?


@lubien Is there a way to upgrade to Barman 3.7.0 to do backups to S3?


Would be great if barman would do S3 transfers on its own :raised_hands:t2:


@oleksify we just shipped a new version of Fly Postgres that includes Barman 3.7.0.

Make sure to upgrade your flyctl to at least v0.1.87; we found a bug that prevented barman machines from being updated.

Run fly image update -a YOUR_DB_APP_NAME.


Oddly, even after the update there's no barman-cloud-backup CLI on the machine. I wonder if I should manually install it?

In the latest image (0.0.45), the barman-cloud-backup still does not exist.


Any update on this? Barman 3.7 with latest Postgres Image here, but cloud-backup doesn’t work.

Apologies for that, it seems our sync job did not catch the latest image, so I pushed a new one:

fly image update -a DB_APP_NAME should work now. Just tried it:

~/workspace/barmanquicktest
[I] ➜ fly image update
The following changes will be applied to all Postgres machines.
Machines not running the official Postgres image will be skipped.

  	... // 47 identical lines
  	    }
  	  },
- 	  "image": "registry-1.docker.io/flyio/postgres-flex:15.3@sha256:5e5fc53decb051f69b0850f0f5d137c92343fcd1131ec413015e5260620947ac",
+ 	  "image": "registry-1.docker.io/flyio/postgres-flex:15.3@sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6",
  	  "restart": {
  	    "policy": "always"
  	... // 7 identical lines

? Apply changes? Yes
Identifying cluster role(s)
  Machine 4d891627f45e38: barman
  Machine 91857502a10148: primary
Updating machine 4d891627f45e38
Machine 4d891627f45e38 updated successfully!
Updating machine 91857502a10148
  Waiting for 91857502a10148 to become healthy (started, 3/3)
Machine 91857502a10148 updated successfully!
Postgres cluster has been successfully updated!

~/workspace/barmanquicktest took 1m 15s
[I] ➜ fly ssh console -s
? Select VM: gru: 4d891627f45e38 fdaa:0:3335:a7b:145:1ef6:4fb9:2 barman (barman)
Connecting to fdaa:0:3335:a7b:145:1ef6:4fb9:2... complete
root@4d891627f45e38:/# barman-cloud-backup --version
barman-cloud-backup 3.9.0

Works like a charm, thank you!


I think what’s really missing here right now is the fact that the Fly secrets aren’t available to the barman cron process – so if using pre_archive_script to relay WALs to offsite storage, it doesn’t work if your credentials are stored in fly secrets.

I’ve tried various things here to figure out a better strategy to push backups and WALs off-site.

  1. Update postgresql.conf to set archive_mode and archive_command to use barman-cloud-wal-archive (see the sketch after this list) – this isn't ideal because it requires manually modifying postgresql.conf.
  2. Find a way to pass the env vars for Fly secrets down to the barman cron process – I've had no luck with this so far though; this is a common issue with cron tasks on Fly.
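
For reference, option 1 would look roughly like this (a sketch only, not something I've run on Fly; the bucket name is hypothetical and barman-cloud-wal-archive needs to be installed on the PG node with credentials it can read):

# postgresql.conf (sketch)
archive_mode = on
archive_command = 'barman-cloud-wal-archive s3://my-bucket/barman pg %p'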

Additionally barman on fly comes configured using a simple, but ultimately inefficient configuration which requires full, uncompressed backups on a local volume for the retention window. This means for a 30 day retention window your barman volume needs to be over 30× the size of your postgres volumes.

I'm coming to the conclusion that barman as configured by Fly isn't really production-suitable, as even a very small database will cost a lot to keep backups for, since there's no incremental backup support.

This was all much easier with wal-g which “just worked” and was able to take backups and archive compressed WALs without a massive local volume or extra manual configuration.


Anyone run into ssh trouble when doing this? When I try to run the recover command

fly pg barman recover <postgres_machine_id> -a <app_name> -b latest --target-time "2024-03-08 14:17:32.992288"

I get

Connecting to <ipv6>... complete
ERROR: Unable to connect to the target host using the command 'ssh root@<postgres_machine_id>.vm.<app_name>.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'

I get the same error if I ssh into the barman instance and run the ssh command by hand:

root@<barman_machine_id>:/# ssh root@<postgres_machine_id>.vm.<app_name>.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null

Except it also spits out this info:

Warning: Permanently added '<postgres_machine_id>.vm.<app_name>.internal' (ED25519) to the list of known hosts.
root@<postgres_machine_id>.vm.<app_name>.internal: Permission denied (publickey).

I tried copying the contents of /root/.ssh/id_rsa-cert.pub on the barman machine into the /root/.ssh/authorized_keys file on the postgres machine, but I get the same error. I even tried generating a new ssh key pair, but same error.

Hi… If you scan through fly logs -a db-app-name, do you see "not a cert" complaints from the SSH daemon?

(The overall setting is similar, but not identical.)

Yes. I read through the linked article and tried to ssh from the barman machine into the primary using the directions in that thread but got:

hostkeys_find_by_key_hostfile: hostkeys_foreach failed for /data/.ssh/known_hosts: Permission denied
...
Failed to add the host to the list of known hosts (/data/.ssh/known_hosts).