Hey everyone,
We’re excited to announce that Postgres-flex now supports native WAL archiving to S3-compatible storage, as well as WAL-based remote restores!
Background
With our previous implementation, WAL archiving could be enabled by spinning up a separate Barman machine that would only archive WAL to disk by default. This helped mitigate issues with primaries falling over due to disk capacity issues often stemming from WAL accumulation. While this effort was a step in the right direction, it had some major drawbacks…
In light of this, we decided to take a different approach that eliminates the need for a separate machine and runs WAL archiving next to the primary. This may require you to scale your resources up slightly to accommodate the added overhead, but it should significantly simplify overall management and provide a much nicer experience for users.
Enabling WAL Archiving to Tigris
This feature has the following constraints:
- Flyctl >= 0.2.89
- Postgres Flex >= v0.0.53
- 512MiB Memory (Minimum)
New Postgres Provisions
To enable WAL archiving at provision time, simply specify the --enable-backups flag when running fly pg create.
Example:
fly pg create --name shaun-wal-test --enable-backups
Existing Postgres Apps
Existing Postgres Apps can enable archiving with the following command:
fly pg backups enable
Note - You will be required to perform a deploy after enabling to propagate the secret.
Behind the scenes, this will do the following:
- Provisions a new Tigris bucket based on the name of your App.
- Injects an S3_ARCHIVE_CONFIG secret containing the Tigris bucket credentials.
- Enables WAL Archiving within the Flex implementation.
- Performs an initial base backup.
Listing Backups
Once you have enabled WAL archiving, you can now view your full backups by running the following:
fly pg backups list --app shaun-wal-test
|-----------------|------|--------|--------------------------|--------------------------|
| ID | NAME | STATUS | END TIME | BEGIN WAL |
|-----------------|------|--------|--------------------------|--------------------------|
| 20240715T151752 | | DONE | Mon Jul 15 15:17:54 2024 | 000000010000000000000002 |
|-----------------|------|--------|--------------------------|--------------------------|
On-demand backups
On-demand backups can be performed with or without an alias.
fly pg backups create --name pre-migration --app shaun-wal-test
Performing backup...
Backup completed successfully!
fly pg backups list --app shaun-wal-test
|-----------------|---------------|--------|--------------------------|--------------------------|
| ID | NAME | STATUS | END TIME | BEGIN WAL |
|-----------------|---------------|--------|--------------------------|--------------------------|
| 20240715T151752 | | DONE | Mon Jul 15 15:17:54 2024 | 000000010000000000000002 |
| 20240715T165707 | pre-migration | DONE | Mon Jul 15 16:57:08 2024 | 000000010000000000000007 |
|-----------------|---------------|--------|--------------------------|--------------------------|
WARNING - On-demand backups will currently reset the scheduled backup timer. This is a bug and should be fixed soon.
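If you want to script against the list output, here is a rough Python sketch that parses the table into rows. It assumes the pipe-delimited layout shown above stays stable, which is not guaranteed. The helper names (parse_backups, find_backup, latest_completed) are made up for this example and are not part of flyctl.

```python
from datetime import datetime

# Sample text copied from the `fly pg backups list` output shown above.
SAMPLE = """\
|-----------------|---------------|--------|--------------------------|--------------------------|
| ID | NAME | STATUS | END TIME | BEGIN WAL |
|-----------------|---------------|--------|--------------------------|--------------------------|
| 20240715T151752 | | DONE | Mon Jul 15 15:17:54 2024 | 000000010000000000000002 |
| 20240715T165707 | pre-migration | DONE | Mon Jul 15 16:57:08 2024 | 000000010000000000000007 |
|-----------------|---------------|--------|--------------------------|--------------------------|
"""


def parse_backups(table_text):
    """Turn the pipe-delimited table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip separator rows and anything that is not part of the table.
        if not line.startswith("|") or line.startswith("|-"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first real row is the column header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def find_backup(rows, id_or_alias):
    """Look a backup up by its ID or by its alias (the NAME column)."""
    for row in rows:
        if id_or_alias in (row["ID"], row["NAME"]):
            return row
    return None


def latest_completed(rows):
    """Return the most recent backup whose STATUS is DONE, or None."""
    done = [row for row in rows if row["STATUS"] == "DONE"]
    if not done:
        return None
    # Assumption: IDs are timestamps in YYYYMMDDTHHMMSS form, as in the sample.
    return max(done, key=lambda row: datetime.strptime(row["ID"], "%Y%m%dT%H%M%S"))
```

With the sample above, find_backup(parse_backups(SAMPLE), "pre-migration") returns the row whose ID is 20240715T165707. Backups created without an alias simply have an empty NAME.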
Remote restores
There are a few different ways you can initiate a remote restore.
Flyctl
flyctl pg backups restore --help
Performs a WAL-based restore into a new Postgres cluster.
Usage:
fly postgres backup restore <destination-app-name> [flags]
Flags:
-a, --app string Application name
-c, --config string Path to application configuration file
--detach Return immediately instead of monitoring deployment progress
-h, --help help for restore
--restore-target-inclusive Set to true to stop recovery after the specified time, or false to stop
before it (default true)
--restore-target-name string ID or alias of backup to restore.
--restore-target-time string RFC3339-formatted timestamp up to which recovery will proceed. Example:
2021-07-16T12:34:56Z
We have plans for exposing more restore target options in the future. For additional information on the specified restore targets:
Example:
fly pg backup restore <destination-app> --app <source-app>
Note - If you don’t specify a restore target, the restore will recover up to the latest data available.
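For anyone scripting point-in-time restores, the main thing to get right is the RFC3339 timestamp. Below is a small Python sketch that formats one and assembles the argument list, using only the flags from the help output above. The function names are hypothetical, and nothing here runs the command; you would hand the list to subprocess.run yourself.

```python
from datetime import datetime, timedelta, timezone


def rfc3339_utc(moment):
    """Format a timezone-aware datetime as an RFC3339 UTC timestamp."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def restore_command(source_app, destination_app, target_time=None, target_name=None):
    """Build the argument list for a WAL-based restore into a new cluster."""
    cmd = ["fly", "pg", "backup", "restore", destination_app, "--app", source_app]
    if target_name is not None:
        # ID or alias of the backup to restore
        cmd += ["--restore-target-name", target_name]
    if target_time is not None:
        # Timestamp up to which recovery will proceed
        cmd += ["--restore-target-time", rfc3339_utc(target_time)]
    return cmd


# Example: restore to the state from 30 minutes ago.
half_hour_ago = datetime.now(timezone.utc) - timedelta(minutes=30)
print(" ".join(restore_command("<source-app>", "<destination-app>", target_time=half_hour_ago)))
```

Converting to UTC before formatting avoids surprises when the script runs on a machine in a different timezone.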
Fly Dashboard
When WAL archiving is enabled, a new “Restore DB from backup” button will be surfaced within the Fly Dashboard.
The options are currently limited, but it should be functional for most use-cases.
WARNING - If the specified restore time is outside of the WAL range available, the restore process will fail!
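To catch an out-of-range target before kicking off a restore, a rough pre-check like the one below can help. It relies on general Postgres behavior: point-in-time recovery needs a base backup that finished before the target time. It also assumes the END TIME values from the backup list are in UTC, which the post does not confirm. The function names are hypothetical, and the real WAL range also depends on retention and on how far archiving has progressed, so treat a pass as "plausible" rather than "guaranteed".

```python
from datetime import datetime, timezone

# Matches END TIME values such as "Mon Jul 15 15:17:54 2024".
END_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_end_time(text):
    """Parse an END TIME cell. Assumption: the value is in UTC."""
    return datetime.strptime(text, END_TIME_FORMAT).replace(tzinfo=timezone.utc)


def check_restore_target(target, backup_end_times, now=None):
    """Return None if the target looks restorable, else a short reason string."""
    now = now or datetime.now(timezone.utc)
    if not backup_end_times:
        return "no completed backups available"
    earliest = min(backup_end_times)
    if target < earliest:
        return f"target is before the oldest backup finished ({earliest.isoformat()})"
    if target > now:
        return "target is in the future"
    return None
```

Keep in mind that the most recent WAL may not be archived yet, so a target within the last archive_timeout interval can still fall outside what is actually available.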
Viewing/Updating WAL Archiving Configuration
As of right now, there are a few different tunables that we expose.
Definitions
recovery_window (default: 7d)
Used as a retention policy. Backups older than the specified value will be pruned at regular intervals.
The shortest allowed recovery window is 1d.
Units available:
- d - Days
- w - Weeks
- y - Years
minimum_redundancy (default: 3)
The minimum number of backups that should always be available. Must be >= 0.
full_backup_frequency (default: 24h)
The frequency at which full backups are taken. Must be >= 1h.
archive_timeout (default: 60s)
Archiving typically happens once the WAL segment is full (16 MiB). However, this can cause issues for less active databases: a segment may take a long time to fill, leaving recent changes unarchived. So if the WAL segment doesn’t fill by the specified timeout, a WAL switch will occur and force an archive event.
WARNING - Setting the archive timeout too short can have performance implications and bloat object storage.
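To put a number on the storage concern, here is a back-of-the-envelope Python calculation. It models a low-traffic database that writes a little WAL in every interval but never fills a segment, so every switch is timeout-driven. In Postgres a force-switched segment is still a full 16 MiB file, so the figure is the ceiling for timeout-driven archiving before any compression the archiver might apply (the post does not say whether it compresses). The function names are made up for this example.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # WAL segment size from the post: 16 MiB
SECONDS_PER_DAY = 24 * 60 * 60


def max_forced_segments_per_day(archive_timeout_seconds):
    """Upper bound on timeout-driven WAL switches in one day."""
    return SECONDS_PER_DAY // archive_timeout_seconds


def max_archive_bytes_per_day(archive_timeout_seconds):
    """Uncompressed size of those segments, in bytes."""
    return max_forced_segments_per_day(archive_timeout_seconds) * SEGMENT_BYTES


for timeout in (60, 300):
    gib = max_archive_bytes_per_day(timeout) / 2**30
    print(f"archive_timeout={timeout}s: up to {max_forced_segments_per_day(timeout)} segments/day, {gib} GiB")
# archive_timeout=60s: up to 1440 segments/day, 22.5 GiB
# archive_timeout=300s: up to 288 segments/day, 4.5 GiB
```

A busy database fills segments on its own and can archive more than this, and a completely idle one does not switch at all, so real usage will vary.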
View your current configuration
First, you can view the existing settings by running the following curl command:
curl http://<app-name>.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
"result": {
"archive_timeout": "60s",
"full_backup_frequency": "24h",
"minimum_redundancy": "3",
"recovery_window": "7d"
}
}
Update your configuration
To update a configuration option, you can run:
curl http://<app-name>.internal:5500/commands/admin/settings/update/barman -d '{"recovery_window": "1d"}'
{"result":{"message":"Updated","restart_required":true}}
Then when we re-run the view command we can see the recovery window has been updated:
curl http://<app-name>.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
"result": {
"archive_timeout": "60s",
"full_backup_frequency": "24h",
"minimum_redundancy": "3",
"recovery_window": "1d"
}
}
This is just a starting point and we hope to improve this process in the near future. Changing these settings currently requires a cluster restart, but it should be possible to configure them at runtime in the future.
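Since every change currently costs a restart, it can be worth sanity-checking values before sending them. Here is a minimal Python sketch that checks a settings dict against the constraints listed under Definitions and builds the JSON body for the update endpoint. Two assumptions to flag: the post only shows "24h" and "60s" for the non-retention settings, so the accepted unit set (s, m, h) is a guess, and the function names are hypothetical. The server remains the source of truth.

```python
import json
import re

DAYS_PER_UNIT = {"d": 1, "w": 7, "y": 365}       # recovery_window units from the post
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}  # assumed units for the other durations


def _parse(value, units):
    """Parse '<number><unit>' into base units, or return None if malformed."""
    match = re.fullmatch(r"(\d+)([a-z])", str(value))
    if match is None or match.group(2) not in units:
        return None
    return int(match.group(1)) * units[match.group(2)]


def validate_settings(settings):
    """Return a list of problems; an empty list means the values look fine."""
    errors = []
    for key, value in settings.items():
        if key == "recovery_window":
            days = _parse(value, DAYS_PER_UNIT)
            if days is None or days < 1:
                errors.append("recovery_window must be at least 1d (units: d, w, y)")
        elif key == "minimum_redundancy":
            if not str(value).isdigit():
                errors.append("minimum_redundancy must be an integer >= 0")
        elif key == "full_backup_frequency":
            seconds = _parse(value, SECONDS_PER_UNIT)
            if seconds is None or seconds < 3600:
                errors.append("full_backup_frequency must be at least 1h")
        elif key == "archive_timeout":
            seconds = _parse(value, SECONDS_PER_UNIT)
            if seconds is None or seconds <= 0:
                errors.append("archive_timeout must be a positive duration")
        else:
            errors.append(f"unknown setting: {key}")
    return errors


def build_update_payload(settings):
    """Validate, then serialize the settings as the JSON body for the update call."""
    errors = validate_settings(settings)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(settings)
```

The string returned by build_update_payload({"recovery_window": "1d"}) is what you would pass to curl with -d in the update command above.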
Let us know what you think!
There’s a lot here and we already have quite a few improvements queued up for the next iteration. In any case, if you have any questions or feedback on this process, we’d love to hear them!