Fresh Produce - Enhanced WAL Archiving and Remote Restores

Hey everyone,

We’re excited to announce that Postgres-flex now supports native WAL archiving to S3-compatible storage as well as WAL-based remote restores!

Background

With our previous implementation, WAL archiving could be enabled by spinning up a separate Barman machine that would only archive WAL to disk by default. This helped mitigate issues with primaries falling over due to disk capacity issues often stemming from WAL accumulation. While this effort was a step in the right direction, it had some major drawbacks…

In light of this, we decided to take a different approach that eliminates the need for a separate machine and runs WAL archiving next to the primary. This may require you to scale your resources up slightly to accommodate the added overhead, but it should significantly simplify overall management and provide a much nicer experience for users.

Enabling WAL Archiving to Tigris

This feature has the following constraints:

  • Flyctl >= 0.2.89
  • Postgres Flex >= v0.0.53
  • 512MiB memory (minimum)

New Postgres Provisions

To enable WAL archiving at provision time, specify the --enable-backups flag when running fly pg create.

Example:

fly pg create --name shaun-wal-test --enable-backups

Existing Postgres Apps

Existing Postgres Apps can enable archiving with the following command:

fly pg backups enable

Note - After enabling, you will need to perform a deploy so that the secret propagates.

Behind the scenes, this will do the following:

  • Provision a new Tigris bucket based on the name of your App.
  • Inject an S3_ARCHIVE_CONFIG secret containing the Tigris bucket credentials.
  • Enable WAL archiving within the Flex implementation.
  • Perform an initial base backup.

Listing Backups

Once you have enabled WAL archiving, you can now view your full backups by running the following:

fly pg backups list --app shaun-wal-test
|-----------------|------|--------|--------------------------|--------------------------|
| ID              | NAME | STATUS | END TIME                 | BEGIN WAL                |
|-----------------|------|--------|--------------------------|--------------------------|
| 20240715T151752 |      | DONE   | Mon Jul 15 15:17:54 2024 | 000000010000000000000002 |
|-----------------|------|--------|--------------------------|--------------------------|

On-demand backups

On-demand backups can be performed with or without an alias.

fly pg backups create --name pre-migration --app shaun-wal-test
Performing backup...
Backup completed successfully!
fly pg backups list --app shaun-wal-test
|-----------------|---------------|--------|--------------------------|--------------------------|
| ID              | NAME          | STATUS | END TIME                 | BEGIN WAL                |
|-----------------|---------------|--------|--------------------------|--------------------------|
| 20240715T151752 |               | DONE   | Mon Jul 15 15:17:54 2024 | 000000010000000000000002 |
| 20240715T165707 | pre-migration | DONE   | Mon Jul 15 16:57:08 2024 | 000000010000000000000007 |
|-----------------|---------------|--------|--------------------------|--------------------------|
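
If you want to pick a backup ID or alias out of that listing from a script (for example, to pass it to --restore-target-name), here is a rough Python sketch that parses the table text. The function names are made up for illustration, and reading the ID as a YYYYMMDDTHHMMSS timestamp is an assumption based on the sample output above, not something the CLI guarantees.

```python
from datetime import datetime

# Sample output copied from the listing above.
SAMPLE = """\
|-----------------|---------------|--------|--------------------------|--------------------------|
| ID              | NAME          | STATUS | END TIME                 | BEGIN WAL                |
|-----------------|---------------|--------|--------------------------|--------------------------|
| 20240715T151752 |               | DONE   | Mon Jul 15 15:17:54 2024 | 000000010000000000000002 |
| 20240715T165707 | pre-migration | DONE   | Mon Jul 15 16:57:08 2024 | 000000010000000000000007 |
|-----------------|---------------|--------|--------------------------|--------------------------|
"""


def parse_backup_table(text):
    """Turn the pipe-delimited table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip non-table lines and the |-----| separator rows.
        if not line.startswith("|") or set(line) <= set("|-"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            # First real row is the header: "END TIME" becomes "end_time".
            header = [cell.lower().replace(" ", "_") for cell in cells]
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def backup_started_at(backup_id):
    """Assumption: the backup ID encodes its start time as YYYYMMDDTHHMMSS."""
    return datetime.strptime(backup_id, "%Y%m%dT%H%M%S")


if __name__ == "__main__":
    for row in parse_backup_table(SAMPLE):
        print(row["id"], row["name"] or "(no alias)", row["status"])
```

Backups created without an alias come back with an empty name, so a script should fall back to the ID in that case.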

WARNING - On-demand backups will currently reset the scheduled backup timer. This is a bug and should be fixed soon.

Remote restores

There are a few different ways you can initiate a remote restore.

Flyctl

flyctl pg backups restore --help
Performs a WAL-based restore into a new Postgres cluster.

Usage:
  fly postgres backup restore <destination-app-name> [flags]

Flags:
  -a, --app string                   Application name
  -c, --config string                Path to application configuration file
      --detach                       Return immediately instead of monitoring deployment progress
  -h, --help                         help for restore
      --restore-target-inclusive     Set to true to stop recovery after the specified time, or false to stop
                                     before it (default true)
      --restore-target-name string   ID or alias of backup to restore.
      --restore-target-time string   RFC3339-formatted timestamp up to which recovery will proceed. Example:
                                     2021-07-16T12:34:56Z

We have plans for exposing more restore target options in the future. For additional information on the specified restore targets:

Example:

fly pg backup restore <destination-app> --app <source-app> 

Note - If you don’t specify a restore target, the restore will recover up to the latest data available.
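
Since --restore-target-time expects an RFC3339 timestamp, here is a small Python sketch for producing one in UTC and assembling the restore command shown above. The helper names are hypothetical; the command and flag come from the help output, and nothing here actually runs flyctl.

```python
from datetime import datetime, timedelta, timezone


def rfc3339_utc(moment):
    """Format a timezone-aware datetime like 2021-07-16T12:34:56Z."""
    if moment.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them instead of guessing.
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def restore_command(source_app, destination_app, target_time):
    """Build the argument list for a point-in-time restore (not executed here)."""
    return [
        "fly", "pg", "backup", "restore", destination_app,
        "--app", source_app,
        "--restore-target-time", rfc3339_utc(target_time),
    ]


if __name__ == "__main__":
    # Example: restore to the state from one hour ago.
    one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
    print(" ".join(restore_command("source-db", "restored-db", one_hour_ago)))
```

The returned list can be handed to subprocess.run if you want to execute it; the app names in the example are placeholders.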

Fly Dashboard

When WAL archiving is enabled, a new “Restore DB from backup” button will be surfaced within the Fly Dashboard.

The options are currently limited, but it should be functional for most use-cases.

WARNING - If the specified restore time is outside of the WAL range available, the restore process will fail!

Viewing/Updating WAL Archiving Configuration

As of right now, there are a couple of different tunables that we expose.

Definitions

recovery_window (default: 7d)
Used as a retention policy. Backups older than the specified value will be pruned at regular intervals.

The shortest allowed recovery window is 1d.

Units available:

  • d - Days
  • w - Weeks
  • y - Years

minimum_redundancy (default: 3)
The minimum number of backups that should always be available. Must be >= 0.

full_backup_frequency (default: 24h)
The frequency at which full backups are taken. Must be >= 1h.
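
To catch a bad value before sending it, the limits described so far can be checked client-side. This is only a sketch: the function names are invented, treating a year as 365 days is an assumption, and since the post does not say which units full_backup_frequency accepts beyond the h in its default, the sketch only accepts h. The server remains the source of truth.

```python
import re

# Seconds per unit. A year as 365 days is an assumption of this sketch.
UNIT_SECONDS = {"h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def to_seconds(value, allowed_units):
    """Convert a string such as '7d' to seconds, accepting only the given units."""
    match = re.fullmatch(r"(\d+)([a-z])", value)
    if match is None or match.group(2) not in allowed_units:
        units = ", ".join(allowed_units)
        raise ValueError(f"{value!r} is not a number followed by one of: {units}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def validate_settings(settings):
    """Return a list of problems; an empty list means the values look acceptable."""
    problems = []
    # (setting name, allowed units, documented minimum in seconds)
    duration_rules = [
        ("recovery_window", "dwy", 86400),       # shortest allowed window is 1d
        ("full_backup_frequency", "h", 3600),    # must be >= 1h
    ]
    for name, units, minimum in duration_rules:
        if name not in settings:
            continue
        try:
            if to_seconds(settings[name], units) < minimum:
                problems.append(f"{name} is below the documented minimum")
        except ValueError as exc:
            problems.append(f"{name}: {exc}")
    if "minimum_redundancy" in settings:
        try:
            if int(settings["minimum_redundancy"]) < 0:
                problems.append("minimum_redundancy must be >= 0")
        except ValueError:
            problems.append("minimum_redundancy must be an integer")
    return problems


if __name__ == "__main__":
    print(validate_settings({"recovery_window": "0d", "minimum_redundancy": "3"}))
```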

archive_timeout (default: 60s)
Archiving typically happens once a WAL segment is full (16MiB). This can cause issues for less active databases, so if the WAL segment doesn’t fill within the specified timeout, a WAL switch will occur and force an archive event.

WARNING - Setting the archive timeout too short can have performance implications and bloat object storage.
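
To get a feel for the storage side of that warning, here is a back-of-the-envelope calculation in Python. It assumes the worst case: the database sees just enough activity that a switch is forced at every timeout, and each archived segment is stored at its full 16MiB size. In practice, idle periods and any compression applied during archiving should bring the real number well below this, so treat it as an upper bound only.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # 16MiB segment size, as stated above
SECONDS_PER_DAY = 86400


def max_forced_segments_per_day(archive_timeout_seconds):
    """Upper bound on timeout-driven WAL switches in one day."""
    return SECONDS_PER_DAY // archive_timeout_seconds


def max_archive_bytes_per_day(archive_timeout_seconds):
    """Upper bound on uncompressed bytes archived per day from forced switches."""
    return max_forced_segments_per_day(archive_timeout_seconds) * WAL_SEGMENT_BYTES


if __name__ == "__main__":
    for timeout in (10, 60, 300):
        gib = max_archive_bytes_per_day(timeout) / 2**30
        print(f"archive_timeout={timeout}s -> up to {gib:.1f} GiB/day uncompressed")
```

With the default of 60s, that works out to at most 1,440 forced segments, or 22.5 GiB per day before any compression.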

View your current configuration

First, you can view the existing settings by running the following curl command:

curl http://<app-name>.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
    "result": {
        "archive_timeout": "60s",
        "full_backup_frequency": "24h",
        "minimum_redundancy": "3",
        "recovery_window": "7d"
    }
}

Update your configuration

To update a configuration option, you can run:

curl http://<app-name>.internal:5500/commands/admin/settings/update/barman -d '{"recovery_window": "1d"}'
{"result":{"message":"Updated","restart_required":true}}

Then when we re-run the view command we can see the recovery window has been updated:

curl http://<app-name>.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
    "result": {
        "archive_timeout": "60s",
        "full_backup_frequency": "24h",
        "minimum_redundancy": "3",
        "recovery_window": "1d"
    }
}
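
If curl isn’t handy, the same two calls can be made from Python’s standard library. This is a sketch under a few assumptions: it only works from somewhere that can resolve the .internal address, the helper names are invented, and the update is sent as a POST body the same way curl -d does. It has not been checked against the Flex source.

```python
import json
import urllib.request


def settings_url(app_name, action):
    """Build the admin endpoint used above; action is 'view' or 'update'."""
    return f"http://{app_name}.internal:5500/commands/admin/settings/{action}/barman"


def build_update_request(app_name, changes):
    """Prepare a POST carrying the changed settings as a JSON body."""
    body = json.dumps(changes).encode("utf-8")
    return urllib.request.Request(settings_url(app_name, "update"), data=body, method="POST")


def parse_result(raw):
    """Both endpoints wrap their payload in a top-level 'result' object."""
    return json.loads(raw)["result"]


def view_settings(app_name, timeout=10):
    with urllib.request.urlopen(settings_url(app_name, "view"), timeout=timeout) as resp:
        return parse_result(resp.read())


def update_settings(app_name, changes, timeout=10):
    request = build_update_request(app_name, changes)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_result(resp.read())


if __name__ == "__main__":
    # Building a request does not touch the network, so this is safe to run anywhere.
    request = build_update_request("my-db", {"recovery_window": "1d"})
    print(request.get_method(), request.full_url, request.data.decode())
```

Checking restart_required in the update response tells you whether the change still needs a restart to take effect.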

This is just a starting point and we hope to improve this process in the near future. Changing these settings currently requires a cluster restart, but it should be possible to configure them at runtime in the future.

Let us know what you think!

There’s a lot here and we already have quite a few improvements queued up for the next iteration. In any case, if you have any questions or feedback on this process, we’d love to hear them!

This is fantastic news, been waiting for this ever since the switch from stolon+walg to repmgr+barman!

One question: how do I enable this with other S3-compatible storage (such as S3 itself) instead of Tigris? We’re already pushing WALs from barman to S3.

Shaun can confirm but I don’t think this is possible right now short of manually changing the S3_ARCHIVE_CONFIG secret on the Machine which is in the form of:

https://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@fly.storage.tigris.dev/<BUCKET_NAME>/<FLY_APP_NAME>
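
For anyone assembling that secret by hand, here is a rough Python sketch that splits a URL of this shape into its parts so you can sanity-check it before setting it. It assumes the two credential fields are the access key ID and the secret access key, in that order, and that special characters in them are percent-encoded. It is not the parsing code Flex itself uses.

```python
from urllib.parse import unquote, urlsplit


def parse_archive_config(url):
    """Split an S3_ARCHIVE_CONFIG-style URL into endpoint, credentials, bucket and prefix."""
    parts = urlsplit(url)
    if not parts.username or not parts.password:
        raise ValueError("expected both an access key ID and a secret access key")
    host = parts.hostname + (f":{parts.port}" if parts.port else "")
    # The first path segment is the bucket; the rest is the directory inside it.
    bucket, _, prefix = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("expected a bucket name in the path")
    return {
        "endpoint": f"{parts.scheme}://{host}",
        "access_key_id": unquote(parts.username),
        "secret_access_key": unquote(parts.password),
        "bucket": bucket,
        "prefix": prefix,
    }


if __name__ == "__main__":
    # Placeholder credentials, not real keys.
    example = "https://EXAMPLEKEYID:examplesecret@fly.storage.tigris.dev/my-bucket/my-app"
    config = parse_archive_config(example)
    print(config["endpoint"], config["bucket"], config["prefix"])
```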

As of right now, you would need to manually set the secret within your PG app. Some high-level instructions can be found here: WAL-Archiving + PITR by davissp14 · Pull Request #231 · fly-apps/postgres-flex · GitHub

However, once you set the endpoint and confirm everything is working, the fly pg backups ... commands should work the same.

Let me know how it goes!

Okay, just one issue. If not using hard-coded credentials (e.g. using Fly as an OIDC identity provider to assume roles) then setting S3_ARCHIVE_CONFIG with no keys causes machines to fail at startup.

Looks like this only works with static AWS keys which is a no-go for me. Should I open an issue on the repo about this?

If you wouldn’t mind, that would be super helpful!

fly pg backups create --name move-to-scratch --app rotator-api-db
Exit code: 1
Performing backup...
failed to create backup: signal: killed
Error: failed to create backup: signal: killed

We are not able to create backups.

@danwetherald Will take a look!

@danwetherald You need to scale memory up to 512Mib, otherwise you’re gonna OOM. If things fail, the App logs should help provide context.

Thanks, I resorted to a different solution, not sure this would have worked anyways.

The post specifies a 512Mib minimum and I was able to confirm you were OOM’ing by looking at your logs. Happy to hear you got something figured out though.

Yes, thanks.

This looks great!
How would I know if I have “postgres flex”? Is that what happens when you use flyctl to make a db (not with supabase)?

I’ve tried to enable backups, but I’m getting a response about “malformed version”

$ fly pg backups enable -a {my pg app} 
Error: Malformed version:

I’ve upgraded flyctl, so guess it could be postgres-flex version, is there a way to upgrade that? or see what it is?

If you run fly m info you’ll see something like flyio/postgres-flex:15.6 (v0.0.51) under the IMAGE heading.
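
If you want to check that string programmatically, here is a small Python sketch that parses the IMAGE value and compares it against the v0.0.53 minimum from the original post. The function names are invented, and the assumed layout (repository:tag followed by the version in parentheses) is based only on the examples in this thread.

```python
import re

MIN_FLEX_VERSION = (0, 0, 53)  # minimum Postgres Flex version from the announcement

IMAGE_PATTERN = re.compile(
    r"(?P<repo>[\w./-]+):(?P<tag>[\w.-]+) \(v(?P<version>\d+\.\d+\.\d+)\)"
)


def parse_image(image):
    """Split e.g. 'flyio/postgres-flex:15.6 (v0.0.51)' into (repo, tag, version tuple)."""
    match = IMAGE_PATTERN.fullmatch(image.strip())
    if match is None:
        raise ValueError(f"unrecognised image string: {image!r}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("repo"), match.group("tag"), version


def supports_wal_archiving(image):
    """True only for the flex image at or above the minimum version."""
    repo, _, version = parse_image(image)
    return repo == "flyio/postgres-flex" and version >= MIN_FLEX_VERSION


if __name__ == "__main__":
    for image in ("flyio/postgres-flex:15.6 (v0.0.51)", "flyio/postgres:14.6 (v0.0.41)"):
        print(image, "->", supports_wal_archiving(image))
```

Both sample strings from this thread come back False: the first is flex but older than v0.0.53, and the second is not the flex image at all.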

Great thanks!
So looks like it’s not flex, or is this an earlier version? Is it easy to upgrade the image?

flyio/postgres:14.6 (v0.0.41)

When I migrated it wasn’t totally straightforward, but that may have changed. I’ll let someone from Fly chime in on what the current process is…

This seems to work for moving from old Fly Postgres to Postgres Flex

Make a new DB with a different name

fly pg create --name {app}-db --enable-backups

Scale memory

fly m list # to get machine id
fly machine update --vm-memory 512 {machine_id}

Get old DATABASE_URL from a running app

fly ssh console --app {app that uses the db}
env | grep DATABASE_URL

Use the import tool to import the existing database to the new DB instance

fly postgres import {DB_URL from above} -a {app}-db

I wasn’t able to directly attach, and detach didn’t work, so I found that you can attach it to the app, but with a different secret name initially.

fly postgres attach {app}-db -a {app} --variable-name DATABASE_URL_2

Override to the new DATABASE_URL

fly secrets set -a {app} DATABASE_URL={NEW DATABASE_URL}

If that updates, check the app still works, and stop the old DB instance

fly m list -a {old DB app name}
fly m stop {machine_id}

If everything still works now, the old DB instance could be deleted, but I’m going to sleep on it first!
You can also now remove the temporary secret that the DB was attached with:

fly secrets unset -a {app} DATABASE_URL_2

And now the database is running with the flex image!

That’s not great, will see what we can do to improve the error messaging there.

I’m getting stuck with 2 of our 3 database clusters:

fly pg backups enable -a myapp-demo-db
Error: backups are already enabled
fly pg backups create -a myapp-demo-db
backups are not enabled

Note that this is not the same response as Error: backups are not enabled. Run fly pg backup enable ...

My suspicion is that this has something to do with the update from 15.6 to 15.7 that was required for the 2 clusters (the other one was already on 15.7).

The steps I’ve taken:

fly pg backups enable -a myapp-demo-db

message about updating…

flyctl image update -a myapp-demo-db
fly pg backups enable -a myapp-demo-db

Deploy app.
Buckets are created in Tigris but no files.

With the curl call I’m also getting “barman is not enabled”.

I didn’t follow your steps exactly but I was able to upgrade a cluster created with backups enabled from 15.6 to 15.7 and create a backup. Initially I thought the image upgrade might not bring the creds along, but that doesn’t seem to be the case.

One step missing from your repro after enabling backups is:
fly secrets deploy -a <app-name>
The check triggering your error message is just return os.Getenv("S3_ARCHIVE_CONFIG") != "". If that variable is unset in your target cluster, you’ll get that error message when flexctl fails to create a backup. I can’t say for sure, but if the secrets were never deployed, the environment would likely not be populated, and if you upgrade the image without deploying secrets, the variable might not be available in the new environment. It might also be possible to run fly secrets deploy post-upgrade, but I haven’t verified that.

Good luck. If you did in fact deploy secrets but forgot to include that, please let me know and I’ll dig into this further.