Fly Postgres Backup Tweaks

Hi community,

I’m running an unmanaged Fly Postgres cluster (2 machines: primary + replica) with backups to Tigris enabled.

I have two questions about backup configuration:

1. Can I configure what time of day backups run?

Currently, backups run around midday, and I’d prefer to schedule them during off-peak hours (e.g., 2-3 AM UTC).

I’ve explored fly postgres backup config update, which allows setting:

  • --full-backup-frequency (24h)
  • --archive-timeout
  • --recovery-window
  • --minimum-redundancy

However, there’s no option to specify the actual time of day when backups execute.
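For reference, the kind of change that is possible today looks roughly like this (the values are just illustrative, and <your-pg-app> is a placeholder for the Postgres app name):

fly postgres backup config update \
  --full-backup-frequency 24h \
  --archive-timeout 1h \
  --recovery-window 7d \
  --minimum-redundancy 3 \
  -a <your-pg-app>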

Questions:

  • Is there a way to configure the backup schedule time through the CLI that I’m missing?
  • Can I manually modify the Barman cron configuration to set a specific time?
    • Such changes wouldn’t persist across machine restarts/updates, right?
  • If it isn’t through Barman cron, how is the backup schedule managed?
  • Is this something support can help configure on a per-instance basis?

2. Can backups run from the replica instead of the primary?

I notice some CPU spikes on the primary during backups.

A good practice would be to run backups from a standby/replica to minimize load on the primary. Is this possible with our Barman setup?

Any guidance/help would be greatly appreciated! Thanks!

Hi… Do you actually see a mention of cron in the logs around the time the full backup runs?

(I thought the new fly pg backup style instead used a handcrafted Go loop, with an ad hoc timer, etc., :stopwatch:, but I could be wrong about that…)

1 Like

Ah, that’s a good point! I don’t see mention of cron in the logs. So yeah, definitely something custom.

Then the question is, would an on-demand backup reset the time the next backup runs? Let’s see…
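(For anyone following along, the on-demand run I’m testing with is just the CLI one, something like

fly postgres backup create -a <your-pg-app>

though double-check fly postgres backup --help in case I have the subcommand name slightly wrong.)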

If it’d help, I’m using GitHub Actions to back up Postgres to Tigris. Instead of backup.py you can use your own custom script or just pg_dump. Here is the code:

name: Fly backup database
run-name: Task
on:
  workflow_dispatch:
  schedule:
    - cron: '25 4 * * *'
jobs:
  backup:
    runs-on: ubuntu-latest
    env:
      FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
      FLY_DB_APP: postgres
      PGUSER: user
      PGPASSWORD: ${{ secrets.PGPASSWORD }}
      PGDATABASE: user
      PGHOST: localhost
      PGPORT: 5555
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      AWS_ENDPOINT_URL_S3: https://fly.storage.tigris.dev
      AWS_ENDPOINT_URL_IAM: https://fly.iam.storage.tigris.dev
      AWS_REGION: auto
      AWS_BUCKET: postgres

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install boto3 python-dotenv

      - uses: superfly/flyctl-actions/setup-flyctl@master

      - name: Set filename
        run: echo "filename=$PGDATABASE-$(date -u +"%Y-%m-%d-%H%M%S").dump" >> $GITHUB_ENV

      - name: Dump database and upload to S3
        run: |
          flyctl proxy 5555:5432 -a ${{ env.FLY_DB_APP }} &
          sleep 5
          echo "Dumping database..."
          pg_dump -Fc -f ${{ env.filename }}
          echo "Uploading to S3..."
          python scripts/backup.py upload --bucket ${{ env.AWS_BUCKET }} --source ${{ env.filename }} --destination $PGDATABASE-dumps
          echo "Cleaning up old backups..."
          python scripts/backup.py cleanup --bucket ${{ env.AWS_BUCKET }} --folder $PGDATABASE-dumps --keep 7
          echo "Backup completed successfully!"
2 Likes

Thanks @bira! A GitHub Action to do it manually would indeed work.

I was mainly wondering whether we could tweak the ‘automatic’ one from Fly itself.

Good news! The timer resets when you do an on-demand backup, so backups now land in off-peak hours.

Now to see if we can target a specific machine, or find some other strategy so the load doesn’t fall only on the primary.

1 Like

Glancing at the source code, there are several explicit checks that prevent a replica from being the machine that takes the backup, :crying_cat:.

(Also, there are no override knobs evident at those points. No isPrimary || flagReallyAllowReplica kinds of things.)

My guess is that this was done to avoid having to maintain a second distinguished Machine, with its own, separate distributed-consensus votes, etc. (Most PG Flex clusters have 3 Machines, incidentally; otherwise, you don’t actually have HA.)

There might also be concerns about unknowingly backing up a lagging replica, :snowflake:, and the like.

1 Like

Thanks @mayailurus for the research and info. Very much appreciated :)!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.