Actually, I just tested creating a PG 15.6 instance with backups disabled, enabling them, and upgrading it without a restart. It worked and I was able to create a backup, so we do seem to be handling upgrades correctly.
I’d still try redeploying secrets though. That failing check explicitly looks for the S3_ARCHIVE_CONFIG variable in your environment, and secret deployment is the step that puts it there.
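If you want to sanity-check whether the secret actually made it onto a machine (rather than just being staged), something along these lines should show it; the app name is a placeholder:
fly ssh console -a <your-db-app> -C "printenv S3_ARCHIVE_CONFIG"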
Deploying the secrets on the database cluster did the trick.
I didn’t get that from the line “You will be required to perform a deploy after enabling to propagate the secret.” I had done a deploy of the app, but apparently I should have deployed the database app instead.
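In case it helps anyone else, what worked for me was running the deploy against the database app rather than my application, roughly:
fly secrets deploy -a <database-app-name>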
Ugh, sorry about that. I’d love to do it automatically, but it’s a bit gnarlier to do than it might seem, so for now the manual redeploy is necessary. Weird that the upgrade didn’t trigger it, though.
Oh, sorry about that. We changed our version-checking logic recently and just fixed a pesky missing nil check.
Should be fixed in the upcoming flyctl release on Monday, but if you need it working sooner, it should be fine once #3813 merges. Let me know if it’s still broken after that and I’ll dig in again.
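Once that release lands, upgrading flyctl should pick it up:
fly version upgrade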
I have enabled backups (yay!). I had to destroy the barman machine first, because it was running on an incompatible image. There was a small issue with secrets:
> fly pg backups enable -a <..>
<...>
Backups enabled. Run `fly secrets deploy -a <...>` to restart the cluster with the new configuration.
> fly secrets deploy -a <...>
Error: no machines available to deploy
'fly secrets deploy' will only work if the app has been deployed and there are machines available
Try 'fly deploy' first
but it seems the secrets were set; at least `fly secrets list` shows them.
What does `fly status -a <...>` show? Maybe your machines are stopped and it’s getting confused? If the machines are stopped, they should have the secret applied on the next start.
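If they do turn out to be stopped, starting them should apply the staged secrets, e.g. (machine ID is a placeholder):
fly machine start <machine-id> -a <...>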
`fly status` shows that the primary and both replicas are started:
➜ fly status
ID STATE ROLE REGION CHECKS IMAGE CREATED UPDATED
<...> started primary ams 3 total, 3 passing flyio/postgres-flex:15.7 (v0.0.60) 2023-06-25T12:08:36Z 2024-08-18T16:07:10Z
<...> started replica ams 3 total, 3 passing flyio/postgres-flex:15.7 (v0.0.60) 2024-05-16T16:23:13Z 2024-08-18T16:11:13Z
<...> started replica ams 3 total, 3 passing flyio/postgres-flex:15.7 (v0.0.60) 2023-08-09T14:23:43Z 2024-08-18T16:06:20Z
➜ fly secrets deploy
Error: no machines available to deploy
'fly secrets deploy' will only work if the app has been deployed and there are machines available
Try 'fly deploy' first
➜ fly secrets list
NAME DIGEST CREATED AT
AWS_ACCESS_KEY_ID <...> 14h58m ago
AWS_ENDPOINT_URL_S3 <...> 14h58m ago
AWS_REGION <...> 14h58m ago
AWS_SECRET_ACCESS_KEY <...> 14h58m ago
BUCKET_NAME <...> 14h58m ago
FLY_CONSUL_URL <...> Feb 26 2023 10:47
OPERATOR_PASSWORD <...> Feb 26 2023 10:47
REPL_PASSWORD <...> Feb 26 2023 10:47
S3_ARCHIVE_CONFIG <...> 14h58m ago
SSH_CERT <...> Feb 26 2023 10:47
SSH_KEY <...> Feb 26 2023 10:47
SU_PASSWORD <...> Feb 26 2023 10:47
I’d like to chime in and say that I also had the same issue as @Elder with the `Error: no machines available to deploy` error. Fortunately, updating the image from v0.0.58 to v0.0.60 caused the secrets to finally get set, so backups are working, but I still get the same error afterwards, and the dashboard still shows the secrets as only staged and not deployed.
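For reference, the image update itself was just the usual command (app name omitted), something like:
fly image update -a <your-db-app>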
This feature is really exciting! Looks like it’s gonna save us a lot of hassle. I’ve got backups working for all of my staging clusters, but I decided to test restoring from a backup before enabling this for my production clusters, and that’s when I ran into an issue. For one of my clusters, when I press the “Restore DB from backup” button, it creates the new app as expected, but the new machine never gets to a healthy state. In the live logs I’m seeing things like
[info] restore | ERROR: Connection problem with ssh
and
[info] restore | 2024-08-28 22:53:49.737 UTC [359] LOG: invalid checkpoint record
[info] restore | 2024-08-28 22:53:49.737 UTC [359] FATAL: could not locate required checkpoint record
[info] restore | 2024-08-28 22:53:49.737 UTC [359] HINT: If you are restoring from a backup, touch "/data/postgresql/recovery.signal" and add required recovery options.
[info] restore | If you are not restoring from a backup, try removing the file "/data/postgresql/backup_label".
[info] restore | Be careful: removing "/data/postgresql/backup_label" will result in a corrupt cluster if restoring from a backup.
and eventually the machine gets into a state where it just keeps logging things like
[info] failed post-init: failed to establish connection to local node...
Any pointers on what the issue could be would be greatly appreciated.
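My best guess is that the HINT wants something along these lines run on the new restore machine, but I haven’t tried it because I’m not sure it’s safe here (app name is a placeholder):
fly ssh console -a <restore-app> -C "touch /data/postgresql/recovery.signal"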
If you are upgrading from a Barman setup, be aware that destroying the Barman instance (which is required, as upgrading with it is not supported by the latest image) leaves a replication slot open on the primary instance. This causes the WAL files to grow, ignoring the max WAL size limit, until all free space on the primary is filled, putting it into read-only mode. This issue cost me several hours of downtime and a stressful Sunday.
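For anyone cleaning up after the same thing, this is roughly what I ran on the primary to find and drop the stale slot ('barman' here stands for whatever slot_name shows up as inactive for you):
fly postgres connect -a <your-db-app>
-- inside psql: list slots and how much WAL each one is still retaining
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
-- drop the inactive slot left behind by Barman (use the slot_name from the query above)
SELECT pg_drop_replication_slot('barman');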