Postgres restore from snapshot not working

Restoring a snapshot to a new Postgres app seems to not be working - I’ve tried the following 3 times now, all with the same results.

The command I’m using:

fly postgres create --snapshot-id vs_abcd1234 --image-ref flyio/postgres:14.6

The command creates the app and waits for the machine to start. In the Fly UI I can see that the new volume has been created and is in the “Restoring” state.

After some time (the snapshot holds ~350GB of data):

  • the CLI exits
  • the Fly UI shows the app as “suspended”, having no machines
  • the volume that was “Restoring” is changed to state “Created”
  • the volume has no data (0B of 500GB)

Has anyone run into this, or had success restoring a large volume?

Hi… There was an API outage for half an hour today, so you may have intersected with that in the machine-creation phase…

We were alerted to elevated error rates for machine creates and updates.

Also, the “0B of 500GB” is (weirdly) normal, due to a UI quirk.
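
If you want a second opinion beyond the dashboard, the CLI is usually more reliable for this; something like the following (the app name is a placeholder for yours) will show each volume’s state, size, and attached machine:

fly volumes list -a <new-postgres-app>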

Hope this helps a little!

Thanks for the response! I don’t believe this was related to the API issues, as the behaviour has been consistent every time I’ve tried.

I’ve tried this one more time and ended up in the same state: a volume showing 0B of 500GB, and an app that is suspended with no machines.

I’ve tried adding a machine to the Postgres app, but it failed to start. I suspect I don’t have all of the machine’s parameters correct, since normally fly postgres create creates the initial machine (and that machine is then cloned to scale horizontally).

What are the correct fly machine create ... arguments to add a machine to a Postgres cluster whose initial machine failed to be created when the app was created from a snapshot (as described above)?

This is the command that I tried:

fly machine create flyio/postgres:14.6 \
  --app dewrangle-postgres-production-v2-snapshot-dec-6 \
  --region ord \
  --vm-cpu-kind shared \
  --vm-cpus 4 \
  --vm-memory 8192 \
  -v vol_4qppx2ep6o6o0l6v:/

And here are the logs:

$ fly logs -a my-postgres-app-restored-from-snapshot
Waiting for logs...
2024-12-09T14:59:23.530 app[e825d3ef223408] ord [info] checking stolon status
2024-12-09T14:59:23.619 app[e825d3ef223408] ord [info] proxy | Running...
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | [NOTICE] 343/145923 (429) : haproxy version is 2.2.9-2+deb11u5
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | [NOTICE] 343/145923 (429) : path to executable is /usr/sbin/haproxy
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145923 (429) : parsing [/fly/haproxy.cfg:37] : Can't create DNS resolution for server '(null)'
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145923 (429) : Error(s) found in configuration file : /fly/haproxy.cfg
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145923 (429) : Fatal errors found in configuration.
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | exit status 1
2024-12-09T14:59:23.625 app[e825d3ef223408] ord [info] proxy | restarting in 1s [attempt 2]
2024-12-09T14:59:23.690 app[e825d3ef223408] ord [info] keeper is healthy, db is healthy, role: master
2024-12-09T14:59:24.627 app[e825d3ef223408] ord [info] proxy | Running...
2024-12-09T14:59:24.632 app[e825d3ef223408] ord [info] proxy | [NOTICE] 343/145924 (431) : haproxy version is 2.2.9-2+deb11u5
2024-12-09T14:59:24.632 app[e825d3ef223408] ord [info] proxy | [NOTICE] 343/145924 (431) : path to executable is /usr/sbin/haproxy
2024-12-09T14:59:24.632 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145924 (431) : parsing [/fly/haproxy.cfg:37] : Can't create DNS resolution for server '(null)'
2024-12-09T14:59:24.632 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145924 (431) : Error(s) found in configuration file : /fly/haproxy.cfg
2024-12-09T14:59:24.632 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145924 (431) : Fatal errors found in configuration.
2024-12-09T14:59:24.633 app[e825d3ef223408] ord [info] proxy | exit status 1
2024-12-09T14:59:24.633 app[e825d3ef223408] ord [info] proxy | restarting in 1s [attempt 3]
2024-12-09T14:59:25.634 app[e825d3ef223408] ord [info] proxy | Running...
2024-12-09T14:59:25.639 app[e825d3ef223408] ord [info] proxy | [NOTICE] 343/145925 (433) : haproxy version is 2.2.9-2+deb11u5
2024-12-09T14:59:25.639 app[e825d3ef223408] ord [info] proxy | [NOTICE] 343/145925 (433) : path to executable is /usr/sbin/haproxy
2024-12-09T14:59:25.639 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145925 (433) : parsing [/fly/haproxy.cfg:37] : Can't create DNS resolution for server '(null)'
2024-12-09T14:59:25.639 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145925 (433) : Error(s) found in configuration file : /fly/haproxy.cfg
2024-12-09T14:59:25.639 app[e825d3ef223408] ord [info] proxy | [ALERT] 343/145925 (433) : Fatal errors found in configuration.
2024-12-09T14:59:25.640 app[e825d3ef223408] ord [info] proxy | exit status 1

Also, the volume is now showing 528 MB of 500 GB used. It should have ~330 GB of data on it. I’m not sure if that’s because the snapshot wasn’t restored correctly, or because the machine was misconfigured / failed to start.

Hm… Mounting at root (/) never works, although I don’t know whether that was the sole or even main problem…

(These Fly Postgres clusters really have a lot of moving parts.)

I think I would try the simpler fly m run from the thread above, just so you can poke around a little and verify that this does in fact look like a Postgres volume inside.
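
Something along these lines, for instance (a rough sketch; the plain ubuntu image is only there so you can look around, sleep inf just keeps the machine alive, and the app name and volume ID are placeholders for yours):

fly machine run \
  --app <new-postgres-app> \
  --region ord \
  --volume <restored-volume-id>:/data \
  ubuntu sleep inf

fly ssh console -a <new-postgres-app> -C "ls -lah /data"

If the restore actually worked, you’d expect to see a few hundred GB of Postgres data files in there rather than an empty filesystem.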

(The 528 MB is not reassuring—but may be a mirage caused by the mount conflict.)

If that does look OK, then perhaps the volume-forking approach (with an explicit volume ID) will succeed where snapshots failed… :thought_balloon:

Thanks for the help! I’ve verified that we can fork the running Postgres instance, but still haven’t been able to get a snapshot restore to work.

I’ve tried restoring a snapshot to a new volume in the same app (rather than a new app), and then cloned the app’s current machine with the restored volume mounted at /var/lib/postgresql/data.

The restored volume does show the correct size, but the cloned machine fails to start. Any idea why this procedure is failing?
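
For reference, this is roughly what I ran (the volume name, snapshot ID, machine ID, and size are placeholders/approximations here):

fly volumes create pg_data_restored \
  --app <existing-postgres-app> \
  --region ord \
  --snapshot-id <snapshot-id> \
  --size 500

fly machine clone <existing-machine-id> \
  --app <existing-postgres-app> \
  --region ord \
  --attach-volume <restored-volume-id>:/var/lib/postgresql/data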

Full logs from a restart of the cloned machine:

2024-12-10T21:49:03.237 app[2867511a9ed328] ord [info] INFO Sending signal SIGINT to main child process w/ PID 359
2024-12-10T21:49:03.237 app[2867511a9ed328] ord [info] Got interrupt, stopping
2024-12-10T21:49:03.237 app[2867511a9ed328] ord [info] supervisor stopping
2024-12-10T21:49:03.237 app[2867511a9ed328] ord [info] exporter | Stopping interrupt...
2024-12-10T21:49:03.237 app[2867511a9ed328] ord [info] keeper | Stopping interrupt...
2024-12-10T21:49:03.237 app[2867511a9ed328] ord [info] proxy | Stopping interrupt...
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] sentinel | Stopping interrupt...
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214903 (377) : Exiting Master process...
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] proxy | [NOTICE] 344/214903 (377) : haproxy version is 2.2.9-2+deb11u5
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] proxy | [NOTICE] 344/214903 (377) : path to executable is /usr/sbin/haproxy
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] proxy | [ALERT] 344/214903 (377) : Current worker #1 (407) exited with code 130 (Interrupt)
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214903 (377) : All workers exited. Exiting... (130)
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] keeper | 2024-12-10T21:49:03.238Z ERROR cmd/keeper.go:1364 failed to resync from followed instance {"error": "sync error: signal: interrupt"}
2024-12-10T21:49:03.238 app[2867511a9ed328] ord [info] exporter | signal: interrupt
2024-12-10T21:49:03.239 app[2867511a9ed328] ord [info] proxy | exit status 130
2024-12-10T21:49:03.240 app[2867511a9ed328] ord [info] sentinel | Process exited 0
2024-12-10T21:49:03.242 app[2867511a9ed328] ord [info] keeper | Process exited 0
2024-12-10T21:49:03.803 app[2867511a9ed328] ord [info] INFO Main child exited normally with code: 0
2024-12-10T21:49:03.819 app[2867511a9ed328] ord [info] INFO Starting clean up.
2024-12-10T21:49:03.820 app[2867511a9ed328] ord [info] INFO Umounting /dev/vdc from /var/lib/postgresql/data
2024-12-10T21:49:03.823 app[2867511a9ed328] ord [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-12-10T21:49:03.824 app[2867511a9ed328] ord [info] [ 579.603836] reboot: Restarting system
2024-12-10T21:49:09.398 health[2867511a9ed328] ord [error] Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.
2024-12-10T21:49:17.325 app[2867511a9ed328] ord [info] 2024-12-10T21:49:17.325375430 [01JES7DJJG0TYFKH7YZ43BHSWW:main] Running Firecracker v1.7.0
2024-12-10T21:49:18.155 app[2867511a9ed328] ord [info] INFO Starting init (commit: 0a21fde6)...
2024-12-10T21:49:18.290 app[2867511a9ed328] ord [info] INFO Checking filesystem on /var/lib/postgresql/data
2024-12-10T21:49:18.297 app[2867511a9ed328] ord [info] /dev/vdc: clean, 12483/32768000 files, 82924761/131067904 blocks
2024-12-10T21:49:18.298 app[2867511a9ed328] ord [info] INFO Mounting /dev/vdc at /var/lib/postgresql/data w/ uid: 0, gid: 0 and chmod 0755
2024-12-10T21:49:18.307 app[2867511a9ed328] ord [info] INFO Resized /var/lib/postgresql/data to 536854134784 bytes
2024-12-10T21:49:18.313 app[2867511a9ed328] ord [info] INFO Preparing to run: `docker-entrypoint.sh start` as root
2024-12-10T21:49:18.314 app[2867511a9ed328] ord [info] INFO [fly api proxy] listening at /.fly/api
2024-12-10T21:49:18.338 runner[2867511a9ed328] ord [info] Machine started in 1.117s
2024-12-10T21:49:18.369 app[2867511a9ed328] ord [info] cluster spec filename /fly/cluster-spec.json
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] system memory: 8192mb vcpu count: 4
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] {
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "initMode": "new",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "existingConfig": {},
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "pgParameters": {
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "archive_mode": "on",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "archive_timeout": "60",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "effective_cache_size": "6144MB",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "effective_io_concurrency": "200",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "maintenance_work_mem": "409MB",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "max_connections": "300",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "max_parallel_workers": "8",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "max_parallel_workers_per_gather": "2",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "max_wal_size": "796MB",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "max_worker_processes": "8",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "min_wal_size": "199MB",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "random_page_cost": "1.1",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "shared_buffers": "2048MB",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "wal_compression": "on",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "wal_keep_size": "0",
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "work_mem": "128MB"
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] },
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "maxStandbysPerSender": 50,
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] "deadKeeperRemovalInterval": "1h"
2024-12-10T21:49:18.370 app[2867511a9ed328] ord [info] }
2024-12-10T21:49:18.371 app[2867511a9ed328] ord [info] generated new config
2024-12-10T21:49:18.372 app[2867511a9ed328] ord [info] sentinel | Running...
2024-12-10T21:49:18.372 app[2867511a9ed328] ord [info] keeper | Running...
2024-12-10T21:49:18.373 app[2867511a9ed328] ord [info] exporter | Running...
2024-12-10T21:49:18.373 app[2867511a9ed328] ord [info] proxy | Running...
2024-12-10T21:49:18.474 app[2867511a9ed328] ord [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2024-12-10T21:49:18.483 app[2867511a9ed328] ord [info] 2024/12/10 21:49:18 INFO SSH listening listen_address=[fdaa:0:51de:a7b:2a9:4579:69ca:2]:22 dns_server=[fdaa::3]:53
2024-12-10T21:49:18.494 app[2867511a9ed328] ord [info] proxy | [NOTICE] 344/214918 (383) : New worker #1 (409) forked
2024-12-10T21:49:18.497 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214918 (409) : bk_db/pg1 changed its IP from (none) to fdaa:0:51de:a7b:2a9:4579:69ca:2 by flydns/dns1.
2024-12-10T21:49:18.497 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214918 (409) : Server bk_db/pg1 ('ord.<appnameremoved>.internal') is UP/READY (resolves again).
2024-12-10T21:49:18.497 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214918 (409) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-12-10T21:49:18.920 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:18.920Z ERROR cmd/sentinel.go:1018 no eligible masters
2024-12-10T21:49:19.372 app[2867511a9ed328] ord [info] checking stolon status
2024-12-10T21:49:20.372 app[2867511a9ed328] ord [info] checking stolon status
2024-12-10T21:49:21.372 app[2867511a9ed328] ord [info] checking stolon status
2024-12-10T21:49:22.373 app[2867511a9ed328] ord [info] checking stolon status
2024-12-10T21:49:23.372 app[2867511a9ed328] ord [info] checking stolon status
2024-12-10T21:49:24.148 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:24.147Z WARN cmd/sentinel.go:287 received db state for unexpected db uid {"receivedDB": "", "db": "3a41e2b1", "keeper": "2a9457969ca2"}
2024-12-10T21:49:24.148 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:24.147Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8f85aefd", "keeper": "111e5b77ebc2"}
2024-12-10T21:49:24.372 app[2867511a9ed328] ord [info] checking stolon status
2024-12-10T21:49:24.468 app[2867511a9ed328] ord [info] keeper is healthy, db is healthy, role: standby
2024-12-10T21:49:25.307 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214925 (409) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-12-10T21:49:25.489 app[2867511a9ed328] ord [info] proxy | [WARNING] 344/214925 (409) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-12-10T21:49:25.489 app[2867511a9ed328] ord [info] proxy | [NOTICE] 344/214925 (409) : haproxy version is 2.2.9-2+deb11u5
2024-12-10T21:49:25.489 app[2867511a9ed328] ord [info] proxy | [NOTICE] 344/214925 (409) : path to executable is /usr/sbin/haproxy
2024-12-10T21:49:25.489 app[2867511a9ed328] ord [info] proxy | [ALERT] 344/214925 (409) : backend 'bk_db' has no server available!
2024-12-10T21:49:29.396 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:29.395Z WARN cmd/sentinel.go:287 received db state for unexpected db uid {"receivedDB": "", "db": "3a41e2b1", "keeper": "2a9457969ca2"}
2024-12-10T21:49:29.396 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:29.395Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8f85aefd", "keeper": "111e5b77ebc2"}
2024-12-10T21:49:29.416 health[2867511a9ed328] ord [info] Health check for your postgres vm is now passing.
2024-12-10T21:49:34.611 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:34.611Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8f85aefd", "keeper": "111e5b77ebc2"}
2024-12-10T21:49:39.848 app[2867511a9ed328] ord [info] sentinel | 2024-12-10T21:49:39.848Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8f85aefd", "keeper": "111e5b77ebc2"}
2024-12-10T21:49:41.922 app[2867511a9ed328] ord [info] exporter | INFO[0023] Established new database connection to "fdaa:0:51de:a7b:2a9:4579:69ca:2:5433". source="postgres_exporter.go:970"
2024-12-10T21:49:42.922 app[2867511a9ed328] ord [info] exporter | ERRO[0024] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:51de:a7b:2a9:4579:69ca:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:51de:a7b:2a9:4579:69ca:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

Hm… I made a small v14.6 instance of my own for testing, and it looks like it actually wants this mounted at /data (without a /var/lib/ prefix).
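
So if you do retry the hand-built machine route, the clone would presumably want the mount to look more like this (machine and volume IDs are placeholders; untested on my end):

fly machine clone <existing-machine-id> \
  --app <postgres-app> \
  --region ord \
  --attach-volume <restored-volume-id>:/data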

For snapshots, I think the workaround for now is to first restore the snapshot into a volume that has no attached machine, and then use the fly pg create --fork-from old-app:vol_id --name new-app --stolon variant.

I just tried that with a dead old-app that had only the volume† and no machines whatsoever, and it did successfully create a functional, running new-app.

Without the explicit vol_id, it will complain about not being able to find the primary.
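
Putting that together, the rough shape of the whole sequence would be something like this (app names, region, size, and IDs are all placeholders, the size needs to be at least as big as the snapshot’s source volume, and flyctl may still prompt for VM size and the like):

# restore the snapshot into a fresh volume on an app with no machines
fly volumes create pg_data \
  --app <old-app> \
  --region ord \
  --snapshot-id <snapshot-id> \
  --size 500

# grab the new volume's ID once it's done restoring
fly volumes list -a <old-app>

# fork a brand-new Postgres app from that specific volume
fly postgres create \
  --name <new-app> \
  --region ord \
  --fork-from <old-app>:<restored-volume-id> \
  --stolon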

I think that what may have been happening before is that you were getting hit with a lease expiration or similar while the (long) snapshot restoration was in progress. The unmanaged Fly Postgres line is in a bit of a limbo state right now, since there’s a fully managed alternative in development (based on Kubernetes). Really large restores might not be getting a lot of ongoing testing.

It probably is possible to get this to work by creating machines by hand, but personally I think I would exhaust the --fork-from possibilities first (particularly for a production server).


Aside: The --stolon flag really is necessary, if there are no existing machines for it to base guesses on.


†Admittedly, I didn’t actually do the snapshotting steps of the experiment, but that likely wouldn’t change anything.

I’ve just followed this procedure with a volume restored from a snapshot yesterday, and had a working postgres app within minutes.

Nice to know that there’s a managed solution coming down the pike too!

Thank you so much for the help :pray:
