Suspended database

Hello, I’m impacted in the ORD region and Fly suspended my database but left my other nodes up on an account that’s run for 2+ years. Is there a way to get my db back up long enough to get a backup so I can move it?

I have a few other web services that could be suspended, stopped, or deleted to make room to get a backup, but I’m not sure that would do anything.

Hi… Do you mean a message on the dashboard indicating an underlying physical host failure? What does fly m list -a db-app-name currently show?

Here you go:

➜ fly m list -a lfk-db
1 machines have been retrieved from app lfk-db.
View them in the UI here

lfk-db
ID            	NAME               	STATE  	CHECKS	REGION	ROLE                      	IMAGE                        	IP ADDRESS                      	VOLUME              	CREATED             	LAST UPDATED        	PROCESS GROUP	SIZE
5683936f6e2d8e	wandering-frog-4311	stopped	0/3   	ord   	the machine hasn't started	flyio/postgres:14.6 (v0.0.34)	fdaa:0:7016:a7b:9adb:da70:4421:2	vol_gez1nvx789w4mxl7	2023-01-19T00:52:03Z	2025-03-17T15:25:14Z	app          	shared-cpu-1x:256MB

It was suspended by Fly over the weekend, fwiw. It wasn't manually suspended, in case that matters.

Try fly logs -a lfk-db first. If nothing looks like an actual error there, then fly m start -a lfk-db might be sufficient…

This times out after a few minutes:

➜ fly logs -a lfk-db

Also when I run:

➜ fly m start -a lfk-db
? Select machines: 5683936f6e2d8e wandering-frog-4311 (stopped, region ord, process group 'app')
Error: could not start machine 5683936f6e2d8e: failed to start VM 5683936f6e2d8e: failed_precondition: machine still active, refusing to start (Request ID: 01JPJCKVSJWSMN9PKBGZ6HDW1W-chi)

When I run fly logs -a lfk-db --debug --verbose, I see something that I'm not sure I have access to or control over:

2025-03-17T15:51:44Z app[5683936f6e2d8e] ord [info]keeper   | 2025-03-17T15:51:44.430Z	ERROR	cmd/keeper.go:811	error retrieving cluster data	{"error": "invalid character '\\x02' in string literal"}
2025-03-17T15:51:44Z app[5683936f6e2d8e] ord [info]keeper   | 2025-03-17T15:51:44.750Z	ERROR	cmd/keeper.go:1041	error retrieving cluster data	{"error": "invalid character '\\x02' in string literal"}
2025-03-17T15:51:44Z app[5683936f6e2d8e] ord [info]keeper   | 2025-03-17T15:51:44.829Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2025-03-17T15:51:44Z app[5683936f6e2d8e] ord [info]sentinel | 2025-03-17T15:51:44.909Z	ERROR	cmd/sentinel.go:1852	error retrieving cluster data	{"error": "invalid character '\\x02' in string literal"}
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]panic: error checking stolon status: cannot get cluster data: invalid character '\x02' in string literal
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]: exit status 1
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]goroutine 9 [running]:
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]main.main.func2(0xc0000c8000, 0xc00007ca00)
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]	/go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]created by main.main
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info]	/go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] INFO Main child exited normally with code: 2
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] WARN Reaped child process with pid: 654 and signal: SIGKILL, core dumped? false
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] WARN Reaped child process with pid: 652 and signal: SIGKILL, core dumped? false
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] WARN Reaped child process with pid: 656 and signal: SIGHUP, core dumped? false
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] WARN Reaped child process with pid: 694 and signal: SIGHUP, core dumped? false
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] INFO Starting clean up.
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] INFO Umounting /dev/vdc from /data
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2025-03-17T15:51:45Z app[5683936f6e2d8e] ord [info][    6.283912] reboot: Restarting system
2025-03-17T15:51:48Z app[5683936f6e2d8e] ord [info]2025-03-17T15:51:48.316768060 [01JPFTKYF7S8K42V8DZ57V9ND4:main] Running Firecracker v1.7.0

Not sure how helpful that is.

That does help… A quick search on the distinctive “invalid character” part yielded several earlier reports in the forum, one of which had a resolution:

I don’t know the exact details of what Fly.io did behind the scenes with that fix, though.

If worse comes to worst, you can try the volume-forking technique or restore from a snapshot.
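Something along these lines should work, assuming a single-node fork is enough. The volume ID below is the one from your fly m list output above; lfk-db-fork, lfk-db-restore, and the snapshot ID are just placeholder names:

fly postgres create --fork-from lfk-db:vol_gez1nvx789w4mxl7 --name lfk-db-fork --initial-cluster-size 1

Or, for the snapshot route, list the automatic snapshots on the volume and create a new cluster from one of them:

fly volumes snapshots list vol_gez1nvx789w4mxl7
fly postgres create --snapshot-id <snapshot-id> --name lfk-db-restore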

(Tweak --initial-cluster-size as desired, of course.)


I finally fixed it by forking the volume and creating a new database.

fly postgres create --fork-from lfk-db:vol_<volume_hash_here>

After this completed, I had to update my app so that it could see my new database:

fly secrets set DATABASE_URL=postgres://postgres:.../DATABASE_NAME

Fly will create the database, but you'll have to add your DATABASE_NAME to the DATABASE_URL value yourself.
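For anyone following along, the final value ends up looking roughly like this. The password, app hostname, and database name are placeholders (the real connection string is printed when fly postgres create finishes, and older apps use .internal instead of .flycast):

postgres://postgres:<password>@<new-db-app>.flycast:5432/<DATABASE_NAME>?sslmode=disable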

Thank you for the help.

For anyone else reading this, I highly recommend trying out the backup and snapshot sub-commands. It appears much has changed in the two years since I created this database, and I'm not sure I'll trust Fly with anything other than web nodes going forward.
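For example, with the volume ID from the fly m list output earlier, listing the existing automatic snapshots and taking an on-demand one look like this (the create sub-command needs a reasonably recent flyctl, if I remember right):

fly volumes snapshots list vol_gez1nvx789w4mxl7
fly volumes snapshots create vol_gez1nvx789w4mxl7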

We undoubtedly did a rather poor job of drawing your attention to the following page:

Here’s our plan to address this going forward:

Thank you for the update about the newer Managed Postgres service. It's not that I expected a Postgres instance I launched (this was one of your biggest features ~3 years ago when you launched it) to never break due to whatever backend changes you were making or capacity issues you were experiencing.

So while I understand that the newer Managed Postgres means something different because of automated backups, etc., it's still off-putting to hit an error that no one can fix without moving to a newer server.

To back up, it's not about the warning; it's about something breaking and there being no good path forward short of relaunching from a backup. This feels like a backend issue, not a low-traffic node issue.
