I got an email yesterday saying there was some emergency maintenance going on with one of my apps (the database, as it turns out). I figured “ok, fine, nothing to worry about”, but now, 24 hours later, it’s still saying this:
To make matters worse, some of the suggestions it gives (like cloning) won’t work as there is an active incident going on with my app, which is less than helpful.
What other options do I have here? I can’t just delete the database and recreate it.
From my experience it can take quite a long time. It might mean a hardware failure that needs physical access, with parts being fixed or replaced. That can take days in some cases.
Yes it is, it’s only a hobby project, but I have space for one more machine, so when it does eventually come back up, I’ll see if I can increase the instance count.
Ugh, not ideal, but I guess it’s a valid situation. It would be good to know that that’s the issue, though; at the very least Fly could provide an update telling me it’s a hardware issue.
They do volume snapshots once per day, so check if you can restore from the most recent snapshot. Fly recommends 3 instances for a prod setup, because if an SSD fails, the data written since the last snapshot is lost.
For a single-server setup you could trigger backups a few times per day, for example with a cron job that runs a backup script.
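Here’s a rough sketch of what that kind of backup script could look like, in Python. It just wraps pg_dump and writes a timestamped dump file. The DATABASE_URL and BACKUP_DIR environment variables, the "app" name and the function names are all my own assumptions for illustration, not anything Fly provides, and it assumes pg_dump is installed and the database is reachable from wherever the job runs.

```python
import datetime
import os
import subprocess
from pathlib import Path


def backup_path(backup_dir: Path, db_name: str, now: datetime.datetime) -> Path:
    """Return a timestamped dump path such as backups/app-20240102T030405Z.dump."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return backup_dir / f"{db_name}-{stamp}.dump"


def pg_dump_command(database_url: str, target: Path) -> list[str]:
    """Build the pg_dump invocation; custom-format dumps are restored with pg_restore."""
    return [
        "pg_dump",
        "--format=custom",
        f"--file={target}",
        f"--dbname={database_url}",
    ]


def run_backup(database_url: str, backup_dir: Path, db_name: str) -> Path:
    """Create the backup directory if needed, run pg_dump, and return the dump path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    now = datetime.datetime.now(datetime.timezone.utc)
    target = backup_path(backup_dir, db_name, now)
    # check=True raises CalledProcessError if pg_dump exits with a non-zero status.
    subprocess.run(pg_dump_command(database_url, target), check=True)
    return target


# Only runs when executed as a script and DATABASE_URL (assumed name) is set.
if __name__ == "__main__" and "DATABASE_URL" in os.environ:
    run_backup(
        os.environ["DATABASE_URL"],
        Path(os.environ.get("BACKUP_DIR", "backups")),
        "app",  # hypothetical database label used in the file name
    )
```

A crontab entry along the lines of `0 */6 * * * python3 /path/to/backup.py` (the path is a placeholder) would run it every six hours. You’d want the dumps to end up somewhere other than the database machine itself, otherwise they disappear along with it.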
Ah, I can see in the dashboard a “How to use” button on the snapshots with the commands on how to create a new database from that snapshot. I’ll give that a try, thanks for the pointer.
So I used the command to create a new postgres database using the snapshot, but the command to create ended up quitting with “context deadline exceeded”, and when I go to the new database in the dashboard, it’s only completed 1/3 checks, with the following error:
500 Internal Server Error failed to connect to repmgr node: failed to connect to `host=fdaa:2:7bf2:a7b:328:90fb:e9b7:2 user=repmgr database=repmgr`: server error (FATAL: database "repmgr" does not exist (SQLSTATE 3D000))
It also won’t let me run the fly postgres attach command; that fails with a “no active leader found” error, presumably because of the error above.
I’ve tried stopping the new machine and starting it back up again, but I’m getting the same 500 error, so now I’m stuck as to what to do next.