It’s possible that it was my fault somehow, but here’s what happened:
- I misread the error in the app as non-recoverable PG failure (it happened before) and restarted the cluster
- The data reverted to ~6 months ago
- I tried creating database from the snapshots of the original database’s volume and they all seem to be out of date (missing all data)
The app access was intermittent last couple of days, but immediately before a restart I did see the data still present.
Is there something I can do to find out what happened and why the snapshots seem to be the same?
Would you be able to provide us with the name of the app you’re attempting to restore into? A lot of times the logs will provide some clues as to what may be going on.
I see a bunch of
DROP commands in your logs for your
*_web DB. Try running:
fly status --all
fly logs -i <id> for the previous couple of IDs.
We don’t issue
DROP statements like that. Is it possible something like
rails db:reset got run? Some frameworks have commands that wipe entire DBs.
Hi, this is likely me trying to detach the old database (twotm8-web-db) and reattach the new one (twotm8-db) which I restored from a snapshot, assuming it has the data.
fly pg detach failed, and I could see in the logs it trying to issue
drop user twotm8_web; which were refused.
Also, it’s not a rails app - is my shitty handrolled stuff that has no support for migrations at all Let alone any DB resets…
What I find weird in general is the exact state of the database this was reverted to - it’s from a very old original schema, that I manually created in psql.
It’s even older than the most recent, ~6 months old schema that I also rolled out using psql, but the fields don’t match.
I’ve no idea what and how the database reverted to.
What are you looking at to see the data that changed? I don’t think I’ve ever heard of Postgres reverting data, and I’m having a hard time imagining how that would have happened.
I was frantically refreshing an app page (to see if the postgres restart worked, as it was intermittently returning 500 when hitting a bad process in a swarm), and at some point the data from the page disappeared.
I also got logged out and couldn’t log in - because the column for login in the database disappeared (verified by psql).
The schema in the database (that I check with psql) is just incorrect and matches a very old version that I had before.
As I have no facilities to revert or even run migrations without manual intervention, I don’t understand how the schema could have changed after I just restarted the PG cluster.
Is it possible that something is wrong with the mounted volume instead?
I also don’t think PG can revert data, but if during restart a different volume got mounted, or if snapshots don’t work as intended, PG would just read the data on the volume, right?
Yeah, that’s possible. I checked, though, and your DB has never had more than the one volume.
What table is missing? One thing that could have happened is those
DROP commands might’ve gone through when connections closed at shutdown time. But I’d still expect data to be on snapshots if that was the case.
I’m kind of stumped!
I only started issuing drop commands (indirectly, through failed
fly pg detach) after
- I realised that data is missing (actual rows in the
twots table, all of them are gone apart from one, which was part of the initial schema setup months ago)
- I restarted the PG cluster and realised data was gone and attempted to restore from a snapshot.
thought_leaders for example has
nickname column missing, but other fields present. Nickname was added as part of a schema change months and months back…
I also got very surprised by snapshots containing same outdated schema - I went back only two snapshots, should I try and restore more and see if they have different data?
It’s kinda worrying if all of them only have the original, old data…
I went ahead and inspected a snapshot of yours from September 11th that seemed to have all of its data intact. I double checked the schema and the
nickname column was present. Would you mind giving that snapshot a try and let us know how it goes?
I think I’m doing something wrong First I restored the snapshot:
~/p/twotm8 (main) > fly pg create --snapshot-id vs_DKkNjYM9z8lnnH9X
? Choose an app name (leave blank to generate one): twotm8-db-old-snapshot
Then I tried to check the schema:
~/p/twotm8 (main) > fly pg connect -a twotm8-db-old-snapshot
Connecting to twotm8-db-old-snapshot.internal... complete
psql (14.4 (Debian 14.4-1.pgdg110+1))
Type "help" for help.
postgres=# select * from thought_leaders limit 0;
thought_leader_id | salted_hash
Which snapshot ID did you check? Am I even restoring it correctly?
Oh wait! It’s in a different database.
Sorry folks, this was quite an idiotic journey I took you on.
Let me try and figure out why a
DATABASE_URL secret unchanged from
2022-02-26T18:38:25Z suddenly points to perhaps wrong database?..
It seems to be something else entirely.
Sorry for the churn - the one component that failed miserably was indeed my own brain.
This is really the best possible outcome.