I just learned that my LiteFS Cloud cluster has not acquired a backup since April 12th.
I didn’t have any monitoring in place to catch this sooner.
Looking at the logs, I found this repeated over and over again on the primary:
2024-05-02T20:53:27.390 app[683d64dcde1778] ord [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="exiting streaming backup"
2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="backup stream failed, retrying: backup stream error (\"data.db\"): write backup tx: backup client error (409): prev page 1 not found @ 0000000000001665"
I have redeployed and tried to restart, but have not been able to fix this. I am concerned about removing the data.db file, as it would pull down the file from LiteFS Cloud, which seems to be weeks behind.
I don’t want to disrupt the business by moving the timeline 2 weeks back. How do I resolve this?
Looking through the other backup-related issues, this seems different, as it’s a 409 prev page 1 not found @ NUMBER.
I have tried to upgrade LiteFS to 0.5.11 and redeploy.
The strange thing is when I use SFTP to download the database I get all the latest data.
fly sftp get /pb_data/data.db ./pb_data/data.db
But if LiteFS Cloud does not show a backup since April 12th, then it is behind.
I don’t know how this happened.
Additionally, my website, which is in the LiteFS cluster, seems to get the latest information from the backend app that is the primary. So LiteFS seems to be working as expected, just not LiteFS Cloud?
Edit: the post below shows this is fixed now. LiteFS Cloud is unreachable this morning in the dashboard. It does not show on the status page, but it does return an "unable to connect to LiteFS Cloud" error.
I was able to use HyperDX to check how many times it tried to back up data.db and failed. It also seems that it has stopped spamming this connection issue completely, as log.db stopped backing up and has no messages whatsoever.
For data.db, it appears to have made 653,902 attempts in the last 30 days.
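For anyone who wants to run the same kind of count without a log tool, here is a rough Python sketch. It assumes the log format shown in the excerpt above; the function name `count_backup_failures` and the regular expression are my own illustration, not part of LiteFS or HyperDX.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt earlier in this thread.
SAMPLE_LOG = r'''
2024-05-02T20:53:27.390 app[683d64dcde1778] ord [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="exiting streaming backup"
2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="backup stream failed, retrying: backup stream error (\"data.db\"): write backup tx: backup client error (409): prev page 1 not found @ 0000000000001665"
'''

# Matches a failed backup attempt and captures the database name and the
# status code. The quotes around the database name are backslash-escaped in
# the log output, so the backslash is treated as optional.
FAILURE_RE = re.compile(
    r'backup stream failed, retrying: backup stream error '
    r'\(\\?"(?P<db>[^"\\]+)\\?"\)'
    r'.*?backup client error \((?P<status>\d+)\)'
)


def count_backup_failures(lines):
    """Count failed backup attempts per (database, status code) pair."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group("db"), int(match.group("status")))] += 1
    return counts


for (db, status), n in count_backup_failures(SAMPLE_LOG.splitlines()).items():
    print(f"{db}: {n} failed attempt(s) with status {status}")
```

Feeding it the full output of the app's logs instead of `SAMPLE_LOG` would give a per-database total, which could also be the basis for a simple alert.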
What happened was just an unfortunate mistake on my end. I deployed some dashboard code that updated some Phoenix LiveSessions for some token pages at 2024-05-06T10:40:00Z, and that also affected the LiteFS dashboard pages. The fix was shipped at 2024-05-06T10:49:00Z and then took a few minutes to propagate to our cluster.
I just looked, and I’m not super savvy on LiteFS, but since there are already 2 weeks of no backups, maybe a simple solution would be creating a new cluster. Wdyt?
In case the team needs to do any debugging to understand the issue. LiteFS Cloud should not fail. It’s a paid service. And if it does, ideally clear monitors and alerts would tell me, as a user, what steps I need to take.
LiteFS Cloud is treated as the source of truth. I am nervous that if I create a cluster and update my apps to point at it, I will cause more issues, like zeroing out the contents.
If I download a copy of the DB from an instance using SFTP, I am not clear whether I can upload it and replace the existing data.db in the existing cluster to resolve the issue. That seems easier than creating a new instance.
I am worried the DB itself is the issue, and that creating a new one (either pulling from the instance, or uploading the data.db) would just continue the problem.
This is now a production application running a business, so I want to make sure I am following best practices so I have as little interruption as possible, ideally none.
Hi @Zane_Milakovic, I’m sorry about the LiteFS Cloud backup issues. I did an initial investigation and I think I found the bug on our side. I’ve made a copy of the data on the service for debugging purposes, so you can replace the database with your current version if you want to get it working again.
When you grab a copy of the database, use litefs export to make a copy of the database to a temporary location and then download that over SFTP. Running SFTP directly on the database can copy a corrupted version of the database since it doesn’t prevent other transactions from updating the database while you’re copying it.
Also, once you’ve SFTP’d the database, run an integrity check on it to confirm that it’s in good condition:
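One way to run such a check is sketched below with Python's standard `sqlite3` module and SQLite's `PRAGMA integrity_check`. The helper name `integrity_check` is hypothetical, and the sketch assumes the downloaded file is an ordinary SQLite database on local disk.

```python
import sqlite3
from pathlib import Path


def integrity_check(db_path):
    """Return the rows reported by SQLite's PRAGMA integrity_check."""
    # Open the file read-only so the check cannot modify the downloaded copy.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

Calling `integrity_check("./pb_data/data.db")` on a healthy database returns `["ok"]`; any other rows describe the problems SQLite found.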
Should I use fly ssh console to run the export on the machine itself? Do I do this from the primary, or does it not matter?
Once I download the exported version with SFTP, do I use the LiteFS Cloud interface and just replace data.db, for example? Do I just upload it and tell it to replace data.db?
@Zane_Milakovic Sorry for the delay. I was able to get a fix deployed so this shouldn’t happen again. For your current cluster, I think the safest option is to create a new cluster and upload a copy of your database there and switch your app over to that. I could try to tweak the existing cluster on our side but it would risk resetting your data back to April 12th.
@benbjohnson I created a new cluster and deleted the old after redeploying with the new secret.
I updated the key in both my apps to point to the new cluster.
I see my data on my site, and it’s still working just fine. I have not tried yet to download data from the cluster itself, or explore it.
But a few things:
I timed out trying to create the cluster from the dashboard, though I was able to do it through the CLI.
When deleting the old cluster from the dashboard, it gave an error saying it was unable to delete it. It did disappear, though, and does not show up with fly litefs-cloud clusters list.
Backups still do not exist on the new cluster, almost an hour later.
In my litefs.yml files, I did not change the SHARED_APP_NAME of both apps, so they had the same key and consul url as the previous cluster. I don’t know if that was an issue. Maybe that needed to be done prior to applying the secret?