LiteFS Cloud Backup Failed

I just learned that my LiteFS Cloud cluster has not taken a backup since April 12th.

I don’t have any monitoring in place, so I had no way of knowing this sooner.

Looking at the logs, I found this being repeated over and over again on the primary.

2024-05-02T20:53:27.390 app[683d64dcde1778] ord [info] level=INFO msg="begin streaming backup" full-sync-interval=10s

2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="exiting streaming backup"

2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="backup stream failed, retrying: backup stream error (\"data.db\"): write backup tx: backup client error (409): prev page 1 not found @ 0000000000001665"

I have redeployed and tried restarting, and have not been able to fix this. I am concerned about removing the data.db file, as that would pull down the copy from LiteFS Cloud, which seems to be weeks behind.

I don’t want to disrupt the business by moving the timeline 2 weeks back. How do I resolve this?

Looking through the other backup-related issues, this one seems different, as it’s a 409 "prev page 1 not found @ NUMBER" error.
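
For anyone who wants to alert on this failure rather than discover it weeks later, here is a minimal sketch of how the log line above could be detected and broken into its parts. The regular expression is an assumption built only from the single line quoted in this post, and the names parse_backup_failure and SAMPLE are hypothetical; other LiteFS error messages may be formatted differently.

```python
import re

# Assumption: this pattern is derived only from the one log line quoted in
# this thread. It is not an official LiteFS log format specification.
FAILURE_RE = re.compile(
    r'backup stream failed, retrying: backup stream error \(\\?"(?P<db>[^"\\]+)\\?"\): '
    r'.*?backup client error \((?P<status>\d+)\): (?P<reason>.+?) @ (?P<position>[0-9a-fA-F]+)'
)

# The failing log line from this thread, kept verbatim (including the
# backslash-escaped quotes around the database name).
SAMPLE = r'2024-05-02T20:53:27.693 app[683d64dcde1778] ord [info] level=INFO msg="backup stream failed, retrying: backup stream error (\"data.db\"): write backup tx: backup client error (409): prev page 1 not found @ 0000000000001665"'


def parse_backup_failure(line):
    """Return the parts of a backup failure line, or None if it is not one."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return {
        "db": match.group("db"),
        "status": int(match.group("status")),
        "reason": match.group("reason"),
        "position": match.group("position"),
    }


if __name__ == "__main__":
    print(parse_backup_failure(SAMPLE))
```

A log drain or scheduled job could call parse_backup_failure on each line and raise an alert whenever it returns something other than None.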

I tried to restart the application.

I have tried to upgrade LiteFS to 0.5.11 and redeploy.

The strange thing is when I use SFTP to download the database I get all the latest data.

fly sftp get /pb_data/data.db ./pb_data/data.db

But if LiteFS Cloud does not show a backup since April 12th, then it is behind.

I don’t know how this happened.

Additionally, my website, which is in the LiteFS cluster, seems to get the latest information from the backend app that is the primary. So LiteFS seems to be working as expected, just not LiteFS Cloud?

I am now seeing similar behavior with logs.db as well. It has also stopped taking backups.

The last backups were April 12th for data.db and May 3rd for logs.db. It is unclear why LiteFS Cloud is failing, beyond the error message above.

The post below shows this is fixed now -
LiteFS Cloud is unreachable this morning in the dashboard. It does not show on the status page, but the dashboard does return an "unable to connect to LiteFS Cloud" error.

I was able to use HyperDX to count how many times it tried to back up data.db and failed. It also seems to have stopped spamming this error completely: logs.db stopped backing up and has no messages whatsoever.

For data.db, it appears to have made 653,902 failed attempts in the last 30 days.

652,429 of those were from this fly.app.instance - 683d64dcde1778
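
The same kind of count can be reproduced without a log platform. This is a rough sketch that tallies failure lines per instance from exported log text. It assumes every line carries the app[<instance id>] prefix seen in the logs above, and the function name count_failures_by_instance is hypothetical.

```python
import re
from collections import Counter

# Assumption: each log line contains "app[<hex instance id>]", as in the
# fly log lines quoted earlier in this thread.
INSTANCE_RE = re.compile(r"app\[(?P<instance>[0-9a-f]+)\]")


def count_failures_by_instance(lines, marker="backup stream failed"):
    """Count lines containing `marker`, grouped by the instance that logged them."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue  # not a backup failure line
        match = INSTANCE_RE.search(line)
        # Lines without a recognizable instance id are grouped under "unknown".
        counts[match.group("instance") if match else "unknown"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: fly logs output saved to a file, then piped in on stdin.
    for instance, total in count_failures_by_instance(sys.stdin).most_common():
        print(instance, total)
```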

Hey there! This is actually just a dashboard display issue. I’m shipping a fix for it right now.

Should be back online now!

What happened was just an unfortunate mistake on my end. At 2024-05-06T10:40:00Z I deployed some dashboard code that updated some Phoenix LiveSessions for some token pages, and that also affected the LiteFS dashboard pages. The fix was shipped at 2024-05-06T10:49:00Z and then took a few minutes to propagate to our cluster.

Sorry about that.

That is great to hear.

Any chance I can get eyes on any of the other LiteFS issues above? =)

No response to any support emails yet, so sorry to jump on you, Lubien. It has just been a few days.

np, things happen.

Just looked, and I’m not super savvy on LiteFS, but since there are already 2 weeks of no backups, maybe a simple solution would be creating a new cluster. Wdyt?

I’ll also raise this to the LiteFS team!

Thank you for raising it.

I am nervous to do that for a few reasons.

  1. In case the team needs to do any debugging to understand the issue. LiteFS Cloud should not fail. It’s a paid service. And if it does, ideally clear monitoring and alerts would tell me, as a user, what steps I need to take.

  2. LiteFS Cloud is treated as the source of truth. I am nervous that if I create a cluster and update my apps to point at it, I will cause more issues, like zeroing out the contents.

  3. If I download a copy of the DB from an instance using SFTP, I am not clear whether I can upload it and replace the existing data.db in the existing cluster to resolve the issues. That seems easier than creating a new instance.

  4. I am worried the DB itself is the issue, and that creating a new one (either pulling from the instance, or uploading the data.db) would just continue the problem.

  5. This is now a production application running a business, so I want to make sure I am following best practices so that I have as little interruption as possible, ideally none.

Hi @Zane_Milakovic, I’m sorry about the LiteFS Cloud backup issues. I did an initial investigation and I think I found the bug on our side. I’ve made a copy of the data on the service for debugging purposes, so you can replace the database with your current version if you want to get it working again.

When you grab a copy of the database, use litefs export to make a copy of the database to a temporary location and then download that over SFTP. Running SFTP directly on the database can copy a corrupted version of the database since it doesn’t prevent other transactions from updating the database while you’re copying it.

Also, once you’ve SFTP’d the database, run an integrity check on it to confirm that it’s in good condition:

$ sqlite3 /path/to/database "PRAGMA integrity_check"
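
If the sqlite3 command-line tool is not installed where the file was downloaded, the same check can be run from Python's standard library. This is only a sketch: the helper names integrity_check and is_healthy are hypothetical, and the path is whatever location the exported copy was saved to.

```python
import os
import sqlite3


def integrity_check(path):
    """Run PRAGMA integrity_check on the SQLite file at `path` and return its result rows."""
    # sqlite3.connect() silently creates a new empty database when the file is
    # missing, and an empty database would pass the check. Fail loudly instead.
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # A healthy database yields the single row "ok"; otherwise each row
    # describes one problem that was found.
    return [row[0] for row in rows]


def is_healthy(path):
    """True only when the integrity check reports exactly "ok"."""
    return integrity_check(path) == ["ok"]
```

Calling is_healthy on the downloaded file before importing it anywhere gives a simple yes/no answer, while integrity_check returns the detailed messages when something is wrong.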

Few quick questions -

  1. Should I use fly ssh console to run export on the machine itself? Do I do this from the primary or does it not matter?

  2. Once I download the exported version with SFTP, do I use the LiteFS Cloud interface and just replace data.db, for example? Do I just upload it and tell it to replace data.db?

It doesn’t matter which node you use, as long as the replica is connected to the primary, since it shouldn’t be too far behind.

Yes, you can upload and replace using the LiteFS Cloud interface. You can also use the flyctl CLI to import: fly litefs cloud import · Fly Docs

Ok. Failure. =)

Steps I took -

  1. fly ssh console
  2. litefs export --name data.db ./data.db
  3. exit console
  4. fly sftp get /data.db ./pb_data/data.db
  5. sqlite3 pb_data/data.db "PRAGMA integrity_check" (received OK)
  6. fly litefs-cloud import -c brickdropco -d data.db --input pb_data/data.db

Output -

Error: prev page 1 not found @ 0000000000001665 [ENOPREVPAGE]

Which mirrors the console issue I had in the logs.

I have not tried the web interface, do you suggest I try that?

I tried the GUI in the browser -

And this is the result -

Once again, I did a fresh export of the Database, downloaded with SFTP, and did the PRAGMA integrity check first…

@benbjohnson is there anything else you suggest I try?

@Zane_Milakovic Sorry for the delay. I was able to get a fix deployed so this shouldn’t happen again. For your current cluster, I think the safest option is to create a new cluster and upload a copy of your database there and switch your app over to that. I could try to tweak the existing cluster on our side but it would risk resetting your data back to April 12th.

Ok. Let me try that. Thank you.

@benbjohnson I created a new cluster and deleted the old one after redeploying with the new secret.

I updated the key in both my apps to point to the new cluster.

I see my data on my site, and it’s still working just fine. I have not tried yet to download data from the cluster itself, or explore it.

But a few things -

  1. I timed out trying to create the cluster from the dashboard, though I was able to do it through the CLI.
  2. When deleting the old cluster from the dashboard, it gave an error saying it was unable to delete it. It did disappear, though, and does not show up with fly litefs-cloud clusters list.
  3. Backups still do not exist on the new cluster, almost an hour later.

One thing to call out is this section of my litefs.yml files:

  consul:
    url: "${FLY_CONSUL_URL}"
    key: "litefs/${SHARED_APP_NAME}"

I did not change the SHARED_APP_NAME for either app, so they kept the same key and Consul URL they had with the previous cluster. I don’t know if that was an issue. Maybe that needed to be done prior to applying the secret?

Not sure where I went wrong…