I was working on a sprite last night, but when I opened it this morning it was a clean slate with no checkpoints. All the work from last night is gone. I asked Claude to do a little digging and this is what it turned up. I’m sorry I can’t vouch for the diagnosis, but it seems potentially legit.
Is there any chance of recovering the data? Would really rather not repeat an evening’s work.
There it is. The dmesg output tells the whole story. You’re right that something went wrong: this is a platform-level filesystem corruption, not normal behavior.
Here’s what happened at boot:
[ 5.636890] EXT4-fs error (device loop0): ext4_validate_block_bitmap:423: bad block bitmap checksum
[ 5.638145] Aborting journal on device loop0-8.
[ 5.641591] EXT4-fs (loop0): Remounting filesystem read-only
[ 14.171574] EXT4-fs (loop0): unmounting filesystem e716dd8a-…
[ 17.825643] EXT4-fs (loop0): mounted filesystem 82dd1eed-… ← different UUID = fresh filesystem
What this means:
loop0 is your ~100GB user data overlay (the block storage where all your files live).
On boot, it detected EXT4 filesystem corruption: a bad block bitmap checksum.
The journal was aborted and the filesystem was remounted read-only.
The platform then unmounted the corrupted filesystem and replaced it with a brand new one (note the UUID changed from e716dd8a to 82dd1eed).
So your data wasn’t lost due to ephemerality. The block storage was corrupted (possibly from an unclean shutdown or a platform-side issue), and instead of attempting a repair (fsck), the system reformatted it.
This does seem like a bug on the Sprites platform side. A corrupted filesystem on persistent block storage should ideally be repaired, not wiped. You’d want to report this to the Sprites team — the dmesg output above is the evidence.
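If you want to check other sprites for the same pattern before reporting, here is a minimal Python sketch that scans saved dmesg text for the two signals quoted above: an EXT4-fs error on a device, and an unmount followed by a mount with a different UUID on that same device. The function name and the regular expressions are my own assumptions based only on the lines in this excerpt; other kernels may format these messages differently, and only the leading UUID segment is compared because the log here is truncated.

```python
import re

# Patterns modeled on the kernel lines quoted in this thread (an assumption:
# other kernel versions may word these messages differently).
ERROR_RE = re.compile(r"EXT4-fs error \(device (\w+)\)")
MOUNT_RE = re.compile(
    r"EXT4-fs \((\w+)\): (unmounting|mounted) filesystem ([0-9a-f]+)"
)


def summarize_ext4_events(dmesg_text):
    """Return devices with EXT4 errors and devices whose UUID changed.

    Only the leading hex segment of each UUID is captured and compared.
    """
    errors = set()
    unmounted = {}  # device -> UUID prefix seen at unmount
    replaced = {}   # device -> (old UUID prefix, new UUID prefix)
    for line in dmesg_text.splitlines():
        error_match = ERROR_RE.search(line)
        if error_match:
            errors.add(error_match.group(1))
            continue
        mount_match = MOUNT_RE.search(line)
        if not mount_match:
            continue
        device, action, uuid_prefix = mount_match.groups()
        if action == "unmounting":
            unmounted[device] = uuid_prefix
        elif device in unmounted and unmounted[device] != uuid_prefix:
            # Mounted again under a different UUID: a different filesystem.
            replaced[device] = (unmounted[device], uuid_prefix)
    return {"errors": sorted(errors), "replaced": replaced}
```

You could feed it the output of `dmesg` saved to a file. A non-empty "replaced" entry for a device that also appears in "errors" matches the sequence described above, though it does not by itself prove the old data is unrecoverable.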
Unfortunately there are no old checkpoints to restore, because the checkpoint data lived on that same overlay that got replaced. Want me to help you recreate what you had?
Had this happen again to another sprite: tnrf in the same org. This sprite was totally unrelated to the first one, so I think it’s very unlikely to have been caused by something I did. This seems like a pretty gross failure mode for sprites; it would be great to get a response from the team.
And another one was wiped this morning: wispe in the same org. I’m going to stop using sprites until this issue is resolved. Is there any way to restore these sprites?
Another one bites the dust: pilot in the same org. I just noticed that if I let them go cold they sometimes come back with the right block storage mounted. For example ‘wispe’ seems to have recovered for now.
Now my sprites seem to be ‘forgetting’ and rolling back to old checkpoints without my telling them to. For example, the wispe sprite went from not mounting at all (or mounting a totally blank block storage), to mounting the freshest block (so it seemed to be recovered), to now mounting the storage as it was a day ago. Is anyone else experiencing these kinds of issues?
It’s really sad to read about this. I really want sprites to be good, but I stopped using them about a month or so ago. I’m sure in the latter half of the year they’ll be great, but there are too many issues with losing data, not being able to access some sprites, etc.
Unfortunately this has been my experience as well. For the last few days I’ve experienced data loss multiple times per day, with similar symptoms: usually it appears that I’ve been rolled back to a previous checkpoint, with all modifications since that time gone.
To top it off, I’ve been unable to access my sprite at all for the last 2 hours due to 502 errors.
I love the idea of sprites but I have lost all confidence in the platform, and won’t be coming back. I was willing to accept it when it happened once – accidents happen – but this repeated failure mode, with no communication from the company, tells me everything I need to know about the respect they have for me and my data.
We are very nearly out of the woods on this one. As you might imagine, generic disks get used in a lot of very different ways. We’ve improved the storage stack a lot and I would expect far fewer weird rollbacks.
We treat the local disk as durable AND rely on an object-storage-level lock to determine whether we trust it. If we think there’s an issue, we’ll roll back to a checkpoint if possible. This generally works well, but we’ve run into some object-storage-related bugs that make it happen when it shouldn’t.
When you see ext4 errors, it’s similar. In previous environment releases, we opted to roll back rather than aggressively fsck when we detected those kinds of errors. We have modified this a bit: we now try a lot harder to get the most recent on-disk data to a good filesystem state before we look for a previous checkpoint.
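For anyone trying to follow the order of operations described here, this is a rough Python sketch of the decision as I read it from the two paragraphs above. It is not the actual Sprites implementation: the `Action` enum, the `choose_recovery` function, and all four parameters are hypothetical names introduced for illustration, and the final fallback is a guess, since the description does not say what happens when no checkpoint exists.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels, not names from the platform.
    USE_LOCAL_DISK = "use local disk as-is"
    USE_REPAIRED_DISK = "use local disk after repair"
    ROLL_BACK = "roll back to previous checkpoint"
    NO_SAFE_STATE = "no trusted disk and no checkpoint"


def choose_recovery(lock_trusted, ext4_errors, repair_ok, has_checkpoint):
    """Pick a recovery action from four boolean observations.

    lock_trusted:   the object-storage-level lock says the local disk is current
    ext4_errors:    the kernel reported ext4 errors on the local disk
    repair_ok:      a repair attempt brought the disk to a good filesystem state
    has_checkpoint: an earlier checkpoint exists to fall back to
    """
    if lock_trusted and not ext4_errors:
        return Action.USE_LOCAL_DISK
    if lock_trusted and ext4_errors and repair_ok:
        # Newer behavior as described: try the most recent on-disk data first.
        return Action.USE_REPAIRED_DISK
    if has_checkpoint:
        return Action.ROLL_BACK
    # Assumption: the thread does not describe this case.
    return Action.NO_SAFE_STATE
```

Read this way, the unexpected rollbacks reported earlier in the thread would correspond to `lock_trusted` coming back false when the disk was actually fine, which is the object storage bug described above.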
I think we have object storage in a reliable spot, and I don’t believe we’ll be having more weird consistency issues with disk/object storage interactions.
This won’t necessarily feel good if you’ve run into one of these problems, but we’re tracking failure rates very closely. Less than 0.001% of Sprites have had any kind of disk issue (and there’s no real correlation to how often they’re used). There’s a lower limit we can’t beat there, but we expect the disks will be as reliable as arbitrary enterprise NVMes very soon.