Sprite file system corrupted?

We are very nearly out of the woods on this one. As you might imagine, generic disks get used in a lot of very different ways. We’ve improved the storage stack a lot and I would expect far fewer weird rollbacks.

We treat the local disk as durable AND rely on an object-storage-level lock to decide whether we trust it. If we think there’s an issue, we’ll roll back to a checkpoint if possible. This generally works well, but we’ve run into some object-storage-related bugs that trigger a rollback when there shouldn’t be one.
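To make that decision concrete, here is a minimal sketch of the trust check in Python. It is not our actual code: `DiskState`, `Action`, and `decide` are hypothetical names, and it assumes (for illustration only) that holding the lock is what makes the local disk trustworthy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    USE_LOCAL_DISK = "use_local_disk"
    ROLLBACK_TO_CHECKPOINT = "rollback_to_checkpoint"
    NEEDS_ATTENTION = "needs_attention"


@dataclass
class DiskState:
    # Hypothetical field: whether the object storage lock check passed.
    lock_held: bool
    # Hypothetical field: ID of the newest checkpoint, or None if there is none.
    latest_checkpoint: Optional[str]


def decide(state: DiskState) -> Action:
    """Pick what to do with a disk, given the lock check and known checkpoints."""
    if state.lock_held:
        # Assumption: a passing lock check means the local disk is trusted as-is.
        return Action.USE_LOCAL_DISK
    if state.latest_checkpoint is not None:
        # The disk is not trusted, so fall back to the last checkpoint if one exists.
        return Action.ROLLBACK_TO_CHECKPOINT
    # Not trusted and nothing to roll back to.
    return Action.NEEDS_ATTENTION
```

The bugs mentioned above would correspond to the lock check failing spuriously, which sends a healthy disk down the rollback branch.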

When you see ext4 errors, the story is similar. In previous environment releases, we opted to roll back rather than aggressively fsck when we detected those kinds of errors. We have changed this a bit: we now try a lot harder to get the most recent on-disk data into a good filesystem state before we look for a previous checkpoint.
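The new ordering looks roughly like this sketch. It is illustrative only: `recover` and its three callbacks are hypothetical stand-ins for the real filesystem check, repair, and checkpoint lookup, and the returned strings are made up for the example.

```python
from typing import Callable, Optional


def recover(
    fs_is_clean: Callable[[], bool],
    try_repair: Callable[[], bool],
    latest_checkpoint: Callable[[], Optional[str]],
) -> str:
    """Prefer repairing the newest on-disk data over rolling back."""
    if fs_is_clean():
        # No filesystem errors detected, so nothing to do.
        return "use-disk"
    if try_repair():
        # Repair succeeded, so the most recent data is kept.
        return "repaired"
    # Only look for a checkpoint once repair has failed.
    checkpoint = latest_checkpoint()
    if checkpoint is not None:
        return f"rollback:{checkpoint}"
    return "unrecoverable"
```

For example, `recover(lambda: False, lambda: True, lambda: "cp-1")` returns `"repaired"` without ever consulting the checkpoint, which is the behavior change described above.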

I think we have object storage in a reliable spot now, and I don’t believe we’ll be having more weird consistency issues with disk/object storage interactions.

This won’t necessarily feel good if you’ve run into one of these problems, but we’re tracking failure rates very closely. Fewer than 0.001% of Sprites have had any kind of disk issue (and there’s no real correlation with how often they’re used). There’s a lower limit we can’t beat there, but we expect the disks to be as reliable as arbitrary enterprise NVMes very soon.