Rebalancing can propagate a corrupt copy of a replicated object
On a cluster with 4 OSDs and a replicated pool with the replication count set to 3, I stored an object and found copies on osd0, osd1 and osd3.
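For reference, the setup looked roughly like this (a sketch only; "testpool" and "myobject" are placeholder names and the PG count is arbitrary):

    ceph osd pool create testpool 32
    ceph osd pool set testpool size 3          # replication count 3
    echo "test payload" > payload.bin
    rados -p testpool put myobject payload.bin
    ceph osd map testpool myobject             # prints the PG and acting set, e.g. [0,1,3]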
I manually changed the primary copy (on osd0) to simulate corruption; see the sketch after the list below.
osd0 - corrupt copy (primary)
osd1 - good copy
osd3 - good copy
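Something like the following works under FileStore, where each replica is an ordinary file under the OSD's data directory (a sketch; exact paths vary by cluster):

    systemctl stop ceph-osd@0                  # stop the OSD so the edit isn't racing live IO
    find /var/lib/ceph/osd/ceph-0/current -name '*myobject*'
    printf 'XXXX' | dd of=<path-found-above> bs=1 conv=notrunc   # overwrite bytes without truncating
    systemctl start ceph-osd@0

Overwriting the payload like this leaves the stored data digest stale, which is what scrub later flags.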
I then ran "ceph osd out 3", taking out one of the good replicas, and waited for Ceph to rebalance. Once recovery finished, I had copies on osd0, osd1 and osd2 as expected.
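Concretely (continuing the sketch above):

    ceph osd out 3
    ceph -s                          # wait for recovery; PGs return to active+clean
    ceph osd map testpool myobject   # acting set now shows [0,1,2]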
I had hoped that Ceph would choose a good copy as the canonical replica. Instead, it chose the corrupted primary copy and created the new copy from that, so I ended up with:
osd0 - corrupt copy
osd1 - good copy
osd2 - corrupt copy
osd3 - out
Now I have two corrupt copies when previously I had only one. If Ceph rebalances again before anyone notices the corruption and repairs it, I could well end up with 3 corrupt copies.
If I run a scrub and a repair, Ceph correctly identifies the corrupt copies (as shown by "data_digest_mismatch" in the output from "rados list-inconsistent-obj") and restores them from the single good copy. Rebalancing should perform a similar integrity check on each copy before choosing one as the source to replicate from.
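For completeness, the commands involved (the PG id 1.2f is a placeholder for whatever "ceph osd map" reports; a deep scrub is what actually reads the payload and verifies checksums):

    ceph pg deep-scrub 1.2f
    rados list-inconsistent-obj 1.2f --format=json-pretty   # shows data_digest_mismatch
    ceph pg repair 1.2f                                     # rewrites the bad copies from the good one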
#2 Updated by Mark Houghton about 4 years ago
Thanks. I thought BlueStore might fix or improve this, but I haven't found a way to test that, because I'm not sure how to simulate corrupting one copy of an object under BlueStore - I can't just edit the file when there's no filesystem underneath. Can you confirm what Ceph would do in this situation when using BlueStore?
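One idea I haven't tried yet: ceph-objectstore-tool can read and rewrite an object's bytes directly against the offline store, which should work regardless of backend (a sketch only; the JSON object spec is whatever "--op list" prints):

    systemctl stop ceph-osd@0
    # find the object's JSON spec in the offline store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list myobject
    # extract the bytes, mangle them, and write them back
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 '<json-spec>' get-bytes > obj.bin
    printf 'XXXX' | dd of=obj.bin bs=1 conv=notrunc
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 '<json-spec>' set-bytes obj.bin
    systemctl start ceph-osd@0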
Are there any tickets I can track for the new scrub tools you mentioned?