scrub/repair: persist scrub results.
- write out a temp object as the scrub goes. the key is the object name, and the value describes what's wrong with the object:

  object name => whats_wrong: inconsistency_t

  inconsistency_t:
    most recent log version, prior version
    osd_id => shard_info_t

  shard_info_t:
  - exists
  - omap_sha1
  - data_sha1
  - size
  - xattrs -> useronly
    missing on clone -> snapset
  - object_info_t
  - data error?
  - metadata error?

- use pagination when querying the scrub result.
- should always pass the epoch of the beginning of the interval in the scrub APIs. if the epoch has passed, EAGAIN is returned.
1. dump the above metadata related to scrub/repair in the form of a temp object (it is already in the scrub map)
2. add simple pg command to dump it
3. add teuthology test accordingly
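A minimal sketch of the proposed record layout and of a paginated, epoch-checked query, written in Python for brevity rather than in the OSD's C++. The class names, the `query()` signature, the `start_after`/`max_return` parameters and the in-memory dict standing in for the temp object are all assumptions made for illustration; `object_info_t` and the snapset are left out. The epoch check assumes that "the epoch passes" means the caller's epoch no longer matches the start of the current interval.

```python
import errno
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ShardInfo:
    """One OSD's view of an object (loosely mirrors shard_info_t)."""
    exists: bool = True
    omap_sha1: Optional[str] = None
    data_sha1: Optional[str] = None
    size: Optional[int] = None
    xattrs: Dict[str, bytes] = field(default_factory=dict)  # user xattrs only
    data_error: bool = False
    metadata_error: bool = False


@dataclass
class Inconsistency:
    """What is wrong with one object (loosely mirrors inconsistency_t)."""
    version: int        # most recent log version
    prior_version: int
    shards: Dict[int, ShardInfo] = field(default_factory=dict)  # osd_id => shard


class ScrubResults:
    """Hypothetical stand-in for the temp object: object name => Inconsistency."""

    def __init__(self, interval_start_epoch: int) -> None:
        self.interval_start_epoch = interval_start_epoch
        self._entries: Dict[str, Inconsistency] = {}

    def record(self, object_name: str, what: Inconsistency) -> None:
        self._entries[object_name] = what

    def query(self, epoch: int, start_after: str = "", max_return: int = 2
              ) -> Tuple[List[Tuple[str, Inconsistency]], Optional[str]]:
        # A caller holding a stale interval epoch is told to retry.
        if epoch != self.interval_start_epoch:
            raise OSError(errno.EAGAIN, "interval changed, retry with the new epoch")
        names = sorted(n for n in self._entries if n > start_after)
        page = names[:max_return]
        # The marker is the last name returned, or None when nothing is left.
        marker = page[-1] if len(names) > max_return else None
        return [(n, self._entries[n]) for n in page], marker
```

A client would call `query()` repeatedly, passing the returned marker as `start_after`, until the marker comes back as `None`.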
#5 Updated by David Zafman over 3 years ago
There are some scrub errors which are not related to a specific object or involve multiple objects.
1. The pg_stat_t (object_stat_sum_t) contains stats for the pg as a whole. Needs to be fixed last.
2. A missing SnapSet in a head object requires rebuilding the SnapSet or removing all clones. Are the clones in error or the head object?
3. A corruption of the clone_overlap requires clone_size to be repaired first. We could use a hierarchy of inconsistencies.
For the first stage of this change, we should worry about object data and omap inconsistencies, keeping in mind that some of these more complex error types will be handled later. For pg_stat_t we could just have repair run after the last object is repaired.
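One way to picture the "hierarchy of inconsistencies" is as a dependency graph that gets sorted before repair. The sketch below is only an illustration in Python: the error-kind names and the dependency table are assumptions derived from the three cases listed above (clone_size before clone_overlap, pg_stat_t last), not existing Ceph identifiers.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical error kinds, each mapped to the kinds that must be repaired first.
REPAIR_DEPENDENCIES = {
    "object_data": set(),
    "omap": set(),
    "clone_size": set(),
    "clone_overlap": {"clone_size"},
    # PG-wide stats depend on every per-object repair, so they come last.
    "pg_stat": {"object_data", "omap", "clone_size", "clone_overlap"},
}


def repair_order(found):
    """Return the error kinds in `found` in an order that respects the hierarchy."""
    found = set(found)
    # Only keep dependencies on kinds that were actually found in this scrub.
    graph = {kind: REPAIR_DEPENDENCIES[kind] & found for kind in sorted(found)}
    return list(TopologicalSorter(graph).static_order())
```

For example, `repair_order(["pg_stat", "clone_overlap", "clone_size"])` places `clone_size` before `clone_overlap` and `pg_stat` at the end.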
#10 Updated by Kefu Chai over 3 years ago
wondering how we can fix snapset inconsistencies like
- snap missing in snapset.clone_overlap
- snapset.clone_size mismatches with snapset.clone_overlap
for the first problem, probably the simplest way is to remove the impacted snap. the second problem is caused either by a bug or by bitrot of the authoritative replica. if it's a case of bitrot, #13509 would be helpful. otherwise we can hardly tell which replica is the correct copy without using some heuristic magic in
dzafman, i found that ReplicatedPG::_scrub() is repeating the check for missing/corrupted OI_ATTR done by PGBackend::be_select_auth_object(), is this on purpose?
#11 Updated by Kefu Chai over 3 years ago
note to myself: in the last discussion with david, he advised that we should not overwrite the result of a deep scrub with that of a shallow one. consider an OSD with a low workload, where the shallow scrub is performed once a day while the deep scrub is performed once a week. so at the end of the week the deep scrub result overwrites the shallow scrub result. hence some of the discrepancies are overlooked.
if the content of the object/omap in question is rewritten after the deep scrub and before we do the repair, the error very likely persists.
to implement this feature, we can have two omap entries for each object. one for shallow errors, the other for deep errors. and the deep scrub can rewrite both of them, while the shallow scrub can only overwrite the former one.
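A rough sketch of the two-entries-per-object idea, in Python for brevity. The `ScrubErrorStore` class, the `(object name, kind)` key layout and the error strings are hypothetical; a dict stands in for the omap. The point is only the overwrite rule: a shallow scrub touches the shallow entry alone, while a deep scrub rewrites both.

```python
SHALLOW, DEEP = "shallow", "deep"


class ScrubErrorStore:
    """Keeps one shallow and one deep error entry per object."""

    def __init__(self):
        self._omap = {}  # (object_name, kind) -> list of error strings

    def record_scrub(self, object_name, deep, shallow_errors, deep_errors=()):
        # Every scrub refreshes the shallow entry.
        self._set(object_name, SHALLOW, shallow_errors)
        # Only a deep scrub may touch the deep entry.
        if deep:
            self._set(object_name, DEEP, deep_errors)

    def _set(self, object_name, kind, errors):
        key = (object_name, kind)
        if errors:
            self._omap[key] = list(errors)
        else:
            self._omap.pop(key, None)  # a clean result clears the entry

    def errors(self, object_name):
        return (self._omap.get((object_name, SHALLOW), [])
                + self._omap.get((object_name, DEEP), []))
```

With this layout, a clean shallow scrub that runs after a deep scrub found a digest mismatch leaves the deep entry in place, so the mismatch is still reported until the next deep scrub clears it.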
#12 Updated by David Zafman about 3 years ago
Kefu Chai wrote:
dzafman, i found that ReplicatedPG::_scrub() is repeating the check for missing/corrupted OI_ATTR done by PGBackend::be_select_auth_object(), is this on purpose?
It is true that ReplicatedPG::_scrub() is called with an authmap selected in PGBackend::be_select_auth_object() that is present and decodes. Since this code is moving into user mode, we don't need to fix it now. Those particular checks should have been asserts. When I fixed _scrub() I was obsessed with not letting a corruption cause an OSD to assert during scrubbing. But in this case it shouldn't be possible.