Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=602642015-10-16T07:20:04ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/60264/diff?detail_id=58013">diff</a>)</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=602652015-10-16T07:21:43ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/60265/diff?detail_id=58014">diff</a>)</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=602662015-10-16T07:25:48ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/60266/diff?detail_id=58015">diff</a>)</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=602682015-10-16T08:21:14ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Subject</strong> changed from <i>new scrub and repair</i> to <i>scrub/repair: persist scrub results.</i></li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=608792015-10-30T17:52:44ZDavid Zafmandzafman@redhat.com
<ul></ul><p>There are some scrub errors which are not related to a specific object or involve multiple objects.</p>
<p>1. The pg_stat_t (object_stat_sum_t) contains stats for the pg as a whole. Needs to be fixed last.<br />2. A missing SnapSet in a head object requires rebuilding the SnapSet or removing all clones. Are the clones in error or the head object?<br />3. A corruption of the clone_overlap requires clone_size to be repaired first. We could use a hierarchy of inconsistencies.</p>
<p>For the first stage of this change, we should worry about object data and omap inconsistencies keeping in mind some of these more complex error types will be handled later. For pg_stat_t we could just have repair run after the last object is repaired.</p> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=614772015-11-12T15:00:11ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assignee</strong> set to <i>Kefu Chai</i></li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=618912015-11-20T15:02:05ZKefu Chaitchaikov@gmail.com
<ul></ul><blockquote>
<p>2. A missing SnapSet in a head object requires rebuilding the SnapSet or removing all clones. Are the clones in error or the head object?</p>
</blockquote>
<p>they will be in the error.</p> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=620232015-11-24T15:07:33ZDavid Zafmandzafman@redhat.com
<ul><li><strong>Target version</strong> set to <i>v10.0.4</i></li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=629102015-12-14T09:27:20ZLoïc Dacharyloic@dachary.org
<ul></ul><p>Draft implementation at <a class="external" href="https://github.com/ceph/ceph/pull/6898">https://github.com/ceph/ceph/pull/6898</a></p> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=631572015-12-17T06:43:51ZKefu Chaitchaikov@gmail.com
<ul></ul><p>I am wondering how we can fix snapset inconsistencies such as:</p>
<ol>
<li>snap missing in snapset,clone_overlap</li>
<li>snapset.clone_size mismatches with snapset.clone_overlap</li>
</ol>
<p>For the first problem, probably the simplest way is to remove the impacted snap. The second problem is caused either by a bug or by bitrot of the authoritative replica. In the case of bitrot, <a class="issue tracker-2 status-1 priority-4 priority-default" title="Feature: add checksum for the decode/encode (New)" href="https://tracker.ceph.com/issues/13509">#13509</a> would be helpful. Otherwise we can hardly tell which replica is the correct copy without some heuristic magic in <code>PGBackend::be_select_auth_object()</code>.</p>
<p>dzafman, I found that <code>ReplicatedPG::_scrub()</code> repeats the check for a missing/corrupted OI_ATTR already done by <code>PGBackend::be_select_auth_object()</code>. Is this on purpose?</p> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=631582015-12-17T06:51:12ZKefu Chaitchaikov@gmail.com
<ul></ul><p>Note to myself: in the last discussion with David, he advised that we should not overwrite the result of a deep scrub with that of a shallow one. Consider an OSD with a low workload, where the shallow scrub is performed once a day and the deep scrub once a week. The day after a deep scrub, the shallow scrub result overwrites the deep scrub result, so discrepancies that only a deep scrub can detect are overlooked:</p>
<ul>
<li>data_digest_mismatch</li>
<li>omap_digest_mismatch</li>
<li>read_error</li>
</ul>
<p>If the content of the object/omap in question is rewritten after the deep scrub and before we do the repair, the error very likely persists.</p>
<p>To implement this feature, we can keep two omap entries for each object: one for shallow errors and one for deep errors. The deep scrub can rewrite both of them, while the shallow scrub can only overwrite the former.</p> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=633712015-12-23T01:49:16ZDavid Zafmandzafman@redhat.com
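<p>The two-entry scheme proposed above (one entry for shallow errors, one for deep errors) could be sketched roughly as follows. This is only an illustration in Python, not Ceph code: <code>ScrubStore</code>, <code>record()</code>, and the <code>size_mismatch</code> error name are hypothetical, and plain dictionaries stand in for the omap entries.</p>

```python
from dataclasses import dataclass, field

# Error names listed in the comment above; only a deep scrub can detect them.
DEEP_ONLY_ERRORS = {"data_digest_mismatch", "omap_digest_mismatch", "read_error"}


@dataclass
class ScrubStore:
    """Hypothetical per-PG store that keeps two entries per object."""
    shallow: dict = field(default_factory=dict)  # object name -> set of shallow errors
    deep: dict = field(default_factory=dict)     # object name -> set of deep errors

    def record(self, obj, errors, deep_scrub):
        errors = set(errors)
        # Every scrub, shallow or deep, refreshes the shallow entry.
        self.shallow[obj] = errors - DEEP_ONLY_ERRORS
        if deep_scrub:
            # Only a deep scrub is allowed to rewrite the deep entry.
            self.deep[obj] = errors & DEEP_ONLY_ERRORS

    def errors(self, obj):
        # The reported result is the union of both entries.
        return self.shallow.get(obj, set()) | self.deep.get(obj, set())


store = ScrubStore()
# "size_mismatch" is an illustrative shallow error name (an assumption).
store.record("obj1", {"data_digest_mismatch", "size_mismatch"}, deep_scrub=True)
# A later shallow scrub finds nothing, but must not erase the deep finding.
store.record("obj1", set(), deep_scrub=False)
print(store.errors("obj1"))
```

<p>With this split, a daily shallow scrub can clear or update shallow errors without discarding what the last weekly deep scrub found.</p>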
<ul></ul><p>Kefu Chai wrote:</p>
<blockquote>
<p>dzafman, I found that <code>ReplicatedPG::_scrub()</code> repeats the check for a missing/corrupted OI_ATTR already done by <code>PGBackend::be_select_auth_object()</code>. Is this on purpose?</p>
</blockquote>
<p>It is true that <code>ReplicatedPG::_scrub()</code> is called with an authmap selected in PGBackend::be_select_auth_object() that is present and decodes. Since this code is moving into user mode, we don't need to fix it now. Those particular checks should have been asserts. When I fixed <code>_scrub()</code> I was obsessed with not letting a corruption cause an OSD to assert during scrubbing. But in this case it shouldn't be possible.</p> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=633872015-12-23T15:23:35ZKefu Chaitchaikov@gmail.com
<ul></ul><p>We should not return the scrub result while the scrub is still in progress. We can either:</p>
<ul>
<li>check the status of pg before serving the scrubls pg command, or</li>
<li>add a sentry scrub object at the end of scrub</li>
</ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=663022016-02-25T05:20:34ZKefu Chaitchaikov@gmail.com
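<p>The second option above, a sentry object written at the end of a scrub, might look something like this minimal Python sketch. The class, method names, and sentry key are hypothetical and only illustrate the idea; they are not the actual OSD interface.</p>

```python
SENTRY_KEY = "__scrub_complete__"  # hypothetical key name for the sentry object


class ScrubResults:
    """Hypothetical result store that is readable only after a scrub finishes."""

    def __init__(self):
        self._entries = {}

    def begin_scrub(self):
        # Starting a scrub drops the old results, including the old sentry.
        self._entries = {}

    def add(self, obj, errors):
        self._entries[obj] = list(errors)

    def end_scrub(self):
        # The sentry is written last, so its presence means the result is complete.
        self._entries[SENTRY_KEY] = []

    def list_inconsistent(self):
        # Refuse to serve partial results while the sentry is absent.
        if SENTRY_KEY not in self._entries:
            return None
        return {k: v for k, v in self._entries.items() if k != SENTRY_KEY}


results = ScrubResults()
results.begin_scrub()
results.add("obj1", ["read_error"])
print(results.list_inconsistent())  # scrub still running, nothing is served
results.end_scrub()
print(results.list_inconsistent())
```

<p>Compared with checking the PG status before serving the command, the sentry keeps the completeness check inside the stored result itself.</p>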
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-2 status-10 priority-4 priority-default closed" href="/issues/14860">Feature #14860</a>: scrub/repair: persist scrub results (do not overwrite deep scrub results with non-deep scrub)</i> added</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=663042016-02-25T05:21:16ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Copied to</strong> deleted (<i><a class="issue tracker-2 status-10 priority-4 priority-default closed" href="/issues/14860">Feature #14860</a>: scrub/repair: persist scrub results (do not overwrite deep scrub results with non-deep scrub)</i>)</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=663062016-02-25T05:21:22ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-10 priority-4 priority-default closed" href="/issues/14860">Feature #14860</a>: scrub/repair: persist scrub results (do not overwrite deep scrub results with non-deep scrub)</i> added</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=905562017-04-29T02:50:57ZKefu Chaitchaikov@gmail.com
<ul><li><strong>Assignee</strong> deleted (<del><i>Kefu Chai</i></del>)</li></ul> Ceph - Feature #13505: scrub/repair: persist scrub results.https://tracker.ceph.com/issues/13505?journal_id=1692992020-06-29T18:48:30ZPatrick Donnellypdonnell@redhat.com
<ul><li><strong>Target version</strong> deleted (<del><i>v10.0.4</i></del>)</li></ul><p>Unsetting old target version for open tickets.</p>