https://tracker.ceph.com/
2019-08-21T20:24:43Z
Ceph
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=143837
2019-08-21T20:24:43Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Assignee</strong> set to <i>Neha Ojha</i></li></ul>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=144759
2019-08-28T20:44:41Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>In Progress</i></li></ul>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=144895
2019-08-29T21:24:10Z
Neha Ojha
nojha@redhat.com
<p>Here's the chain of events that causes this:</p>
<p>Two objects go missing on the primary, and we want to recover them from the other copies on osd.3 and osd.4.</p>
<pre>
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean] rep_repair_primary_object 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] read got -11 / 65536 bytes from obj 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head
2019-08-21T17:06:40.141+0000 7f6213ac3700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head, will try copies on 3,4
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head peers osd.{1,4}
2019-08-21T17:06:40.141+0000 7f620fabb700 5 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] rep_repair_primary_object: Read error on 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, but already seen errors
2019-08-21T17:06:40.141+0000 7f620fabb700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+clean+repair m=1] read got -11 / 65536 bytes from obj 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head
2019-08-21T17:06:40.141+0000 7f620fabb700 -1 log_channel(cluster) log [ERR] : 2.a missing primary copy of 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head, will try copies on 3,4
</pre>
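<p>Roughly, what rep_repair_primary_object does on a primary read error is the following (an illustrative sketch, not the real PrimaryLogPG code; the function name is made up for clarity):</p>
<pre>
// Sketch: on a failed primary read during repair, mark the object
// missing locally and record the other copies as recovery sources.
void on_primary_read_error(const hobject_t &oid, eversion_t need)
{
  // the primary no longer has a usable copy
  pg_log.missing_add(oid, need, eversion_t());

  // "will try copies on 3,4": every peer known to still hold the object
  // (including strays such as osd.3) is recorded as a recovery source
  for (const auto &p : peer_info)   // map of pg_shard_t -> pg_info_t
    missing_loc.add_location(oid, p.first);
}
</pre>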
<p>We send a PGRemove to osd.3 in purge_strays.</p>
<pre>
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] purge_strays 3
2019-08-21T17:06:40.143+0000 7f6213ac3700 10 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovery_wait+degraded m=1 mbc={255={(1+1)=2}}] sending PGRemove to osd.3
</pre>
<p>In this process we remove osd.3 from peer_missing, but not from missing_loc.</p>
<p>See PeeringState::purge_strays() <a class="external" href="https://github.com/ceph/ceph/blob/master/src/osd/PeeringState.cc#L213-L214">https://github.com/ceph/ceph/blob/master/src/osd/PeeringState.cc#L213-L214</a></p>
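<p>Condensed, the relevant part of purge_strays() looks like this (paraphrased from the linked code, not verbatim):</p>
<pre>
// PeeringState::purge_strays(), paraphrased:
for (auto p = stray_set.begin(); p != stray_set.end(); ++p) {
  if (get_osdmap()->is_up(p->osd)) {
    // "sending PGRemove to osd.3"
    send_pg_remove(*p);  // illustrative stand-in for the message send
  }
  // forget the stray's missing set and info ...
  peer_missing.erase(*p);
  peer_info.erase(*p);
  peer_purged.insert(*p);
  // ... but nothing here erases *p from missing_loc, so osd.3 is still
  // listed as a location for the two unrecovered objects
}
stray_set.clear();
</pre>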
<p>Then we try to recover those two objects:</p>
<pre>
2019-08-21T17:06:40.147+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=2}}] pull 2:55e1c03f:::benchmark_data_smithi043_11387_object232:head v 19'224 on osds 3,4 from osd.4
</pre>
<p>This works fine because we recover it from osd.4, which is still present in peer_missing.</p>
<p>The next one fails:</p>
<pre>
2019-08-21T17:06:40.152+0000 7f6213ac3700 7 osd.1 pg_epoch: 55 pg[2.a( v 43'4288 (22'1288,43'4288] local-lis/les=54/55 n=268 ec=16/16 lis/c=54/54 les/c/f=55/55/0 sis=54) [1,4] r=0 lpr=54 crt=43'4288 mlcod 0'0 active+recovering rops=2 m=1 mbc={255={(1+1)=2}}] pull 2:5fac30a9:::benchmark_data_smithi043_11387_object19:head v 18'32 on osds 3,4 from osd.3
</pre>
<p>Here we try to pull from osd.3, which is no longer in peer_missing.</p>
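<p>At that point we hit the assert at osd/ReplicatedBackend.cc:1349. Schematically (a simplification based on the assert message, not the exact code):</p>
<pre>
// ReplicatedBackend::prepare_pull(), simplified:
// missing_loc still lists osd.3 as a location, so it can be chosen as
// the pull source even though purge_strays() erased it from peer_missing.
pg_shard_t fromshard = choose_source(missing_loc.get_locations(soid));  // picks osd.3
ceph_assert(peer_missing.count(fromshard));  // FAILED: osd.3 was erased
</pre>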
<p>I think the fix is to clear the corresponding missing_loc entries as well when we clear peer_missing in purge_strays().</p>
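<p>Something along these lines, i.e. drop the purged stray from missing_loc's source lists at the same time (an illustrative sketch only; the helper name is hypothetical):</p>
<pre>
// Sketch of the proposed fix inside PeeringState::purge_strays():
peer_missing.erase(*p);
peer_info.erase(*p);
peer_purged.insert(*p);
// also stop advertising the purged stray as a recovery source, so a
// later pull can never select an OSD that is absent from peer_missing
missing_loc.remove_stray_recovery_sources(*p);  // hypothetical helper
</pre>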
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=144971
2019-08-30T18:15:08Z
Neha Ojha
nojha@redhat.com
<p>I have been able to reproduce it here: <a class="external" href="http://pulpito.ceph.com/nojha-2019-08-28_19:12:09-rados:singleton-master-distro-basic-smithi/">http://pulpito.ceph.com/nojha-2019-08-28_19:12:09-rados:singleton-master-distro-basic-smithi/</a></p>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145500
2019-09-06T23:38:06Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/30119">https://github.com/ceph/ceph/pull/30119</a><br /><a class="external" href="https://github.com/ceph/ceph/pull/30059">https://github.com/ceph/ceph/pull/30059</a><br /><a class="external" href="https://github.com/ceph/ceph/pull/30226">https://github.com/ceph/ceph/pull/30226</a></p>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145655
2019-09-09T16:53:50Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>luminous,mimic,nautilus</i></li></ul>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145660
2019-09-09T19:39:28Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/41730">Backport #41730</a>: luminous: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))</i> added</li></ul>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145662
2019-09-09T19:39:36Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-6 priority-4 priority-default closed" href="/issues/41731">Backport #41731</a>: nautilus: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))</i> added</li></ul>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145664
2019-09-09T19:39:42Z
Nathan Cutler
ncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-6 priority-4 priority-default closed" href="/issues/41732">Backport #41732</a>: mimic: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))</i> added</li></ul>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145668
2019-09-09T19:41:02Z
Nathan Cutler
ncutler@suse.cz
<p>@Neha - backport all three PRs?</p>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=145694
2019-09-09T21:03:26Z
Neha Ojha
nojha@redhat.com
<p>Nathan Cutler wrote:</p>
<blockquote>
<p>@Neha - backport all three PRs?</p>
</blockquote>
<p>Yes. Note that the backport of <a class="external" href="https://github.com/ceph/ceph/pull/30059">https://github.com/ceph/ceph/pull/30059</a> should happen after <a class="external" href="https://github.com/ceph/ceph/pull/30271">https://github.com/ceph/ceph/pull/30271</a>.</p>
RADOS - Bug #41385: osd/ReplicatedBackend.cc: 1349: FAILED ceph_assert(peer_missing.count(fromshard))
https://tracker.ceph.com/issues/41385?journal_id=215079
2022-04-22T18:26:21Z
Neha Ojha
nojha@redhat.com
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>