Ceph RADOS - Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops
https://tracker.ceph.com/issues/22123

<p><strong>Updated by Sage Weil on 2017-11-14T14:46:52Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=102222">journal 102222</a>)</p>
<p>enqueue_front<br /> client.4264.0:7826<br /> client.4264.0:7825<br /> client.4264.0:7824<br /> client.4264.0:7823<br />enqueue_front shuffle<br /> client.4264.0:7809 with client.4264.0:7824 (!!! this should be 7823 !!!)<br /> client.4264.0:7808 with client.4264.0:7823<br />...</p>

<p><strong>Updated by Sage Weil on 2017-11-15T14:30:57Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=102267">journal 102267</a>)</p>
<ul><li><strong>Subject</strong> changed from <i>osd: misordered ops after enqueue_front shuffling</i> to <i>osd: objecter sends out of sync with pg epochs for proxied ops</i></li><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>High</i></li></ul><p>Okay, the shuffling is actually fine.</p>
<p>Real problem appears to be that objecter osdmap updates race with pg map updates, and the requeueing of proxied ops can race with the objecter actually sending those ops:</p>
<p>- start proxy op A (epoch E)<br />- start proxy op B (epoch E)<br />- start proxy op C (epoch E)<br />- objecter sends none of these because target is down in epoch E<br />- osdmap update to E+1<br />- pg cancels requeues A, B<br />- objecter updates to E+1<br />- objecter sends C<br />- pg cancels/requeues C</p>
<p>actually, it's not even that, it's harder:</p>
<p>- pg cancels A, B<br />- objecter receives backoff release, sends C<br />- pg cancels C</p>
<pre>
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.817 7fbba53e6700 1 -- 172.21.15.1:0/14841 <== osd.3 172.21.15.144:6812/27580 2 ==== osd_backoff(2.0 unblock id 1 [2:00000000::::head,2:40000000::::head) e134) v1 ==== 115+0+0 (1476801449 0 0) 0x55bfc5eae780 con 0x55bfc52a6800
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.818 7fbba53e6700 1 -- 172.21.15.1:0/14841 --> 172.21.15.144:6812/27580 -- osd_op(client.4264.0:7825 2.0 2:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=1 ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected e134) v8 -- ?+0 0x55bfc8c0c680 con 0x55bfc52a6800
</pre>
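<p>The shape of that log can be sketched with a toy model (purely illustrative; <code>ToyObjecter</code>, <code>submit</code>, <code>cancel</code>, and <code>backoff_release</code> are invented names, not Ceph interfaces): the PG cancels its proxied ops one at a time, and a backoff release arriving between the cancels lets the objecter send the op that has not been cancelled yet.</p>

```python
class ToyObjecter:
    """Toy stand-in for the objecter's send queue (not real Ceph code)."""
    def __init__(self):
        self.pending = []   # ops parked because the target has a backoff
        self.sent = []      # ops actually put on the wire

    def submit(self, op):
        self.pending.append(op)     # target blocked: op just parks here

    def cancel(self, op):
        self.pending.remove(op)     # PG takes the op back to requeue it

    def backoff_release(self):
        # target unblocked: everything still pending goes out
        self.sent.extend(self.pending)
        self.pending.clear()

obj = ToyObjecter()
for op in ["A", "B", "C"]:
    obj.submit(op)          # three proxied ops started in epoch E

obj.cancel("A")             # pg cancels/requeues A
obj.cancel("B")             # pg cancels/requeues B
obj.backoff_release()       # backoff release lands between the cancels...
assert obj.sent == ["C"]    # ...so C goes out before the pg can cancel it
```

<p>That final send is the same pattern as the log above: an op is (re)sent on the wire even though the PG was about to cancel and requeue it.</p>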
<p>one fix might be to simply not cancel proxy ops since we can't cancel them all atomically.</p>
<p>another fix might be to generate a list of all ops to cancel and cancel them all atomically in one objecter call.</p>

<p><strong>Updated by Sage Weil on 2017-11-15T14:31:13Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=102268">journal 102268</a>)</p>
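<p>The batch-cancel idea might look like this toy sketch (again with invented names, not the real Objecter API): the PG builds the full list of ops first, and the objecter removes the whole batch under a single lock hold, so a backoff release cannot interleave between individual cancels.</p>

```python
import threading

class ToyObjecter:
    """Toy model of atomic batch cancellation (not real Ceph code)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.pending = []
        self.sent = []

    def submit(self, op):
        with self.lock:
            self.pending.append(op)

    def cancel_ops(self, ops):
        # One call, one lock hold: no send can slip in mid-batch.
        with self.lock:
            for op in ops:
                self.pending.remove(op)

    def backoff_release(self):
        with self.lock:
            self.sent.extend(self.pending)
            self.pending.clear()

obj = ToyObjecter()
for op in ["A", "B", "C"]:
    obj.submit(op)

obj.cancel_ops(["A", "B", "C"])   # whole batch cancelled atomically
obj.backoff_release()             # the release now finds nothing to send
assert obj.sent == []
```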
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>RADOS</i></li><li><strong>Category</strong> set to <i>Tiering</i></li></ul>

<p><strong>Updated by Greg Farnum on 2017-11-29T16:46:19Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=102920">journal 102920</a>)</p>
<p>We could also cancel them in reverse order; that won't break ordering and doesn't require new interfaces.</p>

<p><strong>Updated by Sage Weil on 2017-11-29T16:46:28Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=102921">journal 102921</a>)</p>
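<p>One way to see why newest-first cancellation is safe, again as a toy model with illustrative names: between any two cancels, every op still sendable in the objecter is older than every op already cancelled, so a send that sneaks in cannot overtake a cancelled/requeued op.</p>

```python
# Ops parked in the objecter, oldest to newest (illustrative names;
# alphabetical order doubles as submission order here).
pending = ["A", "B", "C"]
cancelled = []

# Cancel newest-first: C, then B, then A.
for op in reversed(list(pending)):
    pending.remove(op)
    cancelled.append(op)
    # Invariant: anything a backoff release could send right now
    # (still in `pending`) is older than every already-cancelled op,
    # so a stray send cannot jump ahead of a requeued op.
    assert all(p < c for p in pending for c in cancelled)

assert pending == [] and cancelled == ["C", "B", "A"]
```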
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>12</i></li></ul>

<p><strong>Updated by David Zafman on 2018-01-22T21:36:25Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=105518">journal 105518</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-7 priority-highest closed" href="/issues/22064">Bug #22064</a>: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi</i> added</li></ul>

<p><strong>Updated by David Zafman on 2018-02-16T16:59:17Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107535">journal 107535</a>)</p>
<ul><li><strong>Related to</strong> deleted (<i><a class="issue tracker-1 status-10 priority-7 priority-highest closed" href="/issues/22064">Bug #22064</a>: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi</i>)</li></ul>

<p><strong>Updated by David Zafman on 2018-02-16T16:59:29Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107537">journal 107537</a>)</p>
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-7 priority-highest closed" href="/issues/22064">Bug #22064</a>: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi</i> added</li></ul>

<p><strong>Updated by David Zafman on 2018-02-16T17:00:00Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107540">journal 107540</a>)</p>
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/22063">Bug #22063</a>: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" in rados-jewel-distro-basic-smithi</i> added</li></ul>

<p><strong>Updated by Nathan Cutler on 2018-02-16T19:52:57Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107547">journal 107547</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-1 priority-4 priority-default" href="/issues/22743">Bug #22743</a>: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi</i> added</li></ul>

<p><strong>Updated by Sage Weil on 2018-02-19T14:55:41Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107602">journal 107602</a>)</p>
<ul><li><strong>Status</strong> changed from <i>12</i> to <i>Fix Under Review</i></li></ul><p><a class="external" href="https://github.com/ceph/ceph/pull/20484">https://github.com/ceph/ceph/pull/20484</a></p>
<p>I opted for the marginally more complex solution of cancelling multiple ops atomically. The objecter change to do it is trivial.</p>

<p><strong>Updated by Sage Weil on 2018-02-21T16:34:37Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107775">journal 107775</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>luminous</i></li></ul>

<p><strong>Updated by Nathan Cutler on 2018-02-21T21:19:45Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107788">journal 107788</a>)</p>
<ul><li><strong>Backport</strong> changed from <i>luminous</i> to <i>luminous, jewel</i></li></ul>

<p><strong>Updated by Nathan Cutler on 2018-02-21T21:26:50Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107791">journal 107791</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/23075">Backport #23075</a>: luminous: osd: objecter sends out of sync with pg epochs for proxied ops</i> added</li></ul>

<p><strong>Updated by Nathan Cutler on 2018-02-21T21:26:52Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=107793">journal 107793</a>)</p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/23076">Backport #23076</a>: jewel: osd: objecter sends out of sync with pg epochs for proxied ops</i> added</li></ul>

<p><strong>Updated by Nathan Cutler on 2018-03-30T17:10:06Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=109948">journal 109948</a>)</p>
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>

<p><strong>Updated by Nathan Cutler on 2018-04-09T02:08:29Z</strong> (<a href="https://tracker.ceph.com/issues/22123?journal_id=110488">journal 110488</a>)</p>
<p>Despite the jewel backport of this fix being merged, this problem has reappeared in jewel 10.2.11 integration testing. See <a class="issue tracker-1 status-10 priority-6 priority-high2 closed" title="Bug: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test (Duplicate)" href="https://tracker.ceph.com/issues/23598">#23598</a></p>

<p><strong>Updated by Nathan Cutler on 2018-04-09T02:08:45Z</strong> (<a href="https://tracker.ceph.com/issues/23598?journal_id=110489">journal 110489</a>)</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-10 priority-6 priority-high2 closed" href="/issues/23598">Bug #23598</a>: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test</i> added</li></ul>