<h1>Ceph RADOS - Bug #18746: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)</h1>
<p><strong>Jason Dillaman</strong> (dillaman@redhat.com), 2017-05-02T12:36:08Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=90623">journal #90623</a></p>
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>Urgent</i></li></ul>
<hr /><p><strong>Greg Farnum</strong> (gfarnum@redhat.com), 2017-06-07T15:59:41Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=92383">journal #92383</a></p>
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>RADOS</i></li><li><strong>Category</strong> changed from <i>Monitor</i> to <i>Snapshots</i></li></ul><p>Pretty weird, that assert appears to be an internal interval_set consistency thing: <a class="external" href="https://github.com/ceph/ceph/blob/jewel/src/include/interval_set.h#L355">https://github.com/ceph/ceph/blob/jewel/src/include/interval_set.h#L355</a></p>
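<p>For illustration, here is a minimal Python sketch (not Ceph's actual C++ code; the class and names are hypothetical) of the invariant that interval_set maintains: intervals are stored as start/length pairs and must stay disjoint, and an insert that overlaps an existing interval is exactly the kind of internal inconsistency these asserts catch:</p>

```python
# Hypothetical sketch of interval_set's disjointness invariant; not Ceph code.
import bisect

class SimpleIntervalSet:
    def __init__(self):
        self.starts = []  # sorted interval start points
        self.lens = {}    # start -> length

    def insert(self, start, length):
        """Insert [start, start+length); return False where Ceph would assert."""
        i = bisect.bisect_left(self.starts, start)
        if i > 0:
            prev = self.starts[i - 1]
            if prev + self.lens[prev] > start:
                return False  # overlaps the interval before it ("FAILED assert(0)")
        if i < len(self.starts) and self.starts[i] < start + length:
            return False      # overlaps the next interval (assert(p->first > start+len))
        self.starts.insert(i, start)
        self.lens[start] = length
        return True

s = SimpleIntervalSet()
print(s.insert(10, 5))  # True: [10, 15) is new
print(s.insert(12, 2))  # False: [12, 14) overlaps [10, 15)
```

<p>(The real interval_set also merges adjacent intervals on insert; the sketch simply rejects any overlap, which is the case that fires the assert.)</p>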
<p>Didn't see in a quick skim how it could fail either.</p>
<hr /><p><strong>Yiorgos Stamoulis</strong>, 2017-06-08T11:00:37Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=92413">journal #92413</a></p>
<ul></ul><p>Hi Greg,</p>
<p>Thank you for taking the time to look into this.</p>
<p>Following the incident in this ticket, the cluster was upgraded to Kraken. We then had a further issue (<a class="external" href="http://tracker.ceph.com/issues/19824">http://tracker.ceph.com/issues/19824</a>).</p>
<p>I believe my colleague Ross was given a hint in IRC that the problem could be stale/inconsistent metadata in a pool. We then recreated the pools and all has been good since. Hope this makes sense.</p>
<p>So, as far as our cluster is concerned, all is good now. However, given the pointer that some bad pool metadata may have triggered this, it may be a good idea to get to the bottom of it in the interest of Ceph stability.</p>
<hr /><p><strong>Greg Farnum</strong> (gfarnum@redhat.com), 2017-06-17T07:57:50Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=93086">journal #93086</a></p>
<ul><li><strong>Duplicated by</strong> <i><a class="issue tracker-1 status-10 priority-4 priority-default closed" href="/issues/19824">Bug #19824</a>: Reccurance of #18746(Jewel) in (Kraken)</i> added</li></ul>
<hr /><p><strong>Sage Weil</strong> (sage@newdream.net), 2017-06-28T15:47:35Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=93945">journal #93945</a></p>
<ul><li><strong>Subject</strong> changed from <i>monitors crashing ./include/interval_set.h: 355: FAILED assert(0)</i> to <i>monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel)</i></li></ul>
<hr /><p><strong>Sage Weil</strong> (sage@newdream.net), 2017-07-12T15:18:03Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=94974">journal #94974</a></p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Need More Info</i></li><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>High</i></li></ul>
<hr /><p><strong>Sage Weil</strong> (sage@newdream.net), 2017-07-12T15:18:47Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=94975">journal #94975</a></p>
<ul></ul><p>The fact that the error stopped when cinder was stopped makes me think this was related to in-flight requests from those services. Maybe two racing pool snap updates were interfering?</p>
<hr /><p><strong>Sage Weil</strong> (sage@newdream.net), 2017-11-29T17:08:56Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=102949">journal #102949</a></p>
<ul><li><strong>Subject</strong> changed from <i>monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel)</i> to <i>monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)</i></li></ul>
<hr /><p><strong>Paul Emmerich</strong> (paul.emmerich@oocero.de), 2018-02-14T22:02:23Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=107397">journal #107397</a></p>
<ul><li><strong>File</strong> <a href="/attachments/download/3242/ceph-mon-crash-on-delete.txt">ceph-mon-crash-on-delete.txt</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/3242/ceph-mon-crash-on-delete.txt">View</a> added</li></ul><p>I'm seeing this on Luminous. Some kRBD clients are sending requests of death killing the active monitor.<br />No special config on either mons or clients.</p>
<pre>
-7> 2018-02-14 13:23:48.987841 7fc50dd00700 1 -- 10.255.0.30:6789/0 <== client.1621590 10.255.0.106:0/1162398434 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name v0) v4 ==== 65+0+0 (3119176652 0 0) 0x5587d8ec0d80 con 0x5587d8f98800
-6> 2018-02-14 13:23:48.987849 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x5587d891e580 for client.1621590 10.255.0.106:0/1162398434
-5> 2018-02-14 13:23:48.987854 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1 caps profile rbd
-4> 2018-02-14 13:23:48.987856 7fc50dd00700 5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos updating c 4935163..4935904) is_readable = 1 - now=2018-02-14 13:23:48.987856 lease_expire=2018-02-14 13:23:53.929312 has v0 lc 4935904
-3> 2018-02-14 13:23:48.987864 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name v0) v4 from client.1621590 10.255.0.106:0/1162398434
-2> 2018-02-14 13:23:48.987879 7fc50dd00700 7 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name v0) v4 from client.1621590 10.255.0.106:0/1162398434
-1> 2018-02-14 13:23:48.987888 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name v0) v4
0> 2018-02-14 13:23:48.991163 7fc50dd00700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7fc50dd00700 time 2018-02-14 13:23:48.987900
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5587cebc0330]
2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x5587ced0965b]
3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5587cecf6a1d]
4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe50) [0x5587ceaeb0f0]
5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x336) [0x5587ceb22b16]
6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x5587ceaaf908]
7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x52f) [0x5587ce99281f]
8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x5587ce993e5b]
9: (Monitor::ms_dispatch(Message*)+0x23) [0x5587ce9bf923]
10: (DispatchQueue::entry()+0x792) [0x5587cee6c162]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x5587cec67b6d]
12: (()+0x7e25) [0x7fc5172e8e25]
13: (clone()+0x6d) [0x7fc51488334d]
</pre>
<p>Full log with debug mon 20/20 is attached.</p>
<hr /><p><strong>Paul Emmerich</strong> (paul.emmerich@oocero.de), 2018-02-15T16:13:01Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=107451">journal #107451</a></p>
<ul></ul><p>I've got a cluster here where this issue is 100% reproducible when trying to delete snapshots. Let me know if we can do/test anything to help debug this.</p>
<hr /><p><strong>Paul Emmerich</strong> (paul.emmerich@oocero.de), 2018-02-27T17:46:48Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=108177">journal #108177</a></p>
<ul></ul><p>Still happening on 12.2.4</p>
<pre>
-9> 2018-02-27 09:44:28.558861 7f721f786700 1 -- 10.255.0.30:6789/0 <== client.3417026 10.255.0.52:0/2767607806 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name v0) v4 ==== 65+0+0 (3863449434 0 0) 0x55b72b5bb680 con 0x55b72b586000
-8> 2018-02-27 09:44:28.558868 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x55b72bcf5d80 for client.3417026 10.255.0.52:0/2767607806
-7> 2018-02-27 09:44:28.558884 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1 caps profile rbd
-6> 2018-02-27 09:44:28.558887 7f721f786700 5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos writing c 6153015..6153739) is_readable = 1 - now=2018-02-27 09:44:28.558887 lease_expire=2018-02-27 09:44:32.953713 has v0 lc 6153739
-5> 2018-02-27 09:44:28.558894 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name v0) v4 from client.3417026 10.255.0.52:0/2767607806
-4> 2018-02-27 09:44:28.558908 7f721f786700 7 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name v0) v4 from client.3417026 10.255.0.52:0/2767607806
-3> 2018-02-27 09:44:28.558915 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name v0) v4
-2> 2018-02-27 09:44:28.562240 7f721f786700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f721f786700 time 2018-02-27 09:44:28.558926
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b7209a0470]
2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x55b720ae91ab]
3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x55b720ad65ed]
4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe40) [0x55b7208cb4a0]
5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x33e) [0x55b720902f0e]
6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55b72088f6a8]
7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x5af) [0x55b72076fc9f]
8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55b72077128b]
9: (Monitor::ms_dispatch(Message*)+0x23) [0x55b72079d3f3]
10: (DispatchQueue::entry()+0x792) [0x55b720c4fd92]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b720a477fd]
12: (()+0x7e25) [0x7f722898fe25]
13: (clone()+0x6d) [0x7f7225b0734d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>
<hr /><p><strong>Paul Emmerich</strong> (paul.emmerich@oocero.de), 2018-02-27T18:49:03Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=108185">journal #108185</a></p>
<ul></ul><p>To summarize what I've figured out to reproduce this:</p>
<ul>
<li>both rbd client and mon are running 12.2.4, happened with 12.2.2 as well</li>
<li>crashes when running "rbd snap rm" on some snapshots</li>
<li>seen it happen on two different clusters, multiple different snapshots trigger this</li>
<li>things that the clusters and snapshots have in common: the crashing images are based on a non-flattened clone in an erasure-coded pool</li>
</ul>
<p>The OpenStack test referenced in the original description also seems to be using an image with a parent.</p>
<hr /><p><strong>Paul Emmerich</strong> (paul.emmerich@oocero.de), 2018-03-11T00:36:41Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=108867">journal #108867</a></p>
<ul></ul><p>Well, turns out there were both 12.2.1 and 12.2.4 clients doing snapshot operations. This messed up removed_snaps due to issue <a class="issue tracker-1 status-3 priority-7 priority-highest closed" title="Bug: rbd does not delete snaps in (ec) data pool (Resolved)" href="https://tracker.ceph.com/issues/21567">#21567</a> with data_pool handling in the librbd snapshot functionality.</p>
<p>So my bug is different from the issue originally described here, since that one predates data_pool.</p>
<hr /><p><strong>Paul Emmerich</strong> (paul.emmerich@oocero.de), 2018-03-11T02:19:48Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=108870">journal #108870</a></p>
<ul></ul><p>Anyways, the only place where this can happen is if <code>snap_seq &lt; max(removed_snaps)</code>, because the deletion request inserts two elements into removed_snaps but only checks the precondition for one insert. The precondition for the second one will be violated if anyone ever sent a snap delete request for a snap id larger than snap_seq (which shouldn't be allowed).</p>
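<p>That failure mode can be sketched as follows (simplified Python, not the monitor's C++; a plain set stands in for the removed_snaps interval_set, and the function models the two inserts described above):</p>

```python
# Sketch of the unchecked second insert described above; not Ceph code.
removed_snaps = set()  # stand-in for the removed_snaps interval_set
snap_seq = 5

def remove_unmanaged_snap(s):
    global snap_seq
    assert s not in removed_snaps  # the only precondition that is checked
    removed_snaps.add(s)           # first insert: the snap being deleted
    snap_seq += 1
    if snap_seq in removed_snaps:  # second insert is unchecked in the real code
        raise AssertionError("interval_set insert would hit FAILED assert here")
    removed_snaps.add(snap_seq)

remove_unmanaged_snap(7)  # rogue delete with snap id above snap_seq (shouldn't be allowed)
try:
    remove_unmanaged_snap(2)  # legitimate delete: snap_seq advances onto a removed id
except AssertionError:
    print("monitor would crash here")
```

<p>Once a snap id above snap_seq has been marked removed, the next legitimate delete advances snap_seq into the already-removed range and the second insert collides, which is where the monitor asserts.</p>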
<p>I've fixed both of these in <a class="external" href="https://github.com/ceph/ceph/pull/20835">https://github.com/ceph/ceph/pull/20835</a></p>
<hr /><p><strong>Greg Farnum</strong> (gfarnum@redhat.com), 2018-04-04T21:21:10Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=110185">journal #110185</a></p>
<ul><li><strong>Status</strong> changed from <i>Need More Info</i> to <i>4</i></li></ul><p>Under discussion on the PR, which is good on its own terms but suffering from a prior CephFS bug. :(</p>
<hr /><p><strong>Greg Farnum</strong> (gfarnum@redhat.com), 2018-04-10T17:47:07Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=110791">journal #110791</a></p>
<ul><li><strong>Status</strong> changed from <i>4</i> to <i>Fix Under Review</i></li></ul>
<hr /><p><strong>Sage Weil</strong> (sage@newdream.net), 2018-04-27T19:30:07Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=112199">journal #112199</a></p>
<ul><li><strong>Status</strong> changed from <i>Fix Under Review</i> to <i>Pending Backport</i></li><li><strong>Backport</strong> set to <i>luminous</i></li></ul>
<hr /><p><strong>Nathan Cutler</strong> (ncutler@suse.cz), 2018-04-27T20:07:24Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=112206">journal #112206</a></p>
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/23915">Backport #23915</a>: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)</i> added</li></ul>
<hr /><p><strong>Samuel Just</strong> (sjust@redhat.com), 2019-08-21T20:37:18Z, <a href="https://tracker.ceph.com/issues/18746?journal_id=143840">journal #143840</a></p>
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul><p>Already backported to luminous.</p>