Bug #18746

monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)

Added by Yiorgos Stamoulis almost 2 years ago. Updated 9 months ago.

Status: Pending Backport
Priority: High
Assignee: -
Category: Snapshots
Target version: -
Start date: 01/30/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

Afternoon! It would be great if anyone could shed any light on a pretty serious issue we had last week.

Essentially, 2 out of 3 monitors of a cluster failed within seconds of each other with the error below (extract from ceph-mon.monitor-2.log shown; see attachments for more details), leaving the third monitor unable to reach quorum and causing our Ceph cluster to grind to a halt!


2017-01-27 15:58:37.309514 7f82ff93f700 -1 ./include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f82ff93f700 time 2017-01-27 15:58:37.305538
./include/interval_set.h: 355: FAILED assert(0)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5558c6244bfb]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x35c) [0x5558c6309c3c]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5558c62ffa2d]
 4: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xe34) [0x5558c5f39864]
 5: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x28f) [0x5558c5f5b8bf]
 6: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xdab) [0x5558c5f0b24b]
 7: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5558c5f0e824]
 8: (C_MonOp::finish(int)+0x69) [0x5558c5edb539]
 9: (Context::complete(int)+0x9) [0x5558c5eda6d9]
 10: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x5558c5ee0934]
 11: (Paxos::finish_round()+0x10b) [0x5558c5f0306b]
 12: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xee4) [0x5558c5f044e4]
 13: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x5558c5f04e74]
 14: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xb75) [0x5558c5ed5b05]
 15: (Monitor::_ms_dispatch(Message*)+0x6c1) [0x5558c5ed65f1]
 16: (Monitor::ms_dispatch(Message*)+0x23) [0x5558c5ef5873]
 17: (DispatchQueue::entry()+0x78b) [0x5558c632d58b]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x5558c622a68d]
 19: (()+0x8184) [0x7f830881e184]
 20: (clone()+0x6d) [0x7f8306b7037d]

This is a cluster supporting OpenStack (cinder & glance, Liberty release), currently under testing.

Standard practice at our company is to run refstack (https://wiki.openstack.org/wiki/RefStack), a tool that tests OpenStack functionality, and we believe the test tempest.api.volume.test_volumes_get.VolumesV2GetTest.test_volume_create_get_update_delete_as_clone[id-3f591b4a-7dc6-444c-bd51-77469506b3a1] (https://github.com/openstack/tempest/blob/master/tempest/api/volume/test_volumes_get.py) triggered the unexpected response from Ceph.

Subsequent restarts of ceph monitors failed until we stopped cinder & nova services on the OpenStack Cluster. After that, both clusters were able to recover.

We have tried, but were unable, to replicate the crash.

In order to ensure the availability of the cluster we would like to determine the conditions that caused the monitor crashes and whether they were indeed related to refstack actions or something entirely different.

ceph-mon.log.tar.gz - monitor logs (410 KB) Yiorgos Stamoulis, 01/31/2017 02:13 PM

ceph-mon-crash-on-delete.txt View (599 KB) Paul Emmerich, 02/14/2018 10:02 PM


Related issues

Duplicated by Ceph - Bug #19824: Reccurance of #18746(Jewel) in (Kraken) Duplicate 05/02/2017
Copied to RADOS - Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) Resolved

Associated revisions

Revision f42a6ba4 (diff)
Added by Paul Emmerich 10 months ago

mon: handle bad snapshot removal reqs gracefully

Snapshot deletion requests on snap ids larger than the snap_seq of
the pool will leave the pool in a state with snap_seq being less
than max(removed_snaps).

This is bad because further deletion requests to a pool in this state
might crash the mon in some cases: the deletion also inserts the new
snap_seq into the removed_snaps set -- which might already exist
in this case and trigger an assert.

Such bad requests will be generated by rbd clients without a fix for
issue #21567.

The change in OSDMonitor prevents pools from getting into this state
and may prevent old broken clients from incorrectly deleting snaps.
The change in osd_types avoids a crash if a pool is already in this
state.

Fixes #18746

Signed-off-by: Paul Emmerich <>
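
As an illustration of the two changes the commit message describes, a minimal, self-contained sketch follows. All names here (Pool, remove_unmanaged_snap, prepare_remove_unmanaged_snap) and the std::set stand-in for interval_set<snapid_t> are illustrative only; this is not the actual patch (see https://github.com/ceph/ceph/pull/20835 for that).

    // Toy Pool stands in for pg_pool_t; std::set stands in for interval_set<snapid_t>.
    #include <cassert>
    #include <cstdint>
    #include <set>

    struct Pool {
      uint64_t snap_seq;
      std::set<uint64_t> removed_snaps;
    };

    // osd_types side: tolerate a pool already in the snap_seq < max(removed_snaps)
    // state by not re-inserting an id that removed_snaps already contains.
    void remove_unmanaged_snap(Pool& p, uint64_t snapid) {
      p.removed_snaps.insert(snapid);
      ++p.snap_seq;
      if (!p.removed_snaps.count(p.snap_seq))  // the overlapping insert the assert caught
        p.removed_snaps.insert(p.snap_seq);
    }

    // OSDMonitor side: refuse snap ids beyond snap_seq so a broken client cannot
    // push the pool into the bad state in the first place.
    bool prepare_remove_unmanaged_snap(Pool& p, uint64_t snapid) {
      if (snapid > p.snap_seq || p.removed_snaps.count(snapid))
        return false;  // reported back to the client instead of crashing the mon
      remove_unmanaged_snap(p, snapid);
      return true;
    }

    int main() {
      Pool p{3, {5}};                                // already in the bad state: 5 > snap_seq
      assert(!prepare_remove_unmanaged_snap(p, 7));  // bad request is rejected
      assert(prepare_remove_unmanaged_snap(p, 2));   // valid delete; snap_seq becomes 4
      remove_unmanaged_snap(p, 3);                   // snap_seq becomes 5: already present, but no crash
    }

The monitor-side check keeps pools out of the snap_seq < max(removed_snaps) state, while the osd_types-side check keeps an already-damaged pool from asserting.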

Revision e2632826 (diff)
Added by Paul Emmerich 9 months ago

mon: handle bad snapshot removal reqs gracefully

Snapshot deletion requests on snap ids larger than the snap_seq of
the pool will leave the pool in a state with snap_seq being less
than max(removed_snaps).

This is bad because further deletion requests to a pool in this state
might crash the mon in some cases: the deletion also inserts the new
snap_seq into the removed_snaps set -- which might already exist
in this case and trigger an assert.

Such bad requests will be generated by rbd clients without a fix for
issue #21567.

The change in OSDMonitor prevents pools from getting into this state
and may prevent old broken clients from incorrectly deleting snaps.
The change in osd_types avoids a crash if a pool is already in this
state.

Fixes #18746

Signed-off-by: Paul Emmerich <>
(cherry picked from commit f42a6ba4c37cb10869718ec296352ae55254b44a)

Conflicts:
src/osd/osd_types.cc

History

#1 Updated by Jason Dillaman over 1 year ago

  • Priority changed from Normal to Urgent

#2 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Category changed from Monitor to Snapshots

Pretty weird; that assert appears to be an internal interval_set consistency check: https://github.com/ceph/ceph/blob/jewel/src/include/interval_set.h#L355

Didn't see in a quick skim how it could fail either.
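
For anyone reading along, a toy version of the invariant that assert protects (not the real header, just the shape of the check): interval_set::insert() requires that the inserted range not overlap anything already in the set, which is exactly what a duplicate removed-snap id would violate.

    #include <cassert>
    #include <cstdint>
    #include <iterator>
    #include <map>

    struct ToyIntervalSet {
      std::map<uint64_t, uint64_t> m;  // start -> length

      void insert(uint64_t start, uint64_t len) {
        auto next = m.lower_bound(start);
        if (next != m.begin()) {
          auto prev = std::prev(next);
          assert(prev->first + prev->second <= start);  // must not overlap the range before
        }
        if (next != m.end())
          assert(start + len <= next->first);  // roughly analogous to "FAILED assert(p->first > start+len)"
        m[start] = len;
      }
    };

    int main() {
      ToyIntervalSet removed_snaps;  // plays the role of pg_pool_t::removed_snaps
      removed_snaps.insert(5, 1);    // snap id 5 recorded as removed
      removed_snaps.insert(5, 1);    // same id inserted again -> the assert fires
    }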

#3 Updated by Yiorgos Stamoulis over 1 year ago

Hi Greg,

Thank you for taking the time to look into this.

Following the incident in this ticket, the cluster was upgraded to Kraken. We then had a further issue (http://tracker.ceph.com/issues/19824).

I believe my colleague Ross was given a hint on IRC that the problem could be stale/inconsistent metadata in a pool. We then recreated the pools and all has been good since. Hope this makes sense.

So, as far as our cluster is concerned, all is good now. However, with the pointer that some bad pool metadata may have triggered this, it may be a good idea to get to the bottom of it in the interest of Ceph stability.

#4 Updated by Greg Farnum over 1 year ago

  • Duplicated by Bug #19824: Reccurance of #18746(Jewel) in (Kraken) added

#5 Updated by Sage Weil over 1 year ago

  • Subject changed from monitors crashing ./include/interval_set.h: 355: FAILED assert(0) to monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel)

#6 Updated by Sage Weil over 1 year ago

  • Status changed from New to Need More Info
  • Priority changed from Urgent to High

#7 Updated by Sage Weil over 1 year ago

The fact that the error stopped when cinder was stopped makes me think this was related to in-flight requests from those services. Maybe two racing pool snap updates were interfering?

#8 Updated by Sage Weil about 1 year ago

  • Subject changed from monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel) to monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)

#9 Updated by Paul Emmerich 11 months ago

I'm seeing this on Luminous. Some kRBD clients are sending "requests of death" that kill the active monitor.
No special config on either mons or clients.

    -7> 2018-02-14 13:23:48.987841 7fc50dd00700  1 -- 10.255.0.30:6789/0 <== client.1621590 10.255.0.106:0/1162398434 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 ==== 65+0+0 (3119176652 0 0) 0x5587d8ec0d80 con 0x5587d8f98800
    -6> 2018-02-14 13:23:48.987849 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x5587d891e580 for client.1621590 10.255.0.106:0/1162398434
    -5> 2018-02-14 13:23:48.987854 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1  caps profile rbd
    -4> 2018-02-14 13:23:48.987856 7fc50dd00700  5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos updating c 4935163..4935904) is_readable = 1 - now=2018-02-14 13:23:48.987856 lease_expire=2018-02-14 13:23:53.929312 has v0 lc 4935904
    -3> 2018-02-14 13:23:48.987864 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 from client.1621590 10.255.0.106:0/1162398434
    -2> 2018-02-14 13:23:48.987879 7fc50dd00700  7 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 from client.1621590 10.255.0.106:0/1162398434
    -1> 2018-02-14 13:23:48.987888 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4
     0> 2018-02-14 13:23:48.991163 7fc50dd00700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7fc50dd00700 time 2018-02-14 13:23:48.987900
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5587cebc0330]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x5587ced0965b]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5587cecf6a1d]
 4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe50) [0x5587ceaeb0f0]
 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x336) [0x5587ceb22b16]
 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x5587ceaaf908]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x52f) [0x5587ce99281f]
 8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x5587ce993e5b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x5587ce9bf923]
 10: (DispatchQueue::entry()+0x792) [0x5587cee6c162]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x5587cec67b6d]
 12: (()+0x7e25) [0x7fc5172e8e25]
 13: (clone()+0x6d) [0x7fc51488334d]

Full log with debug mon 20/20 is attached.

#10 Updated by Paul Emmerich 11 months ago

I've got a cluster here where this issue is 100% reproducible when trying to delete snapshots. Let me know if we can do/test anything to help debug this.

#11 Updated by Paul Emmerich 11 months ago

Still happening on 12.2.4

    -9> 2018-02-27 09:44:28.558861 7f721f786700  1 -- 10.255.0.30:6789/0 <== client.3417026 10.255.0.52:0/2767607806 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 ==== 65+0+0 (3863449434 0 0) 0x55b72b5bb680 con 0x55b72b586000
    -8> 2018-02-27 09:44:28.558868 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x55b72bcf5d80 for client.3417026 10.255.0.52:0/2767607806
    -7> 2018-02-27 09:44:28.558884 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1  caps profile rbd
    -6> 2018-02-27 09:44:28.558887 7f721f786700  5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos writing c 6153015..6153739) is_readable = 1 - now=2018-02-27 09:44:28.558887 lease_expire=2018-02-27 09:44:32.953713 has v0 lc 6153739
    -5> 2018-02-27 09:44:28.558894 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 from client.3417026 10.255.0.52:0/2767607806
    -4> 2018-02-27 09:44:28.558908 7f721f786700  7 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 from client.3417026 10.255.0.52:0/2767607806
    -3> 2018-02-27 09:44:28.558915 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4
    -2> 2018-02-27 09:44:28.562240 7f721f786700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f721f786700 time 2018-02-27 09:44:28.558926
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b7209a0470]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x55b720ae91ab]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x55b720ad65ed]
 4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe40) [0x55b7208cb4a0]
 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x33e) [0x55b720902f0e]
 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55b72088f6a8]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x5af) [0x55b72076fc9f]
 8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55b72077128b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55b72079d3f3]
 10: (DispatchQueue::entry()+0x792) [0x55b720c4fd92]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b720a477fd]
 12: (()+0x7e25) [0x7f722898fe25]
 13: (clone()+0x6d) [0x7f7225b0734d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#12 Updated by Paul Emmerich 11 months ago

To summarize what I've figured out to reproduce this:

  • both rbd client and mon are running 12.2.4, happened with 12.2.2 as well
  • crashes when running "rbd snap rm" on some snapshots
  • seen it happen on two different clusters; multiple different snapshots trigger this
  • things that the clusters and snapshots have in common: the crashing images are based on a non-flattened clone in an erasure-coded pool

The OpenStack test referenced in the original description also seems to be using an image with a parent.

#13 Updated by Paul Emmerich 10 months ago

Well, turns out there were both 12.2.1 and 12.2.4 clients doing snapshot operations. This messed up removed_snaps due to issue #21567 with data_pool handling in the librbd snapshot functionality.

So my bug is different from the issue originally described here, since that issue predates data_pool.

#14 Updated by Paul Emmerich 10 months ago

Anyway, the only place where this can happen is if snap_seq < max(removed_snaps), because the deletion request inserts two elements into removed_snaps but only checks the precondition for one insert. The precondition for the second one will be violated if anyone ever sent a snap delete request for a snap id larger than snap_seq (which shouldn't be allowed).

I've fixed both of these in https://github.com/ceph/ceph/pull/20835
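
To make the failure path concrete, here is a toy walk-through of the double insert described above (same caveat as the earlier sketches: Pool and its members are stand-ins, not Ceph code).

    #include <cassert>
    #include <cstdint>
    #include <set>

    struct Pool {
      uint64_t snap_seq;
      std::set<uint64_t> removed_snaps;  // stand-in for interval_set<snapid_t>

      void remove_unmanaged_snap(uint64_t snapid) {
        removed_snaps.insert(snapid);            // insert #1: validated by the caller
        ++snap_seq;
        assert(!removed_snaps.count(snap_seq));  // insert #2's precondition, never checked upstream
        removed_snaps.insert(snap_seq);
      }
    };

    int main() {
      // A buggy client earlier "deleted" snap id 5 while snap_seq was only 4,
      // leaving the pool with snap_seq < max(removed_snaps).
      Pool p{4, {5}};
      p.remove_unmanaged_snap(2);  // valid-looking delete: snap_seq bumps to 5, which is
                                   // already in removed_snaps -> the second insert asserts
    }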

#15 Updated by Greg Farnum 10 months ago

  • Status changed from Need More Info to Feedback

Under discussion on the PR, which is good on its own terms but suffering from a prior CephFS bug. :(

#16 Updated by Greg Farnum 9 months ago

  • Status changed from Feedback to Need Review

#17 Updated by Sage Weil 9 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#18 Updated by Nathan Cutler 9 months ago

  • Copied to Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) added
