Bug #18746

monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)

Added by Yiorgos Stamoulis almost 2 years ago. Updated 9 months ago.

Status: Pending Backport
Priority: High
Assignee: -
Category: Snapshots
Target version: -
Start date: 01/30/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

Afternoon! It would be great if anyone could shed any light on a pretty serious issue we had last week.

Essentially, 2 out of 3 monitors of a cluster failed within seconds of each other with the error below (extract from ceph-mon.monitor-2.log shown; see attachments for more details), leaving the third monitor unable to reach quorum and causing our Ceph cluster to grind to a halt!


2017-01-27 15:58:37.309514 7f82ff93f700 -1 ./include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f82ff93f700 time 2017-01-27 15:58:37.305538
./include/interval_set.h: 355: FAILED assert(0)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5558c6244bfb]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x35c) [0x5558c6309c3c]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5558c62ffa2d]
 4: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xe34) [0x5558c5f39864]
 5: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x28f) [0x5558c5f5b8bf]
 6: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xdab) [0x5558c5f0b24b]
 7: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5558c5f0e824]
 8: (C_MonOp::finish(int)+0x69) [0x5558c5edb539]
 9: (Context::complete(int)+0x9) [0x5558c5eda6d9]
 10: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x5558c5ee0934]
 11: (Paxos::finish_round()+0x10b) [0x5558c5f0306b]
 12: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xee4) [0x5558c5f044e4]
 13: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x5558c5f04e74]
 14: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xb75) [0x5558c5ed5b05]
 15: (Monitor::_ms_dispatch(Message*)+0x6c1) [0x5558c5ed65f1]
 16: (Monitor::ms_dispatch(Message*)+0x23) [0x5558c5ef5873]
 17: (DispatchQueue::entry()+0x78b) [0x5558c632d58b]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x5558c622a68d]
 19: (()+0x8184) [0x7f830881e184]
 20: (clone()+0x6d) [0x7f8306b7037d]

This is a cluster supporting OpenStack (cinder & glance, Liberty release), currently under testing.

Standard practice at our company is to run refstack (https://wiki.openstack.org/wiki/RefStack), a tool that tests OpenStack functionality, and we believe the test tempest.api.volume.test_volumes_get.VolumesV2GetTest.test_volume_create_get_update_delete_as_clone[id-3f591b4a-7dc6-444c-bd51-77469506b3a1] (https://github.com/openstack/tempest/blob/master/tempest/api/volume/test_volumes_get.py) triggered the unexpected response from Ceph.

Subsequent restarts of ceph monitors failed until we stopped cinder & nova services on the OpenStack Cluster. After that, both clusters were able to recover.

We have tried, but were unable, to replicate the crash.

In order to ensure the availability of the cluster we would like to determine the conditions that caused the monitor crashes and whether they were indeed related to refstack actions or something entirely different.

ceph-mon.log.tar.gz - monitor logs (410 KB) Yiorgos Stamoulis, 01/31/2017 02:13 PM

ceph-mon-crash-on-delete.txt View (599 KB) Paul Emmerich, 02/14/2018 10:02 PM


Related issues

Duplicated by Ceph - Bug #19824: Reccurance of #18746(Jewel) in (Kraken) Duplicate 05/02/2017
Copied to RADOS - Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) Resolved

Associated revisions

Revision f42a6ba4 (diff)
Added by Paul Emmerich 10 months ago

mon: handle bad snapshot removal reqs gracefully

Snapshot deletion requests on snap ids larger than the snap_seq of
the pool will leave the pool in a state with snap_seq being less
than max(removed_snaps).

This is bad because further deletion requests to a pool in this state
might crash the mon in some cases: the deletion also inserts the new
snap_seq into the removed_snaps set -- which might already exist
in this case and trigger an assert.

Such bad requests will be generated by rbd clients without a fix for
issue #21567.

The change in OSDMonitor prevents pools from getting into this state
and may prevent old broken clients from incorrectly deleting snaps.
The change in osd_types avoids a crash if a pool is already in this
state.

Fixes #18746

Signed-off-by: Paul Emmerich <>
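
As an illustration of the two changes the commit message describes, a minimal, self-contained sketch follows. All names here (Pool, remove_unmanaged_snap, prepare_remove_unmanaged_snap) and the std::set stand-in for interval_set<snapid_t> are illustrative only; this is not the actual patch (see https://github.com/ceph/ceph/pull/20835 for that).

    // Toy Pool stands in for pg_pool_t; std::set stands in for interval_set<snapid_t>.
    #include <cassert>
    #include <cstdint>
    #include <set>

    struct Pool {
      uint64_t snap_seq;
      std::set<uint64_t> removed_snaps;
    };

    // osd_types side: tolerate a pool already in the snap_seq < max(removed_snaps)
    // state by not re-inserting an id that removed_snaps already contains.
    void remove_unmanaged_snap(Pool& p, uint64_t snapid) {
      p.removed_snaps.insert(snapid);
      ++p.snap_seq;
      if (!p.removed_snaps.count(p.snap_seq))  // the overlapping insert the assert caught
        p.removed_snaps.insert(p.snap_seq);
    }

    // OSDMonitor side: refuse snap ids beyond snap_seq so a broken client cannot
    // push the pool into the bad state in the first place.
    bool prepare_remove_unmanaged_snap(Pool& p, uint64_t snapid) {
      if (snapid > p.snap_seq || p.removed_snaps.count(snapid))
        return false;  // reported back to the client instead of crashing the mon
      remove_unmanaged_snap(p, snapid);
      return true;
    }

    int main() {
      Pool p{3, {5}};                                // already in the bad state: 5 > snap_seq
      assert(!prepare_remove_unmanaged_snap(p, 7));  // bad request is rejected
      assert(prepare_remove_unmanaged_snap(p, 2));   // valid delete; snap_seq becomes 4
      remove_unmanaged_snap(p, 3);                   // snap_seq becomes 5: already present, but no crash
    }

The monitor-side check keeps pools out of the snap_seq < max(removed_snaps) state, while the osd_types-side check keeps an already-damaged pool from asserting.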

Revision e2632826 (diff)
Added by Paul Emmerich 9 months ago

mon: handle bad snapshot removal reqs gracefully

Snapshot deletion requests on snap ids larger than the snap_seq of
the pool will leave the pool in a state with snap_seq being less
than max(removed_snaps).

This is bad because further deletion requests to a pool in this state
might crash the mon in some cases: the deletion also inserts the new
snap_seq into the removed_snaps set -- which might already exist
in this case and trigger an assert.

Such bad requests will be generated by rbd clients without a fix for
issue #21567.

The change in OSDMonitor prevents pools from getting into this state
and may prevent old broken clients from incorrectly deleting snaps.
The change in osd_types avoids a crash if a pool is already in this
state.

Fixes #18746

Signed-off-by: Paul Emmerich <>
(cherry picked from commit f42a6ba4c37cb10869718ec296352ae55254b44a)

Conflicts:
src/osd/osd_types.cc

History

#1 Updated by Jason Dillaman over 1 year ago

  • Priority changed from Normal to Urgent

#2 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Category changed from Monitor to Snapshots

Pretty weird; that assert appears to be an internal interval_set consistency check: https://github.com/ceph/ceph/blob/jewel/src/include/interval_set.h#L355

Didn't see in a quick skim how it could fail either.
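
For anyone reading along, a toy version of the invariant that assert protects (not the real header, just the shape of the check): interval_set::insert() requires that the inserted range not overlap anything already in the set, which is exactly what a duplicate removed-snap id would violate.

    #include <cassert>
    #include <cstdint>
    #include <iterator>
    #include <map>

    struct ToyIntervalSet {
      std::map<uint64_t, uint64_t> m;  // start -> length

      void insert(uint64_t start, uint64_t len) {
        auto next = m.lower_bound(start);
        if (next != m.begin()) {
          auto prev = std::prev(next);
          assert(prev->first + prev->second <= start);  // must not overlap the range before
        }
        if (next != m.end())
          assert(start + len <= next->first);  // roughly analogous to "FAILED assert(p->first > start+len)"
        m[start] = len;
      }
    };

    int main() {
      ToyIntervalSet removed_snaps;  // plays the role of pg_pool_t::removed_snaps
      removed_snaps.insert(5, 1);    // snap id 5 recorded as removed
      removed_snaps.insert(5, 1);    // same id inserted again -> the assert fires
    }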

#3 Updated by Yiorgos Stamoulis over 1 year ago

Hi Greg,

Thank you for taking the time to look into this.

Following the incident in this ticket, the cluster was upgraded to Kraken. We then had a further issue (http://tracker.ceph.com/issues/19824).

I believe my colleague Ross was given a hint on IRC that the problem could be stale/inconsistent metadata in a pool. We then recreated the pools and all has been good since. Hope this makes sense.

So, as far as our cluster is concerned, all is good now. However, with the pointer that some bad pool metadata may have triggered this, it may be a good idea to get to the bottom of it in the interest of Ceph stability.

#4 Updated by Greg Farnum over 1 year ago

  • Duplicated by Bug #19824: Reccurance of #18746(Jewel) in (Kraken) added

#5 Updated by Sage Weil over 1 year ago

  • Subject changed from monitors crashing ./include/interval_set.h: 355: FAILED assert(0) to monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel)

#6 Updated by Sage Weil over 1 year ago

  • Status changed from New to Need More Info
  • Priority changed from Urgent to High

#7 Updated by Sage Weil over 1 year ago

The fact that the error stopped when cinder was stopped makes me think this was related to in-flight requests from those services. Maybe two racing pool snap updates were interfering?

#8 Updated by Sage Weil about 1 year ago

  • Subject changed from monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel) to monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)

#9 Updated by Paul Emmerich 11 months ago

I'm seeing this on Luminous. Some kRBD clients are sending "requests of death" that kill the active monitor.
No special config on either mons or clients.

    -7> 2018-02-14 13:23:48.987841 7fc50dd00700  1 -- 10.255.0.30:6789/0 <== client.1621590 10.255.0.106:0/1162398434 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 ==== 65+0+0 (3119176652 0 0) 0x5587d8ec0d80 con 0x5587d8f98800
    -6> 2018-02-14 13:23:48.987849 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x5587d891e580 for client.1621590 10.255.0.106:0/1162398434
    -5> 2018-02-14 13:23:48.987854 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1  caps profile rbd
    -4> 2018-02-14 13:23:48.987856 7fc50dd00700  5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos updating c 4935163..4935904) is_readable = 1 - now=2018-02-14 13:23:48.987856 lease_expire=2018-02-14 13:23:53.929312 has v0 lc 4935904
    -3> 2018-02-14 13:23:48.987864 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 from client.1621590 10.255.0.106:0/1162398434
    -2> 2018-02-14 13:23:48.987879 7fc50dd00700  7 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 from client.1621590 10.255.0.106:0/1162398434
    -1> 2018-02-14 13:23:48.987888 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4
     0> 2018-02-14 13:23:48.991163 7fc50dd00700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7fc50dd00700 time 2018-02-14 13:23:48.987900
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5587cebc0330]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x5587ced0965b]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5587cecf6a1d]
 4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe50) [0x5587ceaeb0f0]
 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x336) [0x5587ceb22b16]
 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x5587ceaaf908]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x52f) [0x5587ce99281f]
 8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x5587ce993e5b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x5587ce9bf923]
 10: (DispatchQueue::entry()+0x792) [0x5587cee6c162]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x5587cec67b6d]
 12: (()+0x7e25) [0x7fc5172e8e25]
 13: (clone()+0x6d) [0x7fc51488334d]

Full log with debug mon 20/20 is attached.

#10 Updated by Paul Emmerich 11 months ago

I've got a cluster here where this issue is 100% reproducible when trying to delete snapshots. Let me know if we can do/test anything to help debug this.

#11 Updated by Paul Emmerich 11 months ago

Still happening on 12.2.4

    -9> 2018-02-27 09:44:28.558861 7f721f786700  1 -- 10.255.0.30:6789/0 <== client.3417026 10.255.0.52:0/2767607806 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 ==== 65+0+0 (3863449434 0 0) 0x55b72b5bb680 con 0x55b72b586000
    -8> 2018-02-27 09:44:28.558868 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x55b72bcf5d80 for client.3417026 10.255.0.52:0/2767607806
    -7> 2018-02-27 09:44:28.558884 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1  caps profile rbd
    -6> 2018-02-27 09:44:28.558887 7f721f786700  5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos writing c 6153015..6153739) is_readable = 1 - now=2018-02-27 09:44:28.558887 lease_expire=2018-02-27 09:44:32.953713 has v0 lc 6153739
    -5> 2018-02-27 09:44:28.558894 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 from client.3417026 10.255.0.52:0/2767607806
    -4> 2018-02-27 09:44:28.558908 7f721f786700  7 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 from client.3417026 10.255.0.52:0/2767607806
    -3> 2018-02-27 09:44:28.558915 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4
    -2> 2018-02-27 09:44:28.562240 7f721f786700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f721f786700 time 2018-02-27 09:44:28.558926
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b7209a0470]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x55b720ae91ab]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x55b720ad65ed]
 4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe40) [0x55b7208cb4a0]
 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x33e) [0x55b720902f0e]
 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55b72088f6a8]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x5af) [0x55b72076fc9f]
 8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55b72077128b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55b72079d3f3]
 10: (DispatchQueue::entry()+0x792) [0x55b720c4fd92]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b720a477fd]
 12: (()+0x7e25) [0x7f722898fe25]
 13: (clone()+0x6d) [0x7f7225b0734d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#12 Updated by Paul Emmerich 11 months ago

To summarize what I've figured out to reproduce this:

  • both rbd client and mon are running 12.2.4, happened with 12.2.2 as well
  • crashes when running "rbd snap rm" on some snapshots
  • seen it happen on two different clusters; multiple different snapshots trigger this
  • things that the clusters and snapshots have in common: the crashing images are based on a non-flattened clone in an erasure-coded pool

The OpenStack test referenced in the original description also seems to be using an image with a parent.

#13 Updated by Paul Emmerich 10 months ago

Well, turns out there were both 12.2.1 and 12.2.4 clients doing snapshot operations. This messed up removed_snaps due to issue #21567 with data_pool handling in the librbd snapshot functionality.

So my bug is different from the issue originally described here, since that issue predates data_pool.

#14 Updated by Paul Emmerich 10 months ago

Anyway, the only place where this can happen is if snap_seq < max(removed_snaps), because the deletion request inserts two elements into removed_snaps but only checks the precondition for one insert. The precondition for the second one will be violated if anyone ever sent a snap delete request for a snap id larger than snap_seq (which shouldn't be allowed).

I've fixed both of these in https://github.com/ceph/ceph/pull/20835
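
To make the failure path concrete, here is a toy walk-through of the double insert described above (same caveat as the earlier sketches: Pool and its members are stand-ins, not Ceph code).

    #include <cassert>
    #include <cstdint>
    #include <set>

    struct Pool {
      uint64_t snap_seq;
      std::set<uint64_t> removed_snaps;  // stand-in for interval_set<snapid_t>

      void remove_unmanaged_snap(uint64_t snapid) {
        removed_snaps.insert(snapid);            // insert #1: validated by the caller
        ++snap_seq;
        assert(!removed_snaps.count(snap_seq));  // insert #2's precondition, never checked upstream
        removed_snaps.insert(snap_seq);
      }
    };

    int main() {
      // A buggy client earlier "deleted" snap id 5 while snap_seq was only 4,
      // leaving the pool with snap_seq < max(removed_snaps).
      Pool p{4, {5}};
      p.remove_unmanaged_snap(2);  // valid-looking delete: snap_seq bumps to 5, which is
                                   // already in removed_snaps -> the second insert asserts
    }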

#15 Updated by Greg Farnum 10 months ago

  • Status changed from Need More Info to Feedback

Under discussion on the PR, which is good on its own terms but suffering from a prior CephFS bug. :(

#16 Updated by Greg Farnum 9 months ago

  • Status changed from Feedback to Need Review

#17 Updated by Sage Weil 9 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#18 Updated by Nathan Cutler 9 months ago

  • Copied to Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) added
