Bug #18746

monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)

Added by Yiorgos Stamoulis about 7 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Snapshots
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Afternoon! It would be great if anyone could shed any light on a pretty serious issue we had last week.

Essentially, we had 2 out of 3 monitors of a cluster fail within seconds of each other with the assertion below (extract from ceph-mon.monitor-2.log shown; see attachments for more details), causing the third monitor to be unable to reach quorum and leaving our Ceph cluster to grind to a halt!


2017-01-27 15:58:37.309514 7f82ff93f700 -1 ./include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f82ff93f700 time 2017-01-27 15:58:37.305538
./include/interval_set.h: 355: FAILED assert(0)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5558c6244bfb]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x35c) [0x5558c6309c3c]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5558c62ffa2d]
 4: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xe34) [0x5558c5f39864]
 5: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x28f) [0x5558c5f5b8bf]
 6: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xdab) [0x5558c5f0b24b]
 7: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5558c5f0e824]
 8: (C_MonOp::finish(int)+0x69) [0x5558c5edb539]
 9: (Context::complete(int)+0x9) [0x5558c5eda6d9]
 10: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x5558c5ee0934]
 11: (Paxos::finish_round()+0x10b) [0x5558c5f0306b]
 12: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xee4) [0x5558c5f044e4]
 13: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x5558c5f04e74]
 14: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xb75) [0x5558c5ed5b05]
 15: (Monitor::_ms_dispatch(Message*)+0x6c1) [0x5558c5ed65f1]
 16: (Monitor::ms_dispatch(Message*)+0x23) [0x5558c5ef5873]
 17: (DispatchQueue::entry()+0x78b) [0x5558c632d58b]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x5558c622a68d]
 19: (()+0x8184) [0x7f830881e184]
 20: (clone()+0x6d) [0x7f8306b7037d]

This is a cluster supporting OpenStack (Cinder & Glance, Liberty release), currently under testing.

Standard practice at our company is to run refstack (https://wiki.openstack.org/wiki/RefStack), a tool that tests OpenStack functionality. We believe the test tempest.api.volume.test_volumes_get.VolumesV2GetTest.test_volume_create_get_update_delete_as_clone[id-3f591b4a-7dc6-444c-bd51-77469506b3a1] (https://github.com/openstack/tempest/blob/master/tempest/api/volume/test_volumes_get.py) triggered the unexpected response from Ceph.

Subsequent restarts of the Ceph monitors failed until we stopped the Cinder & Nova services on the OpenStack cluster. After that, both clusters were able to recover.

We have tried, but were unable, to replicate the crash.

In order to ensure the availability of the cluster, we would like to determine the conditions that caused the monitor crashes, and whether they were indeed related to the refstack activity or to something else entirely.


Files

ceph-mon.log.tar.gz (410 KB) ceph-mon.log.tar.gz monitor logs Yiorgos Stamoulis, 01/31/2017 02:13 PM
ceph-mon-crash-on-delete.txt (599 KB) ceph-mon-crash-on-delete.txt Paul Emmerich, 02/14/2018 10:02 PM

Related issues: 2 (0 open, 2 closed)

Has duplicate: Ceph - Bug #19824: Reccurance of #18746 (Jewel) in (Kraken) (Duplicate, 05/02/2017)

Copied to: RADOS - Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) (Resolved)
Actions #1

Updated by Jason Dillaman almost 7 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category changed from Monitor to Snapshots

Pretty weird, that assert appears to be an internal interval_set consistency thing: https://github.com/ceph/ceph/blob/jewel/src/include/interval_set.h#L355

Didn't see in a quick skim how it could fail either.
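
For reference, the assert enforces interval_set's internal invariant that the stored intervals stay disjoint: inserting a range that overlaps an interval already present trips it. Below is a minimal standalone sketch of that invariant (a simplified model, not the actual src/include/interval_set.h code; the two asserts only roughly correspond to the jewel line-355 assert(0) and the luminous line-396 check seen later in this ticket):

    // Simplified model of interval_set<snapid_t>; not the real Ceph implementation.
    #include <cassert>
    #include <cstdint>
    #include <iterator>
    #include <map>

    struct ToyIntervalSet {
      std::map<uint64_t, uint64_t> m;   // interval start -> length; intervals stay disjoint

      void insert(uint64_t start, uint64_t len) {
        auto p = m.lower_bound(start);  // first interval starting at or after 'start'
        if (p != m.begin()) {
          auto prev = std::prev(p);
          // the previous interval must end at or before 'start'
          assert(prev->first + prev->second <= start);  // roughly the jewel line-355 case
        }
        if (p != m.end()) {
          // the next interval must begin strictly after the new range ends
          assert(p->first > start + len);               // roughly the luminous line-396 check
        }
        m[start] = len;
      }
    };

    int main() {
      ToyIntervalSet removed_snaps;
      removed_snaps.insert(10, 1);  // snap 10 marked removed
      removed_snaps.insert(10, 1);  // removing the same snap again overlaps -> assert fires
    }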

Actions #3

Updated by Yiorgos Stamoulis almost 7 years ago

Hi Greg,

Thank you for taking the time to look into this.

Following the incident described in the present ticket, the cluster was upgraded to Kraken. We then had a further issue (http://tracker.ceph.com/issues/19824).

I believe my colleague Ross was given a hint in IRC that the problem could be stale/inconsistent metadata in a pool. We then recreated the pools and all has been good since. Hope this makes sense.

So, as far as our cluster is concerned, all is good now. However, given the pointer that some bad pool metadata may have triggered this, it may be a good idea to get to the bottom of it in the interest of Ceph stability.

Actions #4

Updated by Greg Farnum almost 7 years ago

  • Has duplicate Bug #19824: Reccurance of #18746(Jewel) in (Kraken) added
Actions #5

Updated by Sage Weil almost 7 years ago

  • Subject changed from monitors crashing ./include/interval_set.h: 355: FAILED assert(0) to monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel)
Actions #6

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Need More Info
  • Priority changed from Urgent to High
Actions #7

Updated by Sage Weil almost 7 years ago

The fact that the error stopped when cinder was stopped makes me think this was related to in-flight requests from those services. Maybe two racing pool snap updates were interfering?

Actions #8

Updated by Sage Weil over 6 years ago

  • Subject changed from monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel) to monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)
Actions #9

Updated by Paul Emmerich about 6 years ago

I'm seeing this on Luminous. Some kRBD clients are sending requests of death killing the active monitor.
No special config on either mons or clients.

    -7> 2018-02-14 13:23:48.987841 7fc50dd00700  1 -- 10.255.0.30:6789/0 <== client.1621590 10.255.0.106:0/1162398434 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 ==== 65+0+0 (3119176652 0 0) 0x5587d8ec0d80 con 0x5587d8f98800
    -6> 2018-02-14 13:23:48.987849 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x5587d891e580 for client.1621590 10.255.0.106:0/1162398434
    -5> 2018-02-14 13:23:48.987854 7fc50dd00700 20 mon.ceph-dal-mon-a1@0(leader) e1  caps profile rbd
    -4> 2018-02-14 13:23:48.987856 7fc50dd00700  5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos updating c 4935163..4935904) is_readable = 1 - now=2018-02-14 13:23:48.987856 lease_expire=2018-02-14 13:23:53.929312 has v0 lc 4935904
    -3> 2018-02-14 13:23:48.987864 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 from client.1621590 10.255.0.106:0/1162398434
    -2> 2018-02-14 13:23:48.987879 7fc50dd00700  7 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4 from client.1621590 10.255.0.106:0/1162398434
    -1> 2018-02-14 13:23:48.987888 7fc50dd00700 10 mon.ceph-dal-mon-a1@0(leader).osd e19702 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 28 name  v0) v4
     0> 2018-02-14 13:23:48.991163 7fc50dd00700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7fc50dd00700 time 2018-02-14 13:23:48.987900
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5587cebc0330]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x5587ced0965b]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5587cecf6a1d]
 4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe50) [0x5587ceaeb0f0]
 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x336) [0x5587ceb22b16]
 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x5587ceaaf908]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x52f) [0x5587ce99281f]
 8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x5587ce993e5b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x5587ce9bf923]
 10: (DispatchQueue::entry()+0x792) [0x5587cee6c162]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x5587cec67b6d]
 12: (()+0x7e25) [0x7fc5172e8e25]
 13: (clone()+0x6d) [0x7fc51488334d]

Full log with debug mon 20/20 is attached.

Actions #10

Updated by Paul Emmerich about 6 years ago

I've got a cluster here where this issue is 100% reproducible when trying to delete snapshots. Let me know if we can do/test anything to help debug this.

Actions #11

Updated by Paul Emmerich about 6 years ago

Still happening on 12.2.4

    -9> 2018-02-27 09:44:28.558861 7f721f786700  1 -- 10.255.0.30:6789/0 <== client.3417026 10.255.0.52:0/2767607806 5 ==== pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 ==== 65+0+0 (3863449434 0 0) 0x55b72b5bb680 con 0x55b72b586000
    -8> 2018-02-27 09:44:28.558868 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1 _ms_dispatch existing session 0x55b72bcf5d80 for client.3417026 10.255.0.52:0/2767607806
    -7> 2018-02-27 09:44:28.558884 7f721f786700 20 mon.ceph-dal-mon-a1@0(leader) e1  caps profile rbd
    -6> 2018-02-27 09:44:28.558887 7f721f786700  5 mon.ceph-dal-mon-a1@0(leader).paxos(paxos writing c 6153015..6153739) is_readable = 1 - now=2018-02-27 09:44:28.558887 lease_expire=2018-02-27 09:44:32.953713 has v0 lc 6153739
    -5> 2018-02-27 09:44:28.558894 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 preprocess_query pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 from client.3417026 10.255.0.52:0/2767607806
    -4> 2018-02-27 09:44:28.558908 7f721f786700  7 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_update pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4 from client.3417026 10.255.0.52:0/2767607806
    -3> 2018-02-27 09:44:28.558915 7f721f786700 10 mon.ceph-dal-mon-a1@0(leader).osd e24190 prepare_pool_op pool_op(delete unmanaged snap pool 6 auid 0 tid 17 name  v0) v4
    -2> 2018-02-27 09:44:28.562240 7f721f786700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f721f786700 time 2018-02-27 09:44:28.558926
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/include/interval_set.h: 396: FAILED assert(p->first > start+len)

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55b7209a0470]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x32b) [0x55b720ae91ab]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x55b720ad65ed]
 4: (OSDMonitor::prepare_pool_op(boost::intrusive_ptr<MonOpRequest>)+0xe40) [0x55b7208cb4a0]
 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x33e) [0x55b720902f0e]
 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55b72088f6a8]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x5af) [0x55b72076fc9f]
 8: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55b72077128b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55b72079d3f3]
 10: (DispatchQueue::entry()+0x792) [0x55b720c4fd92]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55b720a477fd]
 12: (()+0x7e25) [0x7f722898fe25]
 13: (clone()+0x6d) [0x7f7225b0734d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #12

Updated by Paul Emmerich about 6 years ago

To summarize what I've figured out to reproduce this:

  • both rbd client and mon are running 12.2.4, happened with 12.2.2 as well
  • crashes when running "rbd snap rm" on some snapshots
  • seen it happen on two different clusters, multiple different snapshots trigger this
  • things that the clusters and snapshots have in common: the crashing images are based on a non-flattened clone in an erasure-coded pool

The OpenStack test referenced in the original description also seems to be using an image with a parent.
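
One way to exercise the same monitor code path without going through rbd is the librados self-managed snapshot API, which is what the pool_op(delete unmanaged snap ...) messages in the logs above correspond to. A minimal client sketch, assuming the standard librados C++ API; the pool name and snap id are made up for illustration, not taken from this ticket:

    // Hypothetical librados client sketch: sends a "delete unmanaged snap" pool op,
    // which the leader mon handles in OSDMonitor::prepare_pool_op().
    // Pool name and snap id below are made up for illustration.
    #include <rados/librados.hpp>
    #include <cstdint>
    #include <iostream>

    int main() {
      librados::Rados cluster;
      cluster.init(nullptr);             // default client id
      cluster.conf_read_file(nullptr);   // default ceph.conf search path
      if (cluster.connect() < 0) {
        std::cerr << "failed to connect to cluster" << std::endl;
        return 1;
      }

      librados::IoCtx io;
      if (cluster.ioctx_create("rbd", io) < 0) {
        std::cerr << "failed to open pool" << std::endl;
        return 1;
      }

      uint64_t snap_id = 1234;           // hypothetical self-managed snap id
      int r = io.selfmanaged_snap_remove(snap_id);
      std::cout << "selfmanaged_snap_remove returned " << r << std::endl;

      io.close();
      cluster.shutdown();
      return 0;
    }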

Actions #13

Updated by Paul Emmerich about 6 years ago

Well, turns out there were both 12.2.1 and 12.2.4 clients doing snapshot operations. This messed up removed_snaps due to issue #21567 with data_pool handling in the librbd snapshot functionality.

So my bug is different from the issue originally described here, since that issue predates data_pool.

Actions #14

Updated by Paul Emmerich about 6 years ago

Anyway, the only place where this can happen is if snap_seq < max(removed_snaps), because the deletion request inserts two elements into removed_snaps but only checks the precondition for one of the inserts. The precondition for the second one will be violated if anyone ever sent a snap delete request for a snap id larger than snap_seq (which shouldn't be allowed).

I've fixed both of these in https://github.com/ceph/ceph/pull/20835
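
To make that failure mode concrete, here is a hedged sketch of the mechanism (names mirror pg_pool_t / OSDMonitor, but this is a simplified model with a plain std::set standing in for interval_set, not the actual Ceph code):

    // Sketch of the double insert described above: the mon validates only the snap
    // id being deleted, but the removal also inserts the bumped snap_seq, and that
    // second insert asserts if the id is already present in removed_snaps.
    #include <cassert>
    #include <cstdint>
    #include <set>

    struct ToyPool {
      uint64_t snap_seq = 0;
      std::set<uint64_t> removed_snaps;  // stands in for interval_set<snapid_t>

      // models interval_set::insert(): inserting an id that is already present
      // corresponds to the overlapping insert that trips the FAILED assert
      void insert_removed(uint64_t id) {
        bool inserted = removed_snaps.insert(id).second;
        assert(inserted && "overlapping insert -> interval_set.h FAILED assert");
      }

      // models the unmanaged-snap deletion: two inserts, one snap_seq bump
      void remove_unmanaged_snap(uint64_t s) {
        insert_removed(s);         // the mon checks the precondition for this insert...
        snap_seq = snap_seq + 1;
        insert_removed(snap_seq);  // ...but not for this one
      }
    };

    int main() {
      ToyPool p;
      // Suppose a client once removed a snap id larger than snap_seq (which shouldn't
      // be allowed), so now snap_seq < max(removed_snaps):
      p.removed_snaps.insert(5);
      p.snap_seq = 4;

      // An otherwise ordinary deletion now crashes: snap 2 itself is fine, but the
      // bumped snap_seq (5) is already in removed_snaps, so the second insert asserts.
      p.remove_unmanaged_snap(2);
    }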

Actions #15

Updated by Greg Farnum about 6 years ago

  • Status changed from Need More Info to 4

Under discussion on the PR, which is good on its own terms but suffering from a prior CephFS bug. :(

Actions #16

Updated by Greg Farnum about 6 years ago

  • Status changed from 4 to Fix Under Review
Actions #17

Updated by Sage Weil almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to luminous
Actions #18

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) added
Actions #19

Updated by Samuel Just over 4 years ago

  • Status changed from Pending Backport to Resolved

Already backported to luminous.
