Bug #18746
Status: Closed
monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)
Description
Afternoon! It would be great if anyone could shed any light on a pretty serious issue we had last week.
Essentially, we had 2 out of 3 monitors of a cluster fail within seconds of each other (extract from ceph-mon.monitor-2.log shown below; see attachments for more details), leaving the third monitor unable to reach quorum and bringing our Ceph cluster to a grinding halt:
2017-01-27 15:58:37.309514 7f82ff93f700 -1 ./include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f82ff93f700 time 2017-01-27 15:58:37.305538
./include/interval_set.h: 355: FAILED assert(0)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5558c6244bfb]
2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x35c) [0x5558c6309c3c]
3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5558c62ffa2d]
4: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xe34) [0x5558c5f39864]
5: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x28f) [0x5558c5f5b8bf]
6: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xdab) [0x5558c5f0b24b]
7: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5558c5f0e824]
8: (C_MonOp::finish(int)+0x69) [0x5558c5edb539]
9: (Context::complete(int)+0x9) [0x5558c5eda6d9]
10: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x5558c5ee0934]
11: (Paxos::finish_round()+0x10b) [0x5558c5f0306b]
12: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xee4) [0x5558c5f044e4]
13: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x5558c5f04e74]
14: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xb75) [0x5558c5ed5b05]
15: (Monitor::_ms_dispatch(Message*)+0x6c1) [0x5558c5ed65f1]
16: (Monitor::ms_dispatch(Message*)+0x23) [0x5558c5ef5873]
17: (DispatchQueue::entry()+0x78b) [0x5558c632d58b]
18: (DispatchQueue::DispatchThread::entry()+0xd) [0x5558c622a68d]
19: (()+0x8184) [0x7f830881e184]
20: (clone()+0x6d) [0x7f8306b7037d]
This is a cluster supporting OpenStack (cinder & glance, Liberty release), currently under testing.
Standard practice at our company is to run refstack (https://wiki.openstack.org/wiki/RefStack), a tool that tests OpenStack functionality. We believe the test tempest.api.volume.test_volumes_get.VolumesV2GetTest.test_volume_create_get_update_delete_as_clone[id-3f591b4a-7dc6-444c-bd51-77469506b3a1] (https://github.com/openstack/tempest/blob/master/tempest/api/volume/test_volumes_get.py) triggered the unexpected behaviour in Ceph.
Subsequent restarts of the Ceph monitors failed until we stopped the cinder & nova services on the OpenStack cluster; after that, both clusters were able to recover.
We have tried, but been unable, to replicate the crash.
To ensure the availability of the cluster, we would like to determine the conditions that caused the monitor crashes, and whether they were indeed related to refstack activity or to something else entirely.
Files