Bug #18746 (closed)

monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)

Added by Yiorgos Stamoulis over 7 years ago. Updated over 4 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Snapshots
Target version: -
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Afternoon! It would be great if anyone could shed any light on a pretty serious issue we had last week.

Essentially, we had 2 out of 3 monitors of a cluster fail within seconds of each other with the assert below (extract from ceph-mon.monitor-2.log shown; see attachments for more details), leaving the third monitor unable to reach quorum and causing our Ceph cluster to grind to a halt!


2017-01-27 15:58:37.309514 7f82ff93f700 -1 ./include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f82ff93f700 time 2017-01-27 15:58:37.305538
./include/interval_set.h: 355: FAILED assert(0)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5558c6244bfb]
 2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x35c) [0x5558c6309c3c]
 3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5558c62ffa2d]
 4: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xe34) [0x5558c5f39864]
 5: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x28f) [0x5558c5f5b8bf]
 6: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xdab) [0x5558c5f0b24b]
 7: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5558c5f0e824]
 8: (C_MonOp::finish(int)+0x69) [0x5558c5edb539]
 9: (Context::complete(int)+0x9) [0x5558c5eda6d9]
 10: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x5558c5ee0934]
 11: (Paxos::finish_round()+0x10b) [0x5558c5f0306b]
 12: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xee4) [0x5558c5f044e4]
 13: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x5558c5f04e74]
 14: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xb75) [0x5558c5ed5b05]
 15: (Monitor::_ms_dispatch(Message*)+0x6c1) [0x5558c5ed65f1]
 16: (Monitor::ms_dispatch(Message*)+0x23) [0x5558c5ef5873]
 17: (DispatchQueue::entry()+0x78b) [0x5558c632d58b]
 18: (DispatchQueue::DispatchThread::entry()+0xd) [0x5558c622a68d]
 19: (()+0x8184) [0x7f830881e184]
 20: (clone()+0x6d) [0x7f8306b7037d]
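
For anyone triaging this: as far as we can tell from the trace, the assert fires inside interval_set<snapid_t>::insert() while pg_pool_t::remove_unmanaged_snap() is recording the snap id in the pool's removed_snaps set. Below is a simplified sketch of the invariant as we understand it (a toy stand-in, not the actual Ceph code): inserting a range that overlaps something already in the set is treated as a caller bug and asserts, which would match a request to remove a snap id that is already recorded as removed.

 // Toy stand-in for interval_set<snapid_t> (NOT the Ceph implementation),
 // only to illustrate the non-overlap invariant behind the failed assert.
 #include <cassert>
 #include <map>

 template <typename T>
 struct toy_interval_set {
   std::map<T, T> m;  // start -> length, kept non-overlapping

   bool contains(T v) const {
     auto it = m.upper_bound(v);           // first interval starting after v
     if (it == m.begin()) return false;
     --it;                                 // candidate interval covering v
     return v >= it->first && v < it->first + it->second;
   }

   void insert(T start, T len) {
     // Simplified overlap check; the real interval_set.h:355 assert(0)
     // likewise fires when an inserted range collides with an existing one.
     assert(!contains(start) && !contains(start + len - 1));
     m[start] = len;                       // (real code also merges neighbours)
   }
 };

 int main() {
   toy_interval_set<unsigned long> removed_snaps;  // unsigned long as snapid_t stand-in
   removed_snaps.insert(4, 1);   // first removal of snap id 4: fine
   removed_snaps.insert(4, 1);   // removing the same id again: assert fires
 }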

This is a cluster supporting OpenStack (Cinder & Glance, Liberty release), currently under testing.

Standard practice at our company is to run refstack (https://wiki.openstack.org/wiki/RefStack), a tool that tests OpenStack functionality, and we believe the test tempest.api.volume.test_volumes_get.VolumesV2GetTest.test_volume_create_get_update_delete_as_clone[id-3f591b4a-7dc6-444c-bd51-77469506b3a1] (https://github.com/openstack/tempest/blob/master/tempest/api/volume/test_volumes_get.py) triggered the unexpected response from Ceph.
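
For context (our own speculation, not confirmed): the volume clone and delete operations in that test end up, via RBD, issuing self-managed ("unmanaged") pool snapshot create/remove requests to the monitors, which is the OSDMonitor::prepare_pool_op() -> pg_pool_t::remove_unmanaged_snap() path visible in the backtrace. A minimal librados sketch of that call sequence is below; the pool name and client id are placeholders and this is not a confirmed reproducer.

 // Hedged sketch only: exercises the self-managed snapshot pool ops that we
 // believe the monitor was handling when it asserted.  Pool "volumes-test"
 // and client "admin" are placeholders for illustration.
 #include <rados/librados.hpp>
 #include <cstdint>
 #include <iostream>

 int main() {
   librados::Rados rados;
   if (rados.init("admin") < 0) return 1;         // connect as client.admin
   rados.conf_read_file(nullptr);                 // default ceph.conf locations
   if (rados.connect() < 0) return 1;

   librados::IoCtx ioctx;
   if (rados.ioctx_create("volumes-test", ioctx) < 0) return 1;

   uint64_t snapid = 0;
   ioctx.selfmanaged_snap_create(&snapid);        // allocate an unmanaged snap id
   std::cout << "created snap id " << snapid << std::endl;

   int r1 = ioctx.selfmanaged_snap_remove(snapid);  // normal removal
   int r2 = ioctx.selfmanaged_snap_remove(snapid);  // duplicate removal attempt
   std::cout << "first remove: " << r1 << ", second remove: " << r2 << std::endl;

   rados.shutdown();
   return 0;
 }

If a duplicate remove for the same snap id were ever to reach the monitor after the first had been committed, remove_unmanaged_snap() would be asked to insert an id that is already in removed_snaps, which would match the assert above; we have not been able to confirm that this is what actually happened.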

Subsequent restarts of the Ceph monitors failed until we stopped the Cinder & Nova services on the OpenStack cluster. After that, both clusters were able to recover.

We have tried, but were unable, to replicate the crash.

In order to ensure the availability of the cluster we would like to determine the conditions that caused the monitor crashes and whether they were indeed related to refstack actions or something entirely different.


Files

ceph-mon.log.tar.gz (410 KB) - monitor logs - Yiorgos Stamoulis, 01/31/2017 02:13 PM
ceph-mon-crash-on-delete.txt (599 KB) - Paul Emmerich, 02/14/2018 10:02 PM

Related issues 2 (0 open, 2 closed)

Has duplicate: Ceph - Bug #19824: Reccurance of #18746 (Jewel) in (Kraken) (Duplicate, 05/02/2017)

Copied to: RADOS - Backport #23915: luminous: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) (Resolved)
