Bug #18746
Status: Closed
monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken)
Description
Afternoon! It would be great if anyone could shed any light on a pretty serious issue we had last week.
Essentially, we had 2 out of 3 monitors of a cluster fail within seconds of each other (extract from ceph-mon.monitor-2.log shown below; see attachments for more details), leaving the third monitor unable to reach quorum and bringing our Ceph cluster to a grinding halt:
2017-01-27 15:58:37.309514 7f82ff93f700 -1 ./include/interval_set.h: In function 'void interval_set<T>::insert(T, T, T*, T*) [with T = snapid_t]' thread 7f82ff93f700 time 2017-01-27 15:58:37.305538
./include/interval_set.h: 355: FAILED assert(0)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5558c6244bfb]
2: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x35c) [0x5558c6309c3c]
3: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x5558c62ffa2d]
4: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xe34) [0x5558c5f39864]
5: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x28f) [0x5558c5f5b8bf]
6: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xdab) [0x5558c5f0b24b]
7: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5558c5f0e824]
8: (C_MonOp::finish(int)+0x69) [0x5558c5edb539]
9: (Context::complete(int)+0x9) [0x5558c5eda6d9]
10: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x5558c5ee0934]
11: (Paxos::finish_round()+0x10b) [0x5558c5f0306b]
12: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xee4) [0x5558c5f044e4]
13: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x5558c5f04e74]
14: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xb75) [0x5558c5ed5b05]
15: (Monitor::_ms_dispatch(Message*)+0x6c1) [0x5558c5ed65f1]
16: (Monitor::ms_dispatch(Message*)+0x23) [0x5558c5ef5873]
17: (DispatchQueue::entry()+0x78b) [0x5558c632d58b]
18: (DispatchQueue::DispatchThread::entry()+0xd) [0x5558c622a68d]
19: (()+0x8184) [0x7f830881e184]
20: (clone()+0x6d) [0x7f8306b7037d]
This is a cluster supporting OpenStack (cinder & glance, Liberty release), currently under testing.
Standard practice at our company is to run refstack (https://wiki.openstack.org/wiki/RefStack), a tool that tests OpenStack functionality. We believe the test tempest.api.volume.test_volumes_get.VolumesV2GetTest.test_volume_create_get_update_delete_as_clone[id-3f591b4a-7dc6-444c-bd51-77469506b3a1] (https://github.com/openstack/tempest/blob/master/tempest/api/volume/test_volumes_get.py) triggered the unexpected behaviour in Ceph.
Subsequent restarts of the Ceph monitors failed until we stopped the cinder & nova services on the OpenStack cluster; after that, both clusters were able to recover.
We have tried, but been unable, to replicate the crash.
To ensure the availability of the cluster, we would like to determine the conditions that caused the monitor crashes, and whether they were indeed related to refstack activity or to something else entirely.
Files