Bug #19824

closed

Recurrence of #18746 (Jewel) in Kraken

Added by Ross Martyn almost 7 years ago. Updated almost 7 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Three months ago our monitors crashed following a run of the OpenStack RefStack tests. That crash appeared to be http://tracker.ceph.com/issues/18746.

Today we hit a very similar issue twice, this time when a user deleted a snapshot of a volume. Since the previous occurrence we have upgraded to Kraken (11.2.0).

We believe this to be very serious, as the only way to recover appeared to be rebooting all OpenStack infrastructure that talks to Ceph (i.e. all hypervisors and cinder-volume instances) and then restarting the monitors one by one. Far from ideal!

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (()+0x6e77c2) [0x560058c2e7c2]
2: (()+0x10330) [0x7fc36c677330]
3: (gsignal()+0x37) [0x7fc36b322c37]
4: (abort()+0x148) [0x7fc36b326028]
5: (interval_set<snapid_t>::insert(snapid_t, snapid_t, snapid_t*, snapid_t*)+0x342) [0x560058b23a72]
6: (pg_pool_t::remove_unmanaged_snap(snapid_t)+0x4d) [0x560058b18a7d]
7: (OSDMonitor::prepare_pool_op(std::shared_ptr<MonOpRequest>)+0xde9) [0x5600588d1639]
8: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x32f) [0x5600588f7ccf]
9: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xf02) [0x5600588a72e2]
10: (PaxosService::C_RetryMessage::_finish(int)+0x54) [0x5600588a8c74]
11: (C_MonOp::finish(int)+0x69) [0x5600588731f9]
12: (Context::complete(int)+0x9) [0x560058872369]
13: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x560058879604]
14: (Paxos::finish_round()+0x10b) [0x56005889cc6b]
15: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xf86) [0x56005889e1b6]
16: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x2e4) [0x56005889eb64]
17: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xd19) [0x56005886cdb9]
18: (Monitor::_ms_dispatch(Message*)+0x6a1) [0x56005886d711]
19: (Monitor::ms_dispatch(Message*)+0x23) [0x56005888f2a3]
20: (DispatchQueue::entry()+0x793) [0x560058be9f63]
21: (DispatchQueue::DispatchThread::entry()+0xd) [0x560058aad8ad]
22: (()+0x8184) [0x7fc36c66f184]
23: (clone()+0x6d) [0x7fc36b3e9bed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
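
For readers less familiar with the trace, frames 5 and 6 suggest the abort happens while the snapshot ID is being recorded in an interval set of removed snaps. The following is a toy Python sketch of that failure mode, not Ceph's code: the class `IntervalSet` and the helper `remove_unmanaged_snap` are hypothetical stand-ins, and it assumes the real `interval_set` refuses to insert a value it already contains, so a second delete of the same snapshot would trip the assert.

```python
# Toy model only. Names mirror the stack trace but this is NOT Ceph's
# implementation; the "abort on duplicate insert" behaviour is an assumption
# drawn from frames 4-6 of the trace.
class IntervalSet:
    """Integers stored as disjoint half-open runs [start, start + length)."""

    def __init__(self):
        self._runs = {}  # start -> length

    def runs(self):
        """Return the stored runs as a sorted list of (start, length)."""
        return sorted(self._runs.items())

    def insert(self, start, length=1):
        # Refuse any range that overlaps an existing run.
        for s, n in self._runs.items():
            if start < s + n and s < start + length:
                raise AssertionError(
                    f"[{start}, {start + length}) overlaps [{s}, {s + n})"
                )
        self._runs[start] = length
        self._merge_adjacent()

    def _merge_adjacent(self):
        # Coalesce runs that touch, e.g. [4,5) and [5,6) become [4,6).
        merged = {}
        prev = None
        for s in sorted(self._runs):
            n = self._runs[s]
            if prev is not None and prev + merged[prev] == s:
                merged[prev] += n
            else:
                merged[s] = n
                prev = s
        self._runs = merged


def remove_unmanaged_snap(removed_snaps, snap_id):
    """Hypothetical stand-in: mark one snapshot ID as removed."""
    removed_snaps.insert(snap_id)


if __name__ == "__main__":
    removed = IntervalSet()
    remove_unmanaged_snap(removed, 4)
    remove_unmanaged_snap(removed, 5)
    print(removed.runs())
    try:
        remove_unmanaged_snap(removed, 5)  # same snapshot deleted again
    except AssertionError as exc:
        print("would abort the monitor:", exc)
```

In this model the first delete of each snapshot succeeds and only the repeated delete fails, which would be consistent with a retried or duplicated delete request reaching the monitors. Whether that is what happened here is not established by the logs.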

Logs attached:

  • Cinder Logs - Unfortunately not set to debug at the time.
  • Monitor Log - This is very long, but a lot happens in a few seconds!

Comparing the API log with the monitor log, the API call entry appears to be the request that killed the monitors (11:27:43).

There is no reference to this particular action in the Cinder Volume log.


For readers unfamiliar with OpenStack, the call path is: Cinder API call -> Cinder Volume -> Ceph RBD image/snapshot 'Delete' call.


To complicate matters slightly, during this fact-finding we discovered that pinned packages on our Cinder controllers were holding ceph-common and associated components at Jewel. We are unsure whether this mismatch is partly responsible, and are upgrading these now.
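
A client/cluster release mismatch like this could be spotted by comparing version banners. The sketch below is illustrative only: the helper names `release_of` and `mismatched` are hypothetical, the banner format is taken from the `ceph version 11.2.0 (...)` line in the trace above, and the Jewel banner in the usage example is a made-up placeholder rather than output from our controllers.

```python
import re

# Major version -> release name. Only the two releases discussed in this
# report are listed.
RELEASES = {10: "jewel", 11: "kraken"}


def release_of(version_line):
    """Map a 'ceph version X.Y.Z (...)' banner to a release name."""
    match = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError(f"not a ceph version banner: {version_line!r}")
    return RELEASES.get(int(match.group(1)), "unknown")


def mismatched(client_line, cluster_line):
    """True when client and cluster banners belong to different releases."""
    return release_of(client_line) != release_of(cluster_line)


if __name__ == "__main__":
    cluster = "ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)"
    client = "ceph version 10.2.0 (placeholder)"  # hypothetical Jewel client
    print(release_of(cluster), release_of(client), mismatched(client, cluster))
```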


Files

ceph-mon-1 (525 KB) ceph-mon-1 Ross Martyn, 05/02/2017 02:51 PM
Cinder-Vol (5.26 KB) Cinder-Vol Ross Martyn, 05/02/2017 02:51 PM

Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #18746: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) - Resolved - 01/30/2017

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Is duplicate of Bug #18746: monitors crashing ./include/interval_set.h: 355: FAILED assert(0) (jewel+kraken) added
Actions #2

Updated by Greg Farnum almost 7 years ago

  • Status changed from New to Duplicate