Bug #16525

mon crash: crush/CrushWrapper.h: 940: FAILED assert(successful_detach)

Added by George Shuklin 10 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Monitor
Target version: -
Start date: 06/29/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel, hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release: jewel
Needs Doc: No

Description

All Ceph mons crashed simultaneously when I tried to move a host with a working OSD from one root to another.

My command was:

ceph osd crush move pp1 root=fast2500

Trace:

```
    -9> 2016-06-29 14:10:30.337919 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.337919, event: callback finished, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -8> 2016-06-29 14:10:30.337924 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.337924, event: psvc:dispatch, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -7> 2016-06-29 14:10:30.337927 7f6e3951d700 5 mon.pp5@0(leader).paxos(paxos active c 2446247..2446981) is_readable = 1 - now=2016-06-29 14:10:30.337928 lease_expire=2016-06-29 14:10:35.337162 has v0 lc 2446981
    -6> 2016-06-29 14:10:30.337950 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.337950, event: osdmap:preprocess_query, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -5> 2016-06-29 14:10:30.337956 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.337956, event: osdmap:preprocess_command, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -4> 2016-06-29 14:10:30.338007 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.338007, event: osdmap:prepare_update, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -3> 2016-06-29 14:10:30.338016 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.338015, event: osdmap:prepare_command, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -2> 2016-06-29 14:10:30.338039 7f6e3951d700 5 -- op tracker -- seq: 11, time: 2016-06-29 14:10:30.338036, event: osdmap:prepare_command_impl, op: mon_command({"prefix": "osd crush move", "args": ["root=fast2500"], "name": "pp1"} v 0)
    -1> 2016-06-29 14:10:30.338052 7f6e3951d700 0 mon.pp5@0(leader).osd e10230 moving crush item name 'pp1' to location {root=fast2500}
     0> 2016-06-29 14:10:30.341861 7f6e3951d700 -1 crush/CrushWrapper.h: In function 'int CrushWrapper::detach_bucket(CephContext*, int)' thread 7f6e3951d700 time 2016-06-29 14:10:30.338135
crush/CrushWrapper.h: 940: FAILED assert(successful_detach)

ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55cda5fb9fa0]
2: (()+0x560833) [0x55cda5ec8833]
3: (CrushWrapper::move_bucket(CephContext*, int, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0xda) [0x55cda5ec644a]
4: (OSDMonitor::prepare_command_impl(std::shared_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >&)+0x2cfe) [0x55cda5c8701e]
5: (OSDMonitor::prepare_command(std::shared_ptr<MonOpRequest>)+0x2ff) [0x55cda5c9903f]
6: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x24b) [0x55cda5c9958b]
7: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xb4f) [0x55cda5c4c0af]
8: (PaxosService::C_RetryMessage::_finish(int)+0x58) [0x55cda5c4d698]
9: (C_MonOp::finish(int)+0x82) [0x55cda5c15862]
10: (Context::complete(int)+0x9) [0x55cda5c14949]
11: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x1fb) [0x55cda5c1b25b]
12: (Paxos::finish_round()+0x287) [0x55cda5c41b17]
13: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xe19) [0x55cda5c42cf9]
14: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x250) [0x55cda5c43520]
15: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xa38) [0x55cda5c0ee68]
16: (Monitor::_ms_dispatch(Message*)+0x554) [0x55cda5c0f664]
17: (Monitor::ms_dispatch(Message*)+0x23) [0x55cda5c326f3]
18: (DispatchQueue::entry()+0xf2b) [0x55cda60aedfb]
19: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cda5fa032d]
20: (()+0x76fa) [0x7f6e4165a6fa]
21: (clone()+0x6d) [0x7f6e3f916b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.pp5.log
--- end dump of recent events ---
2016-06-29 14:10:30.346791 7f6e3951d700 -1 *** Caught signal (Aborted) **
in thread 7f6e3951d700 thread_name:ms_dispatch

ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
1: (()+0x5233be) [0x55cda5e8b3be]
2: (()+0x113d0) [0x7f6e416643d0]
3: (gsignal()+0x38) [0x7f6e3f845418]
4: (abort()+0x16a) [0x7f6e3f84701a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55cda5fba18b]
6: (()+0x560833) [0x55cda5ec8833]
7: (CrushWrapper::move_bucket(CephContext*, int, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0xda) [0x55cda5ec644a]
8: (OSDMonitor::prepare_command_impl(std::shared_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >&)+0x2cfe) [0x55cda5c8701e]
9: (OSDMonitor::prepare_command(std::shared_ptr<MonOpRequest>)+0x2ff) [0x55cda5c9903f]
10: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x24b) [0x55cda5c9958b]
11: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xb4f) [0x55cda5c4c0af]
12: (PaxosService::C_RetryMessage::_finish(int)+0x58) [0x55cda5c4d698]
13: (C_MonOp::finish(int)+0x82) [0x55cda5c15862]
14: (Context::complete(int)+0x9) [0x55cda5c14949]
15: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x1fb) [0x55cda5c1b25b]
16: (Paxos::finish_round()+0x287) [0x55cda5c41b17]
17: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xe19) [0x55cda5c42cf9]
18: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x250) [0x55cda5c43520]
19: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xa38) [0x55cda5c0ee68]
20: (Monitor::_ms_dispatch(Message*)+0x554) [0x55cda5c0f664]
21: (Monitor::ms_dispatch(Message*)+0x23) [0x55cda5c326f3]
22: (DispatchQueue::entry()+0xf2b) [0x55cda60aedfb]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cda5fa032d]
24: (()+0x76fa) [0x7f6e4165a6fa]
25: (clone()+0x6d) [0x7f6e3f916b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2016-06-29 14:10:30.346791 7f6e3951d700 -1 *** Caught signal (Aborted) **
in thread 7f6e3951d700 thread_name:ms_dispatch

ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
1: (()+0x5233be) [0x55cda5e8b3be]
2: (()+0x113d0) [0x7f6e416643d0]
3: (gsignal()+0x38) [0x7f6e3f845418]
4: (abort()+0x16a) [0x7f6e3f84701a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55cda5fba18b]
6: (()+0x560833) [0x55cda5ec8833]
7: (CrushWrapper::move_bucket(CephContext*, int, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0xda) [0x55cda5ec644a]
8: (OSDMonitor::prepare_command_impl(std::shared_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >&)+0x2cfe) [0x55cda5c8701e]
9: (OSDMonitor::prepare_command(std::shared_ptr<MonOpRequest>)+0x2ff) [0x55cda5c9903f]
10: (OSDMonitor::prepare_update(std::shared_ptr<MonOpRequest>)+0x24b) [0x55cda5c9958b]
11: (PaxosService::dispatch(std::shared_ptr<MonOpRequest>)+0xb4f) [0x55cda5c4c0af]
12: (PaxosService::C_RetryMessage::_finish(int)+0x58) [0x55cda5c4d698]
13: (C_MonOp::finish(int)+0x82) [0x55cda5c15862]
14: (Context::complete(int)+0x9) [0x55cda5c14949]
15: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x1fb) [0x55cda5c1b25b]
16: (Paxos::finish_round()+0x287) [0x55cda5c41b17]
17: (Paxos::handle_last(std::shared_ptr<MonOpRequest>)+0xe19) [0x55cda5c42cf9]
18: (Paxos::dispatch(std::shared_ptr<MonOpRequest>)+0x250) [0x55cda5c43520]
19: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0xa38) [0x55cda5c0ee68]
20: (Monitor::_ms_dispatch(Message*)+0x554) [0x55cda5c0f664]
21: (Monitor::ms_dispatch(Message*)+0x23) [0x55cda5c326f3]
22: (DispatchQueue::entry()+0xf2b) [0x55cda60aedfb]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cda5fa032d]
24: (()+0x76fa) [0x7f6e4165a6fa]
25: (clone()+0x6d) [0x7f6e3f916b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.pp5.log
```

OSD tree at the moment of the crash:

    ID  WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -17 1.00000 root fast2500
    -11 1.00000     host pp7
      0 1.00000         osd.0          up  1.00000          1.00000
     -5 9.12993 root ssd
     -4 4.79999     host pp11
     11 1.20000         osd.11         up  1.00000          1.00000
      3 1.20000         osd.3          up  1.00000          1.00000
      2 1.20000         osd.2          up  1.00000          1.00000
      1 1.20000         osd.1          up  1.00000          1.00000
     -8       0     host pp2
     -7       0     host pp3
    -12 0.25000     host pp4
      8 0.25000         osd.8          up  1.00000          1.00000
    -13 0.48000     host pp1
      4 0.48000         osd.4          up  1.00000          1.00000
     -2 0.09999     host c2
      9 0.09999         osd.9          up  0.79999          1.00000
     -1 0.09999     host c1
      6 0.09999         osd.6          up  1.00000          1.00000
     -3 0.09999     host c3
     10 0.09999         osd.10         up  1.00000          1.00000
     -6 0.70000     host c4
     12 0.70000         osd.12         up  0.79999          1.00000
     -9 0.45000     host c5
     13 0.45000         osd.13         up  1.00000          1.00000
    -10 0.45000     host c6
     14 0.45000         osd.14         up  1.00000          1.00000
    -14 0.45000     host c7
     15 0.45000         osd.15         up  1.00000          1.00000
    -15 0.79999     host c8
     16 0.79999         osd.16         up  1.00000          1.00000
    -16 0.45000     host c9
     17 0.45000         osd.17         up  1.00000          1.00000
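
For readers unfamiliar with the failing step: the backtrace shows `CrushWrapper::move_bucket` failing an assertion inside `detach_bucket`, so a move appears to be "detach from the current parent, then insert at the new location". The toy model below only illustrates that shape and the kind of invariant `assert(successful_detach)` guards. It is not Ceph's code and makes no claim about the root cause; `Bucket`, `ToyCrush`, `find_parent`, `detach`, and `move` are hypothetical names, and only a small slice of the tree above is modeled.

```python
# Toy model only: hypothetical names, not Ceph's CrushWrapper implementation.
from dataclasses import dataclass, field


@dataclass
class Bucket:
    name: str
    weight: float = 0.0
    children: dict = field(default_factory=dict)  # child name -> Bucket


class ToyCrush:
    def __init__(self, roots):
        self.roots = {r.name: r for r in roots}

    def find_parent(self, name):
        """Return the bucket that directly contains `name`, or None."""
        stack = list(self.roots.values())
        while stack:
            bucket = stack.pop()
            if name in bucket.children:
                return bucket
            stack.extend(bucket.children.values())
        return None

    def detach(self, name):
        """Unlink `name` from its parent and subtract its weight there."""
        parent = self.find_parent(name)
        successful_detach = parent is not None
        # Analogous in spirit to the check the monitor aborted on.
        assert successful_detach, f"could not detach {name!r}"
        child = parent.children.pop(name)
        # Only the direct parent is adjusted, which is enough for this
        # two-level slice (hosts sit directly under roots).
        parent.weight -= child.weight
        return child

    def move(self, name, new_root):
        """Detach `name`, then attach it under the root `new_root`."""
        child = self.detach(name)
        target = self.roots[new_root]
        target.children[name] = child
        target.weight += child.weight


# A slice of the tree above: pp1 starts under root "ssd".
fast2500 = Bucket("fast2500", 1.0, {"pp7": Bucket("pp7", 1.0)})
ssd = Bucket("ssd", 0.73, {"pp4": Bucket("pp4", 0.25), "pp1": Bucket("pp1", 0.48)})
crush = ToyCrush([fast2500, ssd])

crush.move("pp1", "fast2500")  # mirrors: ceph osd crush move pp1 root=fast2500
print(crush.find_parent("pp1").name)  # fast2500
```

In this model the move succeeds. The real monitor aborted at the detach step instead, which is why the command never completed.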

mon.tar.gz - dump from /var/lib/ceph/mon/ceph-pp5 (755 KB) George Shuklin, 06/29/2016 01:35 PM


Related issues

Copied to Backport #16583: jewel: mon crash: crush/CrushWrapper.h: 940: FAILED assert(successful_detach) Resolved
Copied to Backport #16584: hammer: mon crash: crush/CrushWrapper.h: 940: FAILED assert(successful_detach) Resolved

History

#1 Updated by George Shuklin 10 months ago

Ceph is running on Ubuntu 16.04, version 10.2.0-0ubuntu0.16.04.1.

#2 Updated by George Shuklin 10 months ago

It continues to crash after upgrading to 10.2.0-0ubuntu0.16.04.2.

I've attached the contents of the mon directory.

#3 Updated by Ian Colle 10 months ago

  • Assignee set to Kefu Chai

#4 Updated by George Shuklin 10 months ago

I can reproduce the issue with version 10.2.2-0ubuntu1.

(Launchpad bug report: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1597411)

#5 Updated by Kefu Chai 10 months ago

  • Status changed from New to Need Review
  • Backport set to jewel, hammer

#6 Updated by Kefu Chai 10 months ago

  • Status changed from Need Review to Pending Backport

#7 Updated by Nathan Cutler 10 months ago

  • Copied to Backport #16583: jewel: mon crash: crush/CrushWrapper.h: 940: FAILED assert(successful_detach) added

#8 Updated by Nathan Cutler 10 months ago

  • Copied to Backport #16584: hammer: mon crash: crush/CrushWrapper.h: 940: FAILED assert(successful_detach) added

#9 Updated by stephane beuret 9 months ago

Same issue with armhf packages.

     0> 2016-07-21 15:35:52.357458 72d9ad00 -1 *** Caught signal (Segmentation fault) **
 in thread 72d9ad00 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x3ebf7a) [0x54f87f7a]
 2: (()+0x25250) [0x76a0a250]

#10 Updated by Nathan Cutler 5 months ago

  • Status changed from Pending Backport to Resolved
  • Needs Doc set to No
