Bug #21016

CRUSH crash on bad memory handling

Added by Greg Farnum about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 08/16/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): CRUSH, Monitor
Pull request ID:
Crash signature:

Description

2017-08-14T21:48:50.221 INFO:tasks.ceph.mon.f.smithi115.stderr:src/tcmalloc.cc:282] Attempt to realloc invalid pointer 0x7d202020200a7d20
2017-08-14T21:48:50.224 INFO:tasks.ceph.mon.f.smithi115.stderr:*** Caught signal (Aborted) **
2017-08-14T21:48:50.226 INFO:tasks.ceph.mon.f.smithi115.stderr: in thread 7f7a19565700 thread_name:ms_dispatch
2017-08-14T21:48:50.228 INFO:tasks.ceph.mon.f.smithi115.stderr: ceph version 12.1.3-18-ge1ec9fb (e1ec9fb9cb00e121f9edf518e53bab423df28e01) luminous (rc)
2017-08-14T21:48:50.232 INFO:tasks.ceph.mon.f.smithi115.stderr: 1: (()+0x9497f4) [0x564000cf57f4]
2017-08-14T21:48:50.239 INFO:tasks.ceph.mon.f.smithi115.stderr: 2: (()+0x11390) [0x7f7a20ab5390]
2017-08-14T21:48:50.256 INFO:tasks.ceph.mon.f.smithi115.stderr: 3: (gsignal()+0x38) [0x7f7a1f203428]
2017-08-14T21:48:50.264 INFO:tasks.ceph.mon.f.smithi115.stderr: 4: (abort()+0x16a) [0x7f7a1f20502a]
2017-08-14T21:48:50.271 INFO:tasks.ceph.mon.f.smithi115.stderr: 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x22e) [0x7f7a200725ce]
2017-08-14T21:48:50.279 INFO:tasks.ceph.mon.f.smithi115.stderr: 6: (()+0x137cf) [0x7f7a200667cf]
2017-08-14T21:48:50.292 INFO:tasks.ceph.mon.f.smithi115.stderr: 7: (tc_realloc()+0x1cd) [0x7f7a200860ed]
2017-08-14T21:48:50.299 INFO:tasks.ceph.mon.f.smithi115.stderr: 8: (CrushWrapper::bucket_add_item(crush_bucket*, int, int)+0x11b) [0x564000c5b79b]
2017-08-14T21:48:50.310 INFO:tasks.ceph.mon.f.smithi115.stderr: 9: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*)+0x4fa) [0x564000c6a8da]
2017-08-14T21:48:50.675 INFO:tasks.ceph.mon.f.smithi115.stderr: 10: (CrushWrapper::populate_classes(std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&)+0x183) [0x564000c6ada3]
2017-08-14T21:48:50.726 INFO:tasks.ceph.mon.f.smithi115.stderr: 11: (CrushWrapper::rebuild_roots_with_classes()+0x108) [0x564000c6b3f8]
2017-08-14T21:48:50.784 INFO:tasks.ceph.mon.f.smithi115.stderr: 12: (CrushWrapper::update_device_class(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x13b) [0x564000c6ed2b]
2017-08-14T21:48:50.792 INFO:tasks.ceph.mon.f.smithi115.stderr: 13: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >&)+0x1bb3) [0x5640008de213]
2017-08-14T21:48:50.800 INFO:tasks.ceph.mon.f.smithi115.stderr: 14: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x2b6) [0x564000901466]
2017-08-14T21:48:50.810 INFO:tasks.ceph.mon.f.smithi115.stderr: 15: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x280) [0x5640009018a0]
2017-08-14T21:48:50.817 INFO:tasks.ceph.mon.f.smithi115.stderr: 16: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x9b4) [0x5640008882e4]
2017-08-14T21:48:50.825 INFO:tasks.ceph.mon.f.smithi115.stderr: 17: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x235a) [0x56400074eaea]
2017-08-14T21:48:50.830 INFO:tasks.ceph.mon.f.smithi115.stderr: 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xa0e) [0x564000755aae]
2017-08-14T21:48:50.835 INFO:tasks.ceph.mon.f.smithi115.stderr: 19: (Monitor::_ms_dispatch(Message*)+0x6db) [0x564000756afb]
2017-08-14T21:48:50.838 INFO:tasks.ceph.mon.f.smithi115.stderr: 20: (Monitor::handle_forward(boost::intrusive_ptr<MonOpRequest>)+0x826) [0x5640007584f6]
2017-08-14T21:48:50.841 INFO:tasks.ceph.mon.f.smithi115.stderr: 21: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xcfa) [0x564000755d9a]
2017-08-14T21:48:50.843 INFO:tasks.ceph.mon.f.smithi115.stderr: 22: (Monitor::_ms_dispatch(Message*)+0x6db) [0x564000756afb]
2017-08-14T21:48:50.850 INFO:tasks.ceph.mon.f.smithi115.stderr: 23: (Monitor::ms_dispatch(Message*)+0x23) [0x564000785033]
2017-08-14T21:48:50.853 INFO:tasks.ceph.mon.f.smithi115.stderr: 24: (DispatchQueue::entry()+0xf4a) [0x564000c9de3a]
2017-08-14T21:48:50.856 INFO:tasks.ceph.mon.f.smithi115.stderr: 25: (DispatchQueue::DispatchThread::entry()+0xd) [0x564000a54c7d]
2017-08-14T21:48:50.858 INFO:tasks.ceph.mon.f.smithi115.stderr: 26: (()+0x76ba) [0x7f7a20aab6ba]
2017-08-14T21:48:50.861 INFO:tasks.ceph.mon.f.smithi115.stderr: 27: (clone()+0x6d) [0x7f7a1f2d53dd]
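
A side note on the first log line: every byte of the pointer that tcmalloc rejected is printable ASCII or a newline. The sketch below decodes the value as it would sit in memory, assuming a little-endian 64-bit host (an assumption; the log does not state the architecture). The helper name `pointer_as_ascii` is made up for illustration. A result that reads like text would be consistent with string data having been written over the pointer, but it does not prove it.

```python
def pointer_as_ascii(value: int, width: int = 8) -> str:
    """Return the bytes of a pointer-sized integer, in little-endian
    memory order (assumed), decoded as ASCII text."""
    raw = value.to_bytes(width, byteorder="little")
    # Non-ASCII bytes would show up as U+FFFD instead of raising.
    return raw.decode("ascii", errors="replace")


# Value reported by tcmalloc in the first line of the log above.
bad_pointer = 0x7D202020200A7D20

print(repr(pointer_as_ascii(bad_pointer)))  # prints ' }\n    }'
```

The decoded bytes are a space, a closing brace, a newline, four spaces and another closing brace, which looks like the tail of some indented, brace-delimited text.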

This happened in teuthology-2017-08-13_02:30:06-rados-luminous-distro-basic-smithi/1520040 (and in most of the other jobs marked as dead in that run). It appears to be triggered by any of the jobs that use the rados_mon_workunits.yaml fragment. That fragment also shows up in a few "dead" jobs from the previous rados luminous run, though not in the run before that.
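
To check other jobs for the same signature, something along these lines could be run over their teuthology logs. It is a minimal sketch that assumes the log lines have the same shape as the excerpt above; the regular expression and the function name `find_realloc_aborts` are illustrative and not part of teuthology.

```python
import re

# Matches lines like:
#   <timestamp> INFO:tasks.ceph.mon.f.smithi115.stderr:src/tcmalloc.cc:282]
#   Attempt to realloc invalid pointer 0x7d202020200a7d20
# The line layout is assumed from the excerpt in this report.
SIGNATURE = re.compile(
    r"INFO:tasks\.ceph\.(?P<daemon>[\w.]+?)\.(?P<host>[^.\s]+)\.stderr:"
    r".*Attempt to realloc invalid pointer (?P<pointer>0x[0-9a-fA-F]+)"
)


def find_realloc_aborts(lines):
    """Return (daemon, host, pointer) for each line with the tcmalloc complaint."""
    hits = []
    for line in lines:
        match = SIGNATURE.search(line)
        if match:
            hits.append((match["daemon"], match["host"], match["pointer"]))
    return hits


sample = [
    "2017-08-14T21:48:50.221 INFO:tasks.ceph.mon.f.smithi115.stderr:"
    "src/tcmalloc.cc:282] Attempt to realloc invalid pointer 0x7d202020200a7d20",
    "2017-08-14T21:48:50.224 INFO:tasks.ceph.mon.f.smithi115.stderr:"
    "*** Caught signal (Aborted) **",
]

print(find_realloc_aborts(sample))
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.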

I do see that https://github.com/ceph/ceph/pull/16805 touched the device class code, but that was long enough ago that it should have caused errors earlier, so something else may be responsible.


Related issues

Copied to RADOS - Backport #21106: luminous: CRUSH crash on bad memory handling Resolved

History

#1 Updated by Greg Farnum about 2 years ago

...and this was also responsible for at least a couple of failures that were detected as such.

#2 Updated by xie xingguo about 2 years ago

I believe this should be fixed by https://github.com/ceph/ceph/pull/17014/commits/6252068ec08c66513e5394188b7869782367f8bd. There are several similar issues; see https://github.com/ceph/ceph/pull/17014 for details.

#3 Updated by Kefu Chai about 2 years ago

  • Status changed from New to Need Review
  • Assignee set to xie xingguo

#4 Updated by Sage Weil about 2 years ago

  • Status changed from Need Review to Testing

#5 Updated by xie xingguo about 2 years ago

  • Status changed from Testing to Pending Backport
  • Backport set to luminous

#6 Updated by Nathan Cutler about 2 years ago

  • Copied to Backport #21106: luminous: CRUSH crash on bad memory handling added

#7 Updated by xie xingguo about 2 years ago

  • Status changed from Pending Backport to Resolved
