Bug #23386 (closed)

crush device class: Monitor Crash when moving Bucket into Default root

Added by Warren Jeffs about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: mimic,luminous
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): Monitor
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Moving prestaged hosts (with disks) that sit outside of a root into that root causes the monitor to crash.

I have tried moving an empty rack into a building under the root and the same issue occurs, but I can move hosts between racks outside of the root. I have not tried moving anything inside the root, as this is currently a production cluster.
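
For reference, this is the shape of the command that triggers the crash; the bucket name here is a placeholder, not one of the actual production names (a full reproducer is in note #10 below):

ceph osd crush move rack1 root=default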

I did raise this via the mailing list but had no replies: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025537.html

Mon crash dump from moving a rack containing a host into the correct building under the default root (names and IPs have been shortened):

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x8f59b1) [0x55f6c06079b1]
 2: (()+0xf5e0) [0x7f51c001c5e0]
 3: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0xa87) [0x55f6c057fb27]
 4: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0x305) [0x55f6c057f3a5]
 5: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0x305) [0x55f6c057f3a5]
 6: (CrushWrapper::populate_classes(std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&)+0x1cf) [0x55f6c058012f]
 7: (CrushWrapper::rebuild_roots_with_classes()+0xfe) [0x55f6c05802de]
 8: (CrushWrapper::insert_item(CephContext*, int, float, std::string, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)+0x7af) [0x55f6c058203f]
 9: (CrushWrapper::move_bucket(CephContext*, int, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)+0xc1) [0x55f6c0582b41]
 10: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >&)+0x4eee) [0x55f6c024a72e]
 11: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x647) [0x55f6c0265807]
 12: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x39e) [0x55f6c0265f6e]
 13: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55f6c01f26a8]
 14: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1d3e) [0x55f6c00cd75e]
 15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x919) [0x55f6c00d3009]
 16: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55f6c00d428b]
 17: (Monitor::handle_forward(boost::intrusive_ptr<MonOpRequest>)+0xa8d) [0x55f6c00d5b9d]
 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdbd) [0x55f6c00d34ad]
 19: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55f6c00d428b]
 20: (Monitor::ms_dispatch(Message*)+0x23) [0x55f6c01003f3]
 21: (DispatchQueue::entry()+0x792) [0x55f6c05b2d92]
 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6c03aa7fd]
 23: (()+0x7e25) [0x7f51c0014e25]
 24: (clone()+0x6d) [0x7f51bd18c34d]

When I was doing this, the cluster status was in error due to having backfill_full disks. I have since got the cluster back to HEALTH_OK, and the problem still persists.

I have tried running this from one admin/mgr node and from two different monitors. If I leave the command going, it will slowly take down all the monitors.

This error can be reproduced every time.


Related issues 3 (0 open, 3 closed)

Has duplicate: Ceph - Bug #23836: Moving rack bucket to default root is not possible (Duplicate, 04/24/2018)
Copied to: RADOS - Backport #24258: luminous: crush device class: Monitor Crash when moving Bucket into Default root (Resolved, Prashant D)
Copied to: RADOS - Backport #24259: mimic: crush device class: Monitor Crash when moving Bucket into Default root (Resolved, Kefu Chai)

#1

Updated by Warren Jeffs about 6 years ago

It appears Paul Emmerich has found the problem, and it's down to the weights.

The email chain can be seen from the mailing list here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025537.html

#2

Updated by Warren Jeffs about 6 years ago

It appears the error is in calculating the host weight: it has been set at 43.664 when it should be 43.668.

I set the correct weight and removed the choose_args section from the map; testing on another test cluster, it imports correctly and I was able to move the buckets as needed.
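
For anyone needing the same workaround, the usual way to apply it is to round-trip the CRUSH map through crushtool; the file names here are arbitrary, and the weight fix and choose_args removal are done by hand in the decompiled text:

# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: correct the host weight and delete the choose_args { ... } section
crushtool -c crushmap.txt -o crushmap-fixed.bin
ceph osd setcrushmap -i crushmap-fixed.bin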

#3

Updated by Patrick Donnelly about 6 years ago

  • Project changed from Ceph to rgw
  • Subject changed from Monitor Crash when moving Bucket into Default root (12.2.4) to rgw: Monitor Crash when moving Bucket into Default root
  • Category deleted (Monitor)
  • Source set to Community (user)
  • Backport set to luminous
  • Release deleted (luminous)
  • ceph-qa-suite deleted (ceph-deploy)
#4

Updated by Patrick Donnelly about 6 years ago

  • Project changed from rgw to RADOS
  • Subject changed from rgw: Monitor Crash when moving Bucket into Default root to CRUSH: Monitor Crash when moving Bucket into Default root
  • Component(RADOS) CRUSH added
#5

Updated by John Spray almost 6 years ago

  • Description updated (diff)

(Pulling backtrace into the ticket)

#6

Updated by Greg Farnum almost 6 years ago

  • Subject changed from CRUSH: Monitor Crash when moving Bucket into Default root to crush device class: Monitor Crash when moving Bucket into Default root
  • Category set to Administration/Usability
  • Priority changed from Normal to High
  • Component(RADOS) Monitor added
  • Component(RADOS) deleted (CRUSH)

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025569.html
Paul Emmerich wrote:

"looks like it fails to adjust the number of weight set entries when moving the entries. The good news is that this is 100% reproducible with your crush map: you should open a bug at http://tracker.ceph.com/ to get this fixed.

Deleting the weight set fixes the problem. Moving the item manually with manual adjustment of the weight set also works in my quick test."
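
A minimal sketch of the weight-set deletion Paul describes, assuming the weight set in question is the compat weight set (as in the reproducer in note #10):

ceph osd crush weight-set rm-compat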

#7

Updated by Greg Farnum almost 6 years ago

  • Has duplicate Bug #23836: Moving rack bucket to default root is not possible added
#8

Updated by Jarek Owsiewski almost 6 years ago

Any update? The mentioned workaround is not a good idea for us.

#9

Updated by Sage Weil almost 6 years ago

  • Status changed from New to 12
  • Assignee set to Sage Weil

I suspect the recent PR https://github.com/ceph/ceph/pull/22091 fixed this, but I am figuring out how to reproduce it to be sure.

#10

Updated by Sage Weil almost 6 years ago

Reproduces on luminous with:

# start a fresh dev cluster (1 mon, 3 OSDs) from a Ceph source build
bin/init-ceph stop ; MON=1 OSD=3 MDS=0 ../src/vstart.sh -d -n -x -l
# create the compat weight set, then add a new rack bucket and move it under the default root
bin/ceph osd crush weight-set create-compat
bin/ceph osd crush add-bucket foo rack
# this final move is what crashes the mon
bin/ceph osd crush move foo root=default
#11

Updated by Sage Weil almost 6 years ago

  • Status changed from 12 to Fix Under Review
  • Backport changed from luminous to mimic,luminous
#12

Updated by Kefu Chai almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport
#13

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24258: luminous: crush device class: Monitor Crash when moving Bucket into Default root added
#14

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24259: mimic: crush device class: Monitor Crash when moving Bucket into Default root added
#15

Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved