Bug #23386
crush device class: Monitor Crash when moving Bucket into Default root
Status: Closed
Description
Moving prestaged hosts (with disks) that sit outside of a root into the root causes the monitor to crash.
I have tried moving an empty rack into a building under the root and the same issue occurs, but I can move hosts between racks outside of the root. I have not tried to move anything inside the root, as this is currently a production cluster.
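For context, the moves that crash are plain crush move commands; the bucket and building names below are illustrative, not our real ones:

    ceph osd crush move rack1 building=building1   # rack from outside the root into a building under it: crashes the mon
    ceph osd crush move host1 rack=rack2           # between racks outside the root: works fine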
I did raise this via the mailing list but had no replies: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025537.html
Mon crash dump from moving a rack containing a host into the correct building under the default root (names and IPs have been shortened):
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x8f59b1) [0x55f6c06079b1]
 2: (()+0xf5e0) [0x7f51c001c5e0]
 3: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0xa87) [0x55f6c057fb27]
 4: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0x305) [0x55f6c057f3a5]
 5: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0x305) [0x55f6c057f3a5]
 6: (CrushWrapper::populate_classes(std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&)+0x1cf) [0x55f6c058012f]
 7: (CrushWrapper::rebuild_roots_with_classes()+0xfe) [0x55f6c05802de]
 8: (CrushWrapper::insert_item(CephContext*, int, float, std::string, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)+0x7af) [0x55f6c058203f]
 9: (CrushWrapper::move_bucket(CephContext*, int, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)+0xc1) [0x55f6c0582b41]
 10: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >&)+0x4eee) [0x55f6c024a72e]
 11: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x647) [0x55f6c0265807]
 12: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x39e) [0x55f6c0265f6e]
 13: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55f6c01f26a8]
 14: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1d3e) [0x55f6c00cd75e]
 15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x919) [0x55f6c00d3009]
 16: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55f6c00d428b]
 17: (Monitor::handle_forward(boost::intrusive_ptr<MonOpRequest>)+0xa8d) [0x55f6c00d5b9d]
 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdbd) [0x55f6c00d34ad]
 19: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55f6c00d428b]
 20: (Monitor::ms_dispatch(Message*)+0x23) [0x55f6c01003f3]
 21: (DispatchQueue::entry()+0x792) [0x55f6c05b2d92]
 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6c03aa7fd]
 23: (()+0x7e25) [0x7f51c0014e25]
 24: (clone()+0x6d) [0x7f51bd18c34d]
When I first hit this, the cluster status was HEALTH_ERR due to backfillfull disks; I have since brought the cluster back to HEALTH_OK and the problem still persists.
I have tried running the command from one admin/mgr node and from two different monitors. If I keep the command going, it will slowly take down all the monitors.
This error can be reproduced every time.
Updated by Warren Jeffs about 6 years ago
It appears Paul Emmerich has found the problem, and it comes down to the weights.
The email chain can be seen from the mailing list here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025537.html
Updated by Warren Jeffs about 6 years ago
It appears the error is in calculating the host weight:
it has been set to 43.664 when it should be 43.668.
I set the correct weight and removed the choose_args section from the map; testing on another test cluster, the map imports correctly and I was able to move the buckets as needed.
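For anyone else needing it, a minimal sketch of that workaround via crushtool (file names are illustrative):

    ceph osd getcrushmap -o crushmap.bin         # export the current crush map
    crushtool -d crushmap.bin -o crushmap.txt    # decompile it to text
    # edit crushmap.txt: correct the host weight (43.664 -> 43.668)
    # and delete the entire choose_args section
    crushtool -c crushmap.txt -o crushmap.new    # recompile
    ceph osd setcrushmap -i crushmap.new         # inject the fixed map

crushtool also has a --reweight option that recalculates internal bucket weights from the leaves, which may be a less error-prone way to correct the host weight than editing it by hand.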
Updated by Patrick Donnelly about 6 years ago
- Project changed from Ceph to rgw
- Subject changed from Monitor Crash when moving Bucket into Default root (12.2.4) to rgw: Monitor Crash when moving Bucket into Default root
- Category deleted (Monitor)
- Source set to Community (user)
- Backport set to luminous
- Release deleted (luminous)
- ceph-qa-suite deleted (ceph-deploy)
Updated by Patrick Donnelly about 6 years ago
- Project changed from rgw to RADOS
- Subject changed from rgw: Monitor Crash when moving Bucket into Default root to CRUSH: Monitor Crash when moving Bucket into Default root
- Component(RADOS) CRUSH added
Updated by John Spray almost 6 years ago
- Description updated (diff)
(Pulling backtrace into the ticket)
Updated by Greg Farnum almost 6 years ago
- Subject changed from CRUSH: Monitor Crash when moving Bucket into Default root to crush device class: Monitor Crash when moving Bucket into Default root
- Category set to Administration/Usability
- Priority changed from Normal to High
- Component(RADOS) Monitor added
- Component(RADOS) deleted (CRUSH)
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025569.html
Paul Emmerich wrote:
Looks like it fails to adjust the number of weight set entries when moving the entries. The good news is that this is 100% reproducible with your crush map; you should open a bug at http://tracker.ceph.com/ to get this fixed. Deleting the weight set fixes the problem. Moving the item manually with manual adjustment of the weight set also works in my quick test.
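For reference, a sketch of removing the weight set from the CLI, assuming the compat weight set is the one involved (a per-pool weight set would use rm <pool> instead):

    ceph osd crush weight-set ls         # list existing weight sets
    ceph osd crush weight-set rm-compat  # remove the compat weight set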
Updated by Greg Farnum almost 6 years ago
- Has duplicate Bug #23836: Moving rack bucket to default root is not possible added
Updated by Jarek Owsiewski almost 6 years ago
Any update? The mentioned workaround is not a good idea for us.
Updated by Sage Weil almost 6 years ago
- Status changed from New to 12
- Assignee set to Sage Weil
I suspect the recent PR https://github.com/ceph/ceph/pull/22091 fixed this, but I am still figuring out how to reproduce it to be sure.
Updated by Sage Weil almost 6 years ago
Reproduces on luminous with:

    bin/init-ceph stop ; MON=1 OSD=3 MDS=0 ../src/vstart.sh -d -n -x -l   # fresh vstart dev cluster
    bin/ceph osd crush weight-set create-compat                           # create a compat weight set
    bin/ceph osd crush add-bucket foo rack                                # add a rack bucket outside any root
    bin/ceph osd crush move foo root=default                              # this move crashes the mon
Updated by Sage Weil almost 6 years ago
- Status changed from 12 to Fix Under Review
- Backport changed from luminous to mimic,luminous
Updated by Kefu Chai almost 6 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #24258: luminous: crush device class: Monitor Crash when moving Bucket into Default root added
Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #24259: mimic: crush device class: Monitor Crash when moving Bucket into Default root added
Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved