Bug #23386
crush device class: Monitor Crash when moving Bucket into Default root
Status: Closed
Description
Moving prestaged hosts (with disks) that sit outside of a root into the root causes the monitor to crash.
I have tried moving an empty rack into a building under the root and the same issue occurs, but I can move hosts between racks outside of the root. I have not tried to move anything inside the root, as this is currently a production cluster.
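For context, the moves that crash are plain crush move commands; the bucket and building names below are illustrative, not our real ones:

    ceph osd crush move rack1 building=building1   # rack from outside the root into a building under it: crashes the mon
    ceph osd crush move host1 rack=rack2           # between racks outside the root: works fine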
I did raise this via the mailing list but had no replies: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025537.html
Mon crash dump from moving a rack containing a host into the correct building under the default root (names and IPs have been shortened):
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0x8f59b1) [0x55f6c06079b1]
 2: (()+0xf5e0) [0x7f51c001c5e0]
 3: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0xa87) [0x55f6c057fb27]
 4: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0x305) [0x55f6c057f3a5]
 5: (CrushWrapper::device_class_clone(int, int, std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&, int*, std::map<int, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, std::vector<int, std::allocator<int> >, std::less<int>, std::allocator<std::pair<int const, std::vector<int, std::allocator<int> > > > > > > >*)+0x305) [0x55f6c057f3a5]
 6: (CrushWrapper::populate_classes(std::map<int, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >, std::less<int>, std::allocator<std::pair<int const, std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > > > > > const&)+0x1cf) [0x55f6c058012f]
 7: (CrushWrapper::rebuild_roots_with_classes()+0xfe) [0x55f6c05802de]
 8: (CrushWrapper::insert_item(CephContext*, int, float, std::string, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)+0x7af) [0x55f6c058203f]
 9: (CrushWrapper::move_bucket(CephContext*, int, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)+0xc1) [0x55f6c0582b41]
 10: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<std::string>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >&)+0x4eee) [0x55f6c024a72e]
 11: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x647) [0x55f6c0265807]
 12: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x39e) [0x55f6c0265f6e]
 13: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xaf8) [0x55f6c01f26a8]
 14: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x1d3e) [0x55f6c00cd75e]
 15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x919) [0x55f6c00d3009]
 16: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55f6c00d428b]
 17: (Monitor::handle_forward(boost::intrusive_ptr<MonOpRequest>)+0xa8d) [0x55f6c00d5b9d]
 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdbd) [0x55f6c00d34ad]
 19: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x55f6c00d428b]
 20: (Monitor::ms_dispatch(Message*)+0x23) [0x55f6c01003f3]
 21: (DispatchQueue::entry()+0x792) [0x55f6c05b2d92]
 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6c03aa7fd]
 23: (()+0x7e25) [0x7f51c0014e25]
 24: (clone()+0x6d) [0x7f51bd18c34d]
When I first hit this, the cluster status was HEALTH_ERR due to backfillfull disks; I have since brought the cluster back to HEALTH_OK and the problem still persists.
I have tried running the command from one admin/mgr node and from two different monitors. If I keep the command going, it will slowly take down all the monitors.
This error can be reproduced every time.
Updated by Warren Jeffs about 6 years ago
It appears Paul Emmerich has found the problem, and it comes down to the weights.
The email chain can be seen from the mailing list here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025537.html
Updated by Warren Jeffs about 6 years ago
It appears the error is in calculating the host weight:
it has been set to 43.664 when it should be 43.668.
I set the correct weight and removed the choose_args section from the map; testing on another test cluster, the map imports correctly and I was able to move the buckets as needed.
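For anyone else needing it, a minimal sketch of that workaround via crushtool (file names are illustrative):

    ceph osd getcrushmap -o crushmap.bin         # export the current crush map
    crushtool -d crushmap.bin -o crushmap.txt    # decompile it to text
    # edit crushmap.txt: correct the host weight (43.664 -> 43.668)
    # and delete the entire choose_args section
    crushtool -c crushmap.txt -o crushmap.new    # recompile
    ceph osd setcrushmap -i crushmap.new         # inject the fixed map

crushtool also has a --reweight option that recalculates internal bucket weights from the leaves, which may be a less error-prone way to correct the host weight than editing it by hand.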
Updated by Patrick Donnelly about 6 years ago
- Project changed from Ceph to rgw
- Subject changed from Monitor Crash when moving Bucket into Default root (12.2.4) to rgw: Monitor Crash when moving Bucket into Default root
- Category deleted (Monitor)
- Source set to Community (user)
- Backport set to luminous
- Release deleted (luminous)
- ceph-qa-suite deleted (ceph-deploy)
Updated by Patrick Donnelly about 6 years ago
- Project changed from rgw to RADOS
- Subject changed from rgw: Monitor Crash when moving Bucket into Default root to CRUSH: Monitor Crash when moving Bucket into Default root
- Component(RADOS) CRUSH added
Updated by John Spray almost 6 years ago
- Description updated (diff)
(Pulling backtrace into the ticket)
Updated by Greg Farnum almost 6 years ago
- Subject changed from CRUSH: Monitor Crash when moving Bucket into Default root to crush device class: Monitor Crash when moving Bucket into Default root
- Category set to Administration/Usability
- Priority changed from Normal to High
- Component(RADOS) Monitor added
- Component(RADOS) deleted (CRUSH)
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025569.html
Paul Emmerich wrote:
Looks like it fails to adjust the number of weight set entries when moving the entries. The good news is that this is 100% reproducible with your crush map; you should open a bug at http://tracker.ceph.com/ to get this fixed. Deleting the weight set fixes the problem. Moving the item manually with manual adjustment of the weight set also works in my quick test.
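For reference, a sketch of removing the weight set from the CLI, assuming the compat weight set is the one involved (a per-pool weight set would use rm <pool> instead):

    ceph osd crush weight-set ls         # list existing weight sets
    ceph osd crush weight-set rm-compat  # remove the compat weight set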
Updated by Greg Farnum almost 6 years ago
- Has duplicate Bug #23836: Moving rack bucket to default root is not possible added
Updated by Jarek Owsiewski almost 6 years ago
Any update? The mentioned workaround is not a good idea for us.
Updated by Sage Weil almost 6 years ago
- Status changed from New to 12
- Assignee set to Sage Weil
I suspect the recent PR https://github.com/ceph/ceph/pull/22091 fixed this, but I am still figuring out how to reproduce it to be sure.
Updated by Sage Weil almost 6 years ago
Reproduces on luminous with:

    bin/init-ceph stop ; MON=1 OSD=3 MDS=0 ../src/vstart.sh -d -n -x -l   # fresh vstart dev cluster
    bin/ceph osd crush weight-set create-compat                           # create a compat weight set
    bin/ceph osd crush add-bucket foo rack                                # add a rack bucket outside any root
    bin/ceph osd crush move foo root=default                              # this move crashes the mon
Updated by Sage Weil almost 6 years ago
- Status changed from 12 to Fix Under Review
- Backport changed from luminous to mimic,luminous
Updated by Kefu Chai almost 6 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #24258: luminous: crush device class: Monitor Crash when moving Bucket into Default root added
Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #24259: mimic: crush device class: Monitor Crash when moving Bucket into Default root added
Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved