Bug #23360

call to 'ceph osd erasure-code-profile set' asserts the monitors

Added by Sebastian Wagner over 2 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

I've attached the output of `thread apply all bt` mixed with `thread apply all py-bt`.

Threads 38, 35, 34, 32 and 31 are waiting on futex 0x55a285204640.

Thread 37 waits in
File "/src/pybind/mgr/mgr_module.py", line 71, in wait
self.ev.wait()

AFAICT, all other threads are not part of this deadlock.
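To illustrate the wait pattern seen in thread 37, here is a minimal sketch. It assumes a result object that wraps a `threading.Event`; the class `CommandResultSketch` and the function `demo` are hypothetical stand-ins, not the real mgr_module code. It shows that a `wait()` without a timeout can only return once something signals the event, so a reply that never arrives looks like a hang.

```python
import threading


class CommandResultSketch:
    """Hypothetical stand-in for a result object that waits on an Event."""

    def __init__(self):
        self.ev = threading.Event()
        self.r = None
        self.outs = ""

    def complete(self, r, outs):
        # Called by whoever delivers the reply; wakes up the waiter.
        self.r = r
        self.outs = outs
        self.ev.set()

    def wait(self, timeout=None):
        # With timeout=None this blocks until complete() is called.
        # Returns False if the timeout expired before a reply arrived.
        return self.ev.wait(timeout)


def demo():
    # Nobody ever calls complete(), so only the timeout ends the wait.
    never_answered = CommandResultSketch()
    timed_out = not never_answered.wait(timeout=0.1)

    # Here a second thread delivers the reply, so wait() returns True.
    answered = CommandResultSketch()
    t = threading.Thread(target=answered.complete, args=(0, "ok"))
    t.start()
    got_reply = answered.wait(timeout=5)
    t.join()
    return timed_out, got_reply, answered.r, answered.outs
```

In this sketch, a waiter stuck in `self.ev.wait()` does not by itself prove a lock cycle; it may simply mean the side that should deliver the reply never did.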

deadlock.txt (651 KB) Sebastian Wagner, 03/14/2018 02:33 PM


Related issues

Duplicates RADOS - Bug #23345: `ceph osd erasure-code-profile set` crashes the monitors on vstart clusters Resolved 03/13/2018

History

#1 Updated by Ricardo Dias over 2 years ago

Could you point to the code, or provide a small python example, that triggers this deadlock?

#2 Updated by Sebastian Wagner over 2 years ago

The send_command() function visible in this traceback is: https://github.com/ceph/ceph/pull/20865/files#diff-188b91d966a54c045089a507b218e586R108

The code of this deadlock including erasure_code_profile.py is here: https://github.com/sebastian-philipp/ceph/commit/3444cf5f2c22a35bf411ecb566f7fa1ba5670bc7

#3 Updated by Sebastian Wagner over 2 years ago

Hm, it's quite possible that this is in fact not a classic deadlock.

Turns out, the `ceph` command line tool is also broken:

$ ceph osd erasure-code-profile set threetwo k=3 m=2 

^CError EINTR: Interrupted!

Environment:

Fresh build of the latest master (git commit 619d435a) with a vstart.sh cluster (ceph version 13.0.1-3034)

#4 Updated by Sebastian Wagner over 2 years ago

  • Project changed from mgr to RADOS
  • Category deleted (python interface)

Found the cause of this. From the mon.a.log:

5> 2018-03-14 16:55:26.470 7f61aa02a700 10 log_client  will send 2018-03-14 16:55:26.473781 mon.a mon.0 192.168.178.29:40706/0 93 : audit [INF] from='client.154151 -' entity='client.admin' cmd=[{"profile": ["k=3", "m=3"], "prefix": "osd erasure-code-profile set", "name": "myprofile"}]: dispatch
... snip ...
/home/sebastian/Repos/ceph/src/mon/OSDMonitor.cc: 5501: FAILED assert((*erasure_code_profile_map).count("plugin"))

 ceph version 13.0.1-3034-g619d435a71 (619d435a71f002571bda0c71dd26d75deaf480c3) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x137) [0x7f61b722c9fb]
 2: (OSDMonitor::parse_erasure_code_profile(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >*, std::ostream*)+0x145) [0x5584109774ef]
 3: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<

#5 Updated by Sebastian Wagner over 2 years ago

A proper fix would be for OSDMonitor::parse_erasure_code_profile to return a meaningful error message instead of asserting when the list of ecp plugins is empty.
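The idea can be sketched in Python (the real function is C++; this is not the Ceph implementation). The sketch assumes the failure condition is a merged profile without a "plugin" key, which is what the assertion text suggests. The function name, the `defaults` parameter, the handling of bare keys, and the error text are all hypothetical.

```python
import errno


def parse_erasure_code_profile(pairs, defaults=None):
    """Return (ret, profile, error_message) instead of asserting.

    pairs    -- list of "key=value" strings, e.g. ["k=3", "m=2"]
    defaults -- hypothetical default profile merged in first
    """
    profile = dict(defaults or {})
    for item in pairs:
        key, sep, value = item.partition("=")
        if not sep:
            # Choice made for this sketch: a bare key gets an empty value.
            value = ""
        profile[key] = value

    if "plugin" not in profile:
        # Report the problem to the caller rather than aborting the process.
        return -errno.EINVAL, None, "erasure-code profile has no 'plugin' entry"
    return 0, profile, ""
```

With the arguments from this report, `parse_erasure_code_profile(["k=3", "m=2"])` returns a negative errno and a message, so a caller could pass the error back to the client instead of the daemon going down.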

#6 Updated by Joao Eduardo Luis over 2 years ago

  • Status changed from New to Duplicate

#7 Updated by Joao Eduardo Luis over 2 years ago

  • Duplicates Bug #23345: `ceph osd erasure-code-profile set` crashes the monitors on vstart clusters added

#8 Updated by Joao Eduardo Luis over 2 years ago

  • Subject changed from Python Deadlock in mgr_module.CommandResult#wait to call to 'ceph osd erasure-code-profile set' asserts the monitors
