Bug #58156

Monitors do not permit OSD to join after upgrading to Quincy

Added by Igor Fedotov 2 months ago. Updated about 2 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A Nautilus cluster was eventually upgraded to Quincy, and at the end of the upgrade the OSDs stopped joining the cluster.

The investigation revealed that the monitor does not permit OSDs to join and logs the following error:
mon.ceph1-1 (mon.2) 147 : cluster [INF] disallowing boot of quincy+ OSD osd.16 v2:xx.xx.xx.xx:xxx/xxxx because require_osd_release < octopus

The current require-osd-release is still at nautilus: the user missed the step that raises it during the upgrade.
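The condition behind that log line can be illustrated with a toy model. This is only a sketch inferred from the message text ("disallowing boot of quincy+ OSD ... because require_osd_release < octopus"); the `Release` enum and the `may_boot` function are hypothetical names for illustration and are not Ceph's actual code.

```python
from enum import IntEnum


class Release(IntEnum):
    # Ceph major version numbers for the releases relevant to this report.
    NAUTILUS = 14
    OCTOPUS = 15
    PACIFIC = 16
    QUINCY = 17


def may_boot(osd_release: Release, require_osd_release: Release) -> bool:
    """Toy version of the check implied by the 'disallowing boot' message.

    Assumption: a quincy-or-newer OSD is refused while the cluster-wide
    require_osd_release is still below octopus.
    """
    if osd_release >= Release.QUINCY and require_osd_release < Release.OCTOPUS:
        return False
    return True


# The situation in this report: Quincy OSD, flag left at nautilus.
print(may_boot(Release.QUINCY, Release.NAUTILUS))  # False
# Had the flag been raised during the upgrade, the OSD would be admitted.
print(may_boot(Release.QUINCY, Release.OCTOPUS))   # True
```

In this model the OSD binaries are fine; the only thing blocking the boot is the stale cluster-wide flag.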

An attempt to raise require-osd-release to resolve the issue triggers the following assertion failure on the monitors:
2022-12-01T08:00:38.849-0500 7f6517e7c700 -1 /build/ceph-17.2.5/src/mon/OSDMonitor.cc: In function 'bool OSDMonitor::prepare_command_impl(MonOpRequestRef, const cmdmap_t&)' thread 7f6517e7c700 time 2022-12-01T08:00:38.838800-0500
/build/ceph-17.2.5/src/mon/OSDMonitor.cc: 11618: FAILED ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)

ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f65200387c6]
2: /usr/lib/ceph/libceph-common.so.2(+0x27c9d8) [0x7f65200389d8]
3: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0xcb03) [0x5556243df333]
4: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x45f) [0x5556243f03af]
5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x162) [0x5556243ff552]
...

Downgrading neither the monitors nor the OSDs helped: the downgraded daemons were unable to start because of the data layout changes made by Octopus.

The result was a kind of deadlock, which was finally resolved by using a custom Quincy build that omits the above assertion in OSDMonitor.cc.
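To make the deadlock concrete, here is a minimal toy model of the two monitor-side paths described above. It is a sketch under stated assumptions: `ToyMonitor`, `boot_osd`, `set_require_osd_release` and `is_deadlocked` are hypothetical names, the boot check is inferred from the log message, and the assertion is modelled as firing unconditionally at the start of the command handler, which is a simplification of whatever OSDMonitor.cc actually does.

```python
from enum import IntEnum


class Release(IntEnum):
    # Ceph major version numbers for the releases used in this sketch.
    NAUTILUS = 14
    OCTOPUS = 15
    QUINCY = 17


class ToyMonitor:
    """Hypothetical stand-in for a Quincy monitor; not Ceph's real logic."""

    def __init__(self, require_osd_release: Release):
        self.require_osd_release = require_osd_release

    def boot_osd(self, osd_release: Release) -> bool:
        # Inferred from: "disallowing boot of quincy+ OSD ... because
        # require_osd_release < octopus".
        return not (osd_release >= Release.QUINCY
                    and self.require_osd_release < Release.OCTOPUS)

    def set_require_osd_release(self, target: Release) -> None:
        # Stand-in for the failed ceph_assert quoted in the report.
        # Raising AssertionError models the monitor aborting.
        if self.require_osd_release < Release.OCTOPUS:
            raise AssertionError(
                "osdmap.require_osd_release >= ceph_release_t::octopus")
        self.require_osd_release = max(self.require_osd_release, target)


def is_deadlocked(mon: ToyMonitor) -> bool:
    """True if Quincy OSDs cannot boot AND the flag cannot be raised."""
    probe = ToyMonitor(mon.require_osd_release)  # leave the caller's state alone
    can_boot = probe.boot_osd(Release.QUINCY)
    try:
        probe.set_require_osd_release(Release.QUINCY)
        can_raise = True
    except AssertionError:
        can_raise = False
    return not can_boot and not can_raise


print(is_deadlocked(ToyMonitor(Release.NAUTILUS)))  # True
print(is_deadlocked(ToyMonitor(Release.OCTOPUS)))   # False
```

In this model the only ways out are the ones the report mentions: removing the assertion so the flag can be raised, or never reaching Quincy with the flag still at nautilus.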

Evidently this was primarily a user error (require-osd-release was not raised during the upgrade), but Ceph definitely needs friendlier means to resolve or avoid the issue.

History

#1 Updated by Igor Fedotov 2 months ago

  • Status changed from New to In Progress
  • Assignee set to Igor Fedotov
  • Backport set to quincy, pacific

#2 Updated by Igor Fedotov 2 months ago

  • Pull request ID set to 49199

#3 Updated by Radoslaw Zarzynski about 2 months ago

Hi Igor! What was the intermediary version during the upgrade? We merged https://github.com/ceph/ceph/pull/44090 but I'm not sure of its presence in very old versions.

#4 Updated by Igor Fedotov about 2 months ago

Radoslaw Zarzynski wrote:

Hi Igor! What was the intermediary version during the upgrade? We merged https://github.com/ceph/ceph/pull/44090 but I'm not sure of its presence in very old versions.

Hi Radek,
Unfortunately it's hard to say at the moment. Most likely they missed either the patch or the warning itself (I'm not sure whether it's possible to proceed with an upgrade once it's raised...).

Anyway, my patch attempts to solve this from a different perspective: allowing require_osd_release to be raised instead of hitting the assertion and the resulting deadlock.
