Bug #63861

open

[stretch_mode] All MONs crash if we allow user to input same Datacenter name and host name.

Added by Kamoltat (Junior) Sirivadhna 5 months ago. Updated 4 months ago.

Status:
Fix Under Review
Priority:
Normal
Category:
Monitor
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When mons are added to a stretch cluster manually and a mon daemon is added with an incorrect location attribute (a site name that is the same as one of the host names), all the mons in the cluster crash repeatedly and the cluster becomes unresponsive.

Crash message:

{
"crash_id": "2023-07-24T09:33:26.393629Z_b09fde97-3607-40f6-9641-218b53336fe3",
"timestamp": "2023-07-24T09:33:26.393629Z",
"process_name": "ceph-mon",
"entity_name": "mon.osd-1",
"ceph_version": "16.2.10-187.el8cp",
"utsname_hostname": "osd-1",
"utsname_sysname": "Linux",
"utsname_release": "4.18.0-477.15.1.el8_8.x86_64",
"utsname_version": "#1 SMP Fri Jun 2 08:27:19 EDT 2023",
"utsname_machine": "x86_64",
"os_name": "Red Hat Enterprise Linux",
"os_id": "rhel",
"os_version_id": "8.8",
"os_version": "8.8 (Ootpa)",
"assert_condition": "osdmap.crush->name_exists(bucket_name)",
"assert_func": "bool OSDMonitor::check_for_dead_crush_zones(const std::map<std::__cxx11::basic_string<char>, std::set<std::__cxx11::basic_string<char> > >&, std::set<int>*, std::set<std::__cxx11::basic_string<char> >)",
"assert_file": "/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc",
"assert_line": 14689,
"assert_thread_name": "ms_dispatch",
"assert_msg": "/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc: In function 'bool OSDMonitor::check_for_dead_crush_zones(const std::map<std::__cxx11::basic_string<char>, std::set<std::__cxx11::basic_string<char> > >&, std::set<int>*, std::set<std::__cxx11::basic_string<char> >)' thread 7f338e775700 time 2023-07-24T09:33:26.365352+0000\n/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc: 14689: FAILED ceph_assert(osdmap.crush->name_exists(bucket_name))\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12cf0) [0x7f339a010cf0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f339c2da585]",
"/usr/lib64/ceph/libceph-common.so.2(+0x27974e) [0x7f339c2da74e]",
"(OSDMonitor::check_for_dead_crush_zones(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > const&, std::set<int, std::less<int>, std::allocator<int> >, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)+0x14fc) [0x558bc4beb40c]",
"(Monitor::maybe_go_degraded_stretch_mode()+0x20e) [0x558bc4a5e48e]",
"(Context::complete(int)+0xd) [0x558bc4a7aced]",
"(void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa5) [0x558bc4aa76d5]",
"(Paxos::finish_round()+0x27b) [0x558bc4b8b3cb]",
"(Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xe29) [0x558bc4b8c689]",
"(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x417) [0x558bc4b8d327]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1324) [0x558bc4a78524]",
"(Monitor::_ms_dispatch(Message*)+0x670) [0x558bc4a78fb0]",
"(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x558bc4aa839c]",
"(DispatchQueue::entry()+0x126a) [0x7f339c5238da]",
"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f339c5d6e21]",
"/lib64/libpthread.so.0(+0x81ca) [0x7f339a0061ca]",
"clone()"
]
}

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a RHCS cluster, add all hosts.

2. Deploy an additional mon with a location attribute that differs from those of the existing mons.
  # ceph mon dump
    epoch 12
    fsid 5#########
    last_changed 2023-07-21T08:53:34.795708+0000
    created 2023-07-21T08:42:01.326786+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon arbiter
    disallowed_leaders arbiter
    0: [v2:10.1.160.138:3300/0,v1:10.1.160.138:6789/0] mon.osd-0; crush_location {datacenter=zone-b}
    1: [v2:10.1.160.57:3300/0,v1:10.1.160.57:6789/0] mon.osd-1; crush_location {datacenter=zone-b}
    2: [v2:10.1.160.134:3300/0,v1:10.1.160.134:6789/0] mon.osd-3; crush_location {datacenter=zone-c}
    3: [v2:10.1.160.137:3300/0,v1:10.1.160.137:6789/0] mon.osd-4; crush_location {datacenter=zone-c}
    4: [v2:10.1.160.133:3300/0,v1:10.1.160.133:6789/0] mon.arbiter; crush_location {datacenter=zone-a}
    dumped monmap epoch 12
3. Deploy the additional mon daemon.
  # ceph mon add ceph-pdhiran-4xtsy5-node4 10.0.210.59 datacenter=Arbiter
    adding mon.ceph-pdhiran-4xtsy5-node4 at [v2:10.0.210.59:3300/0,v1:10.0.210.59:6789/0]
  # ceph mon dump
    epoch 13
    fsid 596c70b0-27a2-11ee-8aaa-0050568f090f
    last_changed 2023-07-24T09:33:16.262971+0000
    created 2023-07-21T08:42:01.326786+0000
    min_mon_release 16 (pacific)
    election_strategy: 3
    stretch_mode_enabled 1
    tiebreaker_mon arbiter
    disallowed_leaders arbiter
    0: [v2:10.1.160.138:3300/0,v1:10.1.160.138:6789/0] mon.osd-0; crush_location {datacenter=zone-b}
    1: [v2:10.1.160.57:3300/0,v1:10.1.160.57:6789/0] mon.osd-1; crush_location {datacenter=zone-b}
    2: [v2:10.1.160.134:3300/0,v1:10.1.160.134:6789/0] mon.osd-3; crush_location {datacenter=zone-c}
    3: [v2:10.1.160.137:3300/0,v1:10.1.160.137:6789/0] mon.osd-4; crush_location {datacenter=zone-c}
    4: [v2:10.1.160.133:3300/0,v1:10.1.160.133:6789/0] mon.arbiter; crush_location {datacenter=zone-a}
    5: [v2:10.0.210.59:3300/0,v1:10.0.210.59:6789/0] mon.ceph-pdhiran-4xtsy5-node4; crush_location {datacenter=Arbiter}
    dumped monmap epoch 13

4. Observe that the mon daemons begin to crash and go down one by one until all are down.

5. Commands fail with the following error:
  # ceph mon rm ceph-pdhiran-4xtsy5-node4
    2023-07-24T15:04:17.318+0530 7f5f8759e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]

Actual results:
The additional mon cannot be removed, and the cluster is down and unusable.

Expected results:
The additional mon can be removed, and no crashes occur when a mon is added with a wrong location attribute.
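
As an illustration of the kind of check that would catch this before the mons assert, here is a minimal Python sketch that parses `ceph mon dump` output in the format shown above and flags mons whose datacenter is not a known CRUSH bucket. It is not the fix proposed for this bug. The helper names (`parse_mon_locations`, `find_suspect_mons`) are hypothetical, and the set of known datacenters is an assumption: the report does not show the CRUSH map, so zone-a, zone-b and zone-c are taken from the existing mons' locations.

```python
import re

# Matches monmap lines such as:
#   5: [v2:...] mon.NAME; crush_location {datacenter=Arbiter}
MON_LINE = re.compile(r"mon\.(?P<name>\S+); crush_location \{(?P<loc>[^}]*)\}")


def parse_mon_locations(dump_text):
    """Return {mon_name: {bucket_type: bucket_name}} from `ceph mon dump` text."""
    locations = {}
    for line in dump_text.splitlines():
        match = MON_LINE.search(line)
        if not match:
            continue  # header lines such as "epoch 13" carry no location
        pairs = (
            item.split("=", 1)
            for item in match.group("loc").split(",")
            if "=" in item
        )
        locations[match.group("name")] = {k.strip(): v.strip() for k, v in pairs}
    return locations


def find_suspect_mons(locations, known_buckets, bucket_type="datacenter"):
    """List mons whose bucket of the given type is not a known bucket name."""
    return sorted(
        name
        for name, loc in locations.items()
        if loc.get(bucket_type) not in known_buckets
    )


# Lines copied from the epoch 13 monmap in this report (shortened to four mons).
SAMPLE_DUMP = """\
0: [v2:10.1.160.138:3300/0,v1:10.1.160.138:6789/0] mon.osd-0; crush_location {datacenter=zone-b}
2: [v2:10.1.160.134:3300/0,v1:10.1.160.134:6789/0] mon.osd-3; crush_location {datacenter=zone-c}
4: [v2:10.1.160.133:3300/0,v1:10.1.160.133:6789/0] mon.arbiter; crush_location {datacenter=zone-a}
5: [v2:10.0.210.59:3300/0,v1:10.0.210.59:6789/0] mon.ceph-pdhiran-4xtsy5-node4; crush_location {datacenter=Arbiter}
"""

# Assumption: these are the datacenter buckets present in the CRUSH map.
KNOWN_DATACENTERS = {"zone-a", "zone-b", "zone-c"}

if __name__ == "__main__":
    print(find_suspect_mons(parse_mon_locations(SAMPLE_DUMP), KNOWN_DATACENTERS))
```

With the sample lines above, only `ceph-pdhiran-4xtsy5-node4` is flagged, because `Arbiter` is not among the assumed datacenter names. In a real cluster the known bucket names would have to come from the CRUSH map itself rather than from a hard-coded set.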

Actions #1

Updated by Kamoltat (Junior) Sirivadhna 4 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 55103