Bug #63861
[stretch_mode] All MONs crash if we allow user to input same Datacenter name and host name.
Description
When mons are added to a stretch cluster manually and a mon daemon is added with a wrong location attribute (a site name that is the same as one of the host names), all the mons in the cluster crash repeatedly and the cluster becomes unresponsive.
Crash message:
{
"crash_id": "2023-07-24T09:33:26.393629Z_b09fde97-3607-40f6-9641-218b53336fe3",
"timestamp": "2023-07-24T09:33:26.393629Z",
"process_name": "ceph-mon",
"entity_name": "mon.osd-1",
"ceph_version": "16.2.10-187.el8cp",
"utsname_hostname": "osd-1",
"utsname_sysname": "Linux",
"utsname_release": "4.18.0-477.15.1.el8_8.x86_64",
"utsname_version": "#1 SMP Fri Jun 2 08:27:19 EDT 2023",
"utsname_machine": "x86_64",
"os_name": "Red Hat Enterprise Linux",
"os_id": "rhel",
"os_version_id": "8.8",
"os_version": "8.8 (Ootpa)",
"assert_condition": "osdmap.crush->name_exists(bucket_name)",
"assert_func": "bool OSDMonitor::check_for_dead_crush_zones(const std::map<std::__cxx11::basic_string<char>, std::set<std::__cxx11::basic_string<char> > >&, std::set<int>*, std::set<std::__cxx11::basic_string<char> >)",
"assert_file": "/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc",
"assert_line": 14689,
"assert_thread_name": "ms_dispatch",
"assert_msg": "/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc: In function 'bool OSDMonitor::check_for_dead_crush_zones(const std::map<std::__cxx11::basic_string<char>, std::set<std::__cxx11::basic_string<char> > >&, std::set<int>*, std::set<std::__cxx11::basic_string<char> >)' thread 7f338e775700 time 2023-07-24T09:33:26.365352+0000\n/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc: 14689: FAILED ceph_assert(osdmap.crush->name_exists(bucket_name))\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12cf0) [0x7f339a010cf0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f339c2da585]",
"/usr/lib64/ceph/libceph-common.so.2(+0x27974e) [0x7f339c2da74e]",
"(OSDMonitor::check_for_dead_crush_zones(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > const&, std::set<int, std::less<int>, std::allocator<int> >, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)+0x14fc) [0x558bc4beb40c]",
"(Monitor::maybe_go_degraded_stretch_mode()+0x20e) [0x558bc4a5e48e]",
"(Context::complete(int)+0xd) [0x558bc4a7aced]",
"(void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa5) [0x558bc4aa76d5]",
"(Paxos::finish_round()+0x27b) [0x558bc4b8b3cb]",
"(Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xe29) [0x558bc4b8c689]",
"(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x417) [0x558bc4b8d327]",
"(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1324) [0x558bc4a78524]",
"(Monitor::_ms_dispatch(Message*)+0x670) [0x558bc4a78fb0]",
"(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x558bc4aa839c]",
"(DispatchQueue::entry()+0x126a) [0x7f339c5238da]",
"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f339c5d6e21]",
"/lib64/libpthread.so.0(+0x81ca) [0x7f339a0061ca]",
"clone()"
]
}
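For triage, the fields that matter in the report above are the failed condition, the source location, and the daemon that hit it. A minimal sketch of pulling those out with Python's standard json module follows. The helper name summarize_crash is hypothetical, and the sample is a trimmed copy of the report above, not a new crash.

```python
import json

def summarize_crash(report_text):
    """Return the assertion-related fields of a Ceph crash report as a dict."""
    report = json.loads(report_text)
    return {
        "daemon": report.get("entity_name"),
        "version": report.get("ceph_version"),
        "condition": report.get("assert_condition"),
        # Join file and line so the result can be pasted into an editor.
        "location": "{}:{}".format(report.get("assert_file"), report.get("assert_line")),
        "thread": report.get("assert_thread_name"),
    }

# Trimmed copy of the report above (only the fields used here).
sample = """
{
  "entity_name": "mon.osd-1",
  "ceph_version": "16.2.10-187.el8cp",
  "assert_condition": "osdmap.crush->name_exists(bucket_name)",
  "assert_file": "/builddir/build/BUILD/ceph-16.2.10/src/mon/OSDMonitor.cc",
  "assert_line": 14689,
  "assert_thread_name": "ms_dispatch"
}
"""

summary = summarize_crash(sample)
print(summary["daemon"], "failed", summary["condition"], "at", summary["location"])
```

The failed condition shows that the mon looked up a bucket name that the CRUSH map does not contain.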
How reproducible:
1/1
Steps to Reproduce:
1. Deploy a RHCS cluster, add all hosts.
- ceph mon dump
epoch 12
fsid 5#########
last_changed 2023-07-21T08:53:34.795708+0000
created 2023-07-21T08:42:01.326786+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon arbiter
disallowed_leaders arbiter
0: [v2:10.1.160.138:3300/0,v1:10.1.160.138:6789/0] mon.osd-0; crush_location {datacenter=zone-b}
1: [v2:10.1.160.57:3300/0,v1:10.1.160.57:6789/0] mon.osd-1; crush_location {datacenter=zone-b}
2: [v2:10.1.160.134:3300/0,v1:10.1.160.134:6789/0] mon.osd-3; crush_location {datacenter=zone-c}
3: [v2:10.1.160.137:3300/0,v1:10.1.160.137:6789/0] mon.osd-4; crush_location {datacenter=zone-c}
4: [v2:10.1.160.133:3300/0,v1:10.1.160.133:6789/0] mon.arbiter; crush_location {datacenter=zone-a}
dumped monmap epoch 12
2. Add the new mon with a wrong location attribute:
- ceph mon add ceph-pdhiran-4xtsy5-node4 10.0.210.59 datacenter=Arbiter
adding mon.ceph-pdhiran-4xtsy5-node4 at [v2:10.0.210.59:3300/0,v1:10.0.210.59:6789/0]
3. Check the monmap again:
- ceph mon dump
epoch 13
fsid 596c70b0-27a2-11ee-8aaa-0050568f090f
last_changed 2023-07-24T09:33:16.262971+0000
created 2023-07-21T08:42:01.326786+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon arbiter
disallowed_leaders arbiter
0: [v2:10.1.160.138:3300/0,v1:10.1.160.138:6789/0] mon.osd-0; crush_location {datacenter=zone-b}
1: [v2:10.1.160.57:3300/0,v1:10.1.160.57:6789/0] mon.osd-1; crush_location {datacenter=zone-b}
2: [v2:10.1.160.134:3300/0,v1:10.1.160.134:6789/0] mon.osd-3; crush_location {datacenter=zone-c}
3: [v2:10.1.160.137:3300/0,v1:10.1.160.137:6789/0] mon.osd-4; crush_location {datacenter=zone-c}
4: [v2:10.1.160.133:3300/0,v1:10.1.160.133:6789/0] mon.arbiter; crush_location {datacenter=zone-a}
5: [v2:10.0.210.59:3300/0,v1:10.0.210.59:6789/0] mon.ceph-pdhiran-4xtsy5-node4; crush_location {datacenter=Arbiter}
dumped monmap epoch 13
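The bad entry can be spotted mechanically in the dump above. The sketch below parses the monmap lines and reports datacenter names that collide with a mon name. Two assumptions are made here: mon names stand in for host names, and the comparison ignores case (the dump shows datacenter=Arbiter next to mon.arbiter). The helper names are hypothetical and this is not part of any Ceph tool.

```python
import re

# Matches monmap lines such as:
#   4: [v2:...] mon.arbiter; crush_location {datacenter=zone-a}
MON_LINE = re.compile(
    r"mon\.(?P<name>[^;\s]+); crush_location \{datacenter=(?P<dc>[^}]+)\}"
)

def parse_monmap(dump_text):
    """Return a list of (mon_name, datacenter) pairs found in `ceph mon dump` output."""
    return [(m.group("name"), m.group("dc")) for m in MON_LINE.finditer(dump_text)]

def colliding_datacenters(mons):
    """Return datacenter names that match a mon name, ignoring case (assumption)."""
    mon_names = {name.lower() for name, _ in mons}
    return sorted({dc for _, dc in mons if dc.lower() in mon_names})

# Two lines copied from the epoch 13 dump above.
dump = """\
4: [v2:10.1.160.133:3300/0,v1:10.1.160.133:6789/0] mon.arbiter; crush_location {datacenter=zone-a}
5: [v2:10.0.210.59:3300/0,v1:10.0.210.59:6789/0] mon.ceph-pdhiran-4xtsy5-node4; crush_location {datacenter=Arbiter}
"""

mons = parse_monmap(dump)
print(colliding_datacenters(mons))  # ['Arbiter']
```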
4. Observe that the mon daemons begin to crash and go down one by one.
5. Commands now fail with an error, for example:
- ceph mon rm ceph-pdhiran-4xtsy5-node4
2023-07-24T15:04:17.318+0530 7f5f8759e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
Actual results:
The additional mon cannot be removed, and the cluster is down and unusable.
Expected results:
The additional mon can be removed, and no crashes are observed when a mon is added with a wrong location attribute.
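One way to get the expected behaviour is to reject the bad location when the mon is added, instead of asserting on it later. The sketch below only illustrates that idea in Python. It is not the change in the linked pull request, the function name is hypothetical, and the CRUSH contents are made up for the example because the report does not include the CRUSH tree. Names are compared as exact strings.

```python
def validate_mon_location(location, crush_buckets):
    """Return an error string if the location names an unknown or wrongly typed
    CRUSH bucket, else None.

    location:      e.g. {"datacenter": "zone-a"}
    crush_buckets: bucket name -> bucket type (hypothetical view of the CRUSH map)
    """
    for bucket_type, bucket_name in location.items():
        known_type = crush_buckets.get(bucket_name)
        if known_type is None:
            # Same condition the crash asserts on, reported as an error instead.
            return "{}={} does not exist in the CRUSH map".format(bucket_type, bucket_name)
        if known_type != bucket_type:
            return "{} is a {} bucket, not a {}".format(bucket_name, known_type, bucket_type)
    return None

# Hypothetical CRUSH contents for illustration only.
crush_buckets = {
    "zone-a": "datacenter",
    "zone-b": "datacenter",
    "zone-c": "datacenter",
    "arbiter": "host",
}

print(validate_mon_location({"datacenter": "zone-a"}, crush_buckets))
print(validate_mon_location({"datacenter": "Arbiter"}, crush_buckets))
print(validate_mon_location({"datacenter": "arbiter"}, crush_buckets))
```

With a check like this, the ceph mon add command from the reproduction steps would return an error to the user and the monmap would stay unchanged.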
Updated by Kamoltat (Junior) Sirivadhna 4 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 55103