Non-upgraded nautilus mons crash if upgraded pacific mon updates fsmap
I have no idea if this needs to be fixed but at least the case looks worth reporting.
We faced the issue when upgrading the cluster from nautilus 14.2.22 to pacific 16.2.8.
After the leader mon had been upgraded, an MDS server was accidentally stopped, and this caused the non-upgraded mons to crash while handling the fsmap update request:
2022-04-25 14:55:03.924 7f200400f700 -1 /home/abuild/rpmbuild/BUILD/ceph-14.2.22-445-ga68959d39a6/src/mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7f200400f700 time 2022-04-25 14:55:03.923549
/home/abuild/rpmbuild/BUILD/ceph-14.2.22-445-ga68959d39a6/src/mds/FSMap.cc: 755: FAILED ceph_assert(fs->mds_map.compat.compare(compat) == 0)
 ceph version 14.2.22-445-ga68959d39a6 (a68959d39a67faec1a7ace55e8c4327accc4a38c) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f20126efdb6]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f20126eff91]
 3: (FSMap::sanity() const+0xe0) [0x7f2012c0de20]
 4: (MDSMonitor::update_from_paxos(bool*)+0x488) [0x5633f9880a98]
 5: (PaxosService::refresh(bool*)+0x25a) [0x5633f97bd83a]
 6: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x5633f969ceac]
 7: (Paxos::do_refresh()+0x4f) [0x5633f97acb8f]
 8: (Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x132) [0x5633f97b21b2]
 9: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x2db) [0x5633f97b7ecb]
 10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1668) [0x5633f96d00b8]
 11: (Monitor::_ms_dispatch(Message*)+0xa3a) [0x5633f96d0b5a]
 12: (Monitor::ms_dispatch(Message*)+0x26) [0x5633f9701646]
 13: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x5633f96fe0b6]
 14: (DispatchQueue::entry()+0x1279) [0x7f201291d379]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f20129cda5d]
 16: (()+0x8539) [0x7f201154c539]
 17: (clone()+0x3f) [0x7f201071ccff]
It was rather unpleasant in our case because the upgraded mon was not able to form a quorum, and the cluster was inaccessible until the nautilus mons were upgraded manually.
#1 Updated by Mykola Golub 5 months ago
- Status changed from New to Won't Fix
I was just told that the upgrade documentation includes a step to set the mon_mds_skip_sanity parameter before the upgrade, which looks like it works around this issue. So I am closing this ticket.
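For reference, a minimal sketch of that step, assuming the standard `ceph config set` / `ceph config rm` commands are used to toggle the option on the mons. The function names and the `CEPH` override variable are hypothetical helpers for illustration only, and removing the option after the upgrade is my assumption; please check the upgrade documentation for your target release for the exact procedure.

```shell
# Hypothetical helpers; CEPH can be overridden (e.g. CEPH='echo ceph')
# to preview the commands without touching a cluster.

# Run before upgrading the first mon: disable the FSMap sanity check on mons.
skip_sanity_on() {
    ${CEPH:-ceph} config set mon mon_mds_skip_sanity true
}

# Assumed cleanup once every mon runs pacific: drop the override again.
skip_sanity_off() {
    ${CEPH:-ceph} config rm mon mon_mds_skip_sanity
}
```

Calling `skip_sanity_on` before the first mon is upgraded, and `skip_sanity_off` once all mons are on pacific, should keep the nautilus mons from hitting the FSMap::sanity() assert shown above.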