Bug #52062 (closed)
cephfs-mirror: terminating a mirror daemon can cause a crash at times
Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Development
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Seen in this teuthology run, which thrashes the mirror daemon for the active/active HA test: https://pulpito.ceph.com/vshankar-2021-08-05_02:19:15-fs-wip-cephfs-mirror-ha-active-active-20210802-054956-distro-basic-smithi/
2021-08-05T02:39:35.991+0000 7fddebb3c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fddebb3c700 thread_name:msgr-worker-1

 ceph version 17.0.0-6593-gede67e63 (ede67e630d11e5f6758fa1e18b166b29d499c421) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7fddf032db20]
 2: (ProtocolV2::send_message(Message*)+0xa1) [0x7fddf14e0f37]
 3: (AsyncConnection::send_message(Message*)+0x813) [0x7fddf14ad0db]
 4: (Connection::send_message2(boost::intrusive_ptr<Message>)+0x1e) [0x7fddf14ade22]
 5: (MonClient::_send_mon_message(boost::intrusive_ptr<Message>)+0x8a) [0x7fddf1588568]
 6: (MonClient::_finish_hunting(int)+0x5f9) [0x7fddf1593eb5]
 7: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v15_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x2ce) [0x7fddf1595400]
 8: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x4b2) [0x7fddf14f15b4]
 9: (ProtocolV2::handle_frame_payload()+0x1f6) [0x7fddf1500130]
 10: (ProtocolV2::handle_read_frame_dispatch()+0x179) [0x7fddf15003cf]
 11: (ProtocolV2::_handle_read_frame_epilogue_main()+0xc2) [0x7fddf15005e8]
 12: (ProtocolV2::_handle_read_frame_segment()+0xa6) [0x7fddf1500938]
 13: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0xc7) [0x7fddf1501f13]
 14: (CtRxNode<ProtocolV2>::call(ProtocolV2*) const+0x31) [0x7fddf1502621]
 15: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3b) [0x7fddf14e7eaf]
 16: /usr/lib64/ceph/libceph-common.so.2(+0x65d495) [0x7fddf14e8495]
 17: (std::function<void (char*, long)>::operator()(char*, long) const+0x23) [0x7fddf14ae307]
 18: (AsyncConnection::process()+0xeb5) [0x7fddf14ac099]
 19: (C_handle_read::do_request(unsigned long)+0x16) [0x7fddf14aee24]
 20: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x594) [0x7fddf150a7a0]
 21: /usr/lib64/ceph/libceph-common.so.2(+0x68a4e7) [0x7fddf15154e7]
 22: (std::function<void ()>::operator()() const+0x12) [0x7fddf1513ba6]
 23: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x11) [0x7fddf1513bc1]
 24: /lib64/libstdc++.so.6(+0xc2ba3) [0x7fddef562ba3]
 25: /lib64/libpthread.so.0(+0x814a) [0x7fddf032314a]
 26: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
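Basically, there is a race between the mirror daemon shutting down and the daemon receiving an fs_map update. In the log below, the msgr-worker thread 7fddebb3c700 enters mirroring_disabled at 02:39:06.435 but logs "shutting down" only at 02:39:35.986, after ~Mirror has already run on the main thread (7fddf302af80); the segmentation fault then follows on that same worker thread at 02:39:35.991: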
2021-08-05T02:39:05.977+0000 7fddf302af80 10 cephfs::mirror::Mirror run: canceling timer task=0x55e35491d4e0
2021-08-05T02:39:05.977+0000 7fddf302af80 10 cephfs::mirror::Mirror run: trying to shutdown filesystem={fscid=2, fs_name=cephfs}
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown_peer_replayers
2021-08-05T02:39:05.977+0000 7fddf302af80 5 cephfs::mirror::FSMirror shutdown_peer_replayers: shutting down replayer for peer={uuid=3aeddb3f-3d31-4db5-9da0-aeed11538b3c, remote_cluster={client_name=client.mirror_remote, cluster_name=ceph, fs_name=backup_fs}}
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) shutdown
2021-08-05T02:39:05.977+0000 7fddcbafc700 5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.977+0000 7fddcb2fb700 5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.977+0000 7fddcc2fd700 5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown_mirror_watcher
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::MirrorWatcher shutdown
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::MirrorWatcher unregister_watcher
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::Watcher unregister_watch
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::MirrorWatcher handle_unregister_watcher: r=0
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::FSMirror handle_shutdown_mirror_watcher: r=0
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::FSMirror shutdown_instance_watcher
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher shutdown
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher unregister_watcher
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::Watcher unregister_watch
2021-08-05T02:39:05.983+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher handle_unregister_watcher: r=0
2021-08-05T02:39:05.983+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher remove_instance
2021-08-05T02:39:05.985+0000 7fdde0325700 20 cephfs::mirror::InstanceWatcher handle_remove_instance: r=0
2021-08-05T02:39:05.985+0000 7fdde9337700 20 cephfs::mirror::FSMirror handle_shutdown_instance_watcher: r=0
2021-08-05T02:39:05.985+0000 7fdde9337700 20 cephfs::mirror::FSMirror cleanup
2021-08-05T02:39:06.435+0000 7fddebb3c700 20 cephfs::mirror::ClusterWatcher handle_fsmap
2021-08-05T02:39:06.435+0000 7fddebb3c700 5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=2, fs_name=cephfs}]
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 remove_filesystem: fscid=2
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 schedule_update_status
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=2, fs_name=cephfs}
2021-08-05T02:39:07.435+0000 7fdde4b2e700 20 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 update_status: 0 filesystem(s)
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::Mirror run: shutdown filesystem={fscid=2, fs_name=cephfs}, r=0
2021-08-05T02:39:35.986+0000 7fddf302af80 20 cephfs::mirror::FSMirror ~FSMirror
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::Mirror ~Mirror
2021-08-05T02:39:35.986+0000 7fddebb3c700 5 cephfs::mirror::Mirror mirroring_disabledshutting down
2021-08-05T02:39:35.986+0000 7fddebb3c700 5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 ~ServiceDaemon
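For illustration only, here is a minimal C++ sketch of one common way to close this kind of shutdown-versus-callback race: the callback checks a stopping flag under a lock, and shutdown waits for any in-flight callback to drain before the object is destroyed. This is not the fix that was applied for this bug. MirrorSketch, handle_mirroring_disabled, disabled_count and usage_example are hypothetical names that do not come from the Ceph tree.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a daemon object that receives map updates on a
// messenger thread while the main thread may be shutting it down.
// None of these names are taken from the Ceph source.
class MirrorSketch {
public:
  // Intended to be called from the notification thread. Returns true if the
  // update was applied, false if it was dropped because shutdown has begun.
  bool handle_mirroring_disabled(int fscid) {
    std::unique_lock<std::mutex> locker(m_lock);
    if (m_stopping) {
      return false;  // too late: do not touch state that is being torn down
    }
    ++m_in_flight;   // make this callback visible to shutdown()
    locker.unlock();
    // Work that should not run under the lock would go here.
    locker.lock();
    m_disabled.push_back(fscid);
    if (--m_in_flight == 0) {
      m_cond.notify_all();  // wake a shutdown() that is waiting to drain
    }
    return true;
  }

  // Intended to be called from the main thread before the object is
  // destroyed. New callbacks are rejected and running ones are waited for.
  void shutdown() {
    std::unique_lock<std::mutex> locker(m_lock);
    m_stopping = true;
    m_cond.wait(locker, [this] { return m_in_flight == 0; });
  }

  std::size_t disabled_count() {
    std::lock_guard<std::mutex> locker(m_lock);
    return m_disabled.size();
  }

private:
  std::mutex m_lock;
  std::condition_variable m_cond;
  bool m_stopping = false;
  int m_in_flight = 0;
  std::vector<int> m_disabled;
};

// Single-threaded walk-through of the ordering seen in the log: an update
// that arrives before shutdown is applied, one that arrives after is dropped.
inline void usage_example() {
  MirrorSketch mirror;
  assert(mirror.handle_mirroring_disabled(2));   // before shutdown: applied
  mirror.shutdown();
  assert(!mirror.handle_mirroring_disabled(2));  // after shutdown: rejected
  assert(mirror.disabled_count() == 1);
}
```

In a real daemon the owner would call shutdown() and only then let the destructor run, so a late fs_map notification can no longer reach a half-destroyed object. Whether the actual fix took this shape is not stated in this report.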