Bug #52062

closed

cephfs-mirror: terminating a mirror daemon can cause a crash at times

Added by Venky Shankar over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Development
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen in this teuthology run, which thrashes the mirror daemon for the active/active HA test: https://pulpito.ceph.com/vshankar-2021-08-05_02:19:15-fs-wip-cephfs-mirror-ha-active-active-20210802-054956-distro-basic-smithi/

2021-08-05T02:39:35.991+0000 7fddebb3c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fddebb3c700 thread_name:msgr-worker-1

 ceph version 17.0.0-6593-gede67e63 (ede67e630d11e5f6758fa1e18b166b29d499c421) quincy (dev)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7fddf032db20]
 2: (ProtocolV2::send_message(Message*)+0xa1) [0x7fddf14e0f37]
 3: (AsyncConnection::send_message(Message*)+0x813) [0x7fddf14ad0db]
 4: (Connection::send_message2(boost::intrusive_ptr<Message>)+0x1e) [0x7fddf14ade22]
 5: (MonClient::_send_mon_message(boost::intrusive_ptr<Message>)+0x8a) [0x7fddf1588568]
 6: (MonClient::_finish_hunting(int)+0x5f9) [0x7fddf1593eb5]
 7: (MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v15_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x2ce) [0x7fddf1595400]
 8: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x4b2) [0x7fddf14f15b4]
 9: (ProtocolV2::handle_frame_payload()+0x1f6) [0x7fddf1500130]
 10: (ProtocolV2::handle_read_frame_dispatch()+0x179) [0x7fddf15003cf]
 11: (ProtocolV2::_handle_read_frame_epilogue_main()+0xc2) [0x7fddf15005e8]
 12: (ProtocolV2::_handle_read_frame_segment()+0xa6) [0x7fddf1500938]
 13: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0xc7) [0x7fddf1501f13]
 14: (CtRxNode<ProtocolV2>::call(ProtocolV2*) const+0x31) [0x7fddf1502621]
 15: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3b) [0x7fddf14e7eaf]
 16: /usr/lib64/ceph/libceph-common.so.2(+0x65d495) [0x7fddf14e8495]
 17: (std::function<void (char*, long)>::operator()(char*, long) const+0x23) [0x7fddf14ae307]
 18: (AsyncConnection::process()+0xeb5) [0x7fddf14ac099]
 19: (C_handle_read::do_request(unsigned long)+0x16) [0x7fddf14aee24]
 20: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x594) [0x7fddf150a7a0]
 21: /usr/lib64/ceph/libceph-common.so.2(+0x68a4e7) [0x7fddf15154e7]
 22: (std::function<void ()>::operator()() const+0x12) [0x7fddf1513ba6]
 23: (std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run()+0x11) [0x7fddf1513bc1]
 24: /lib64/libstdc++.so.6(+0xc2ba3) [0x7fddef562ba3]
 25: /lib64/libpthread.so.0(+0x814a) [0x7fddf032314a]
 26: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Basically, there is a race between the mirror daemon shutting down and the mirror daemon receiving an fs_map update:

2021-08-05T02:39:05.977+0000 7fddf302af80 10 cephfs::mirror::Mirror run: canceling timer task=0x55e35491d4e0
2021-08-05T02:39:05.977+0000 7fddf302af80 10 cephfs::mirror::Mirror run: trying to shutdown filesystem={fscid=2, fs_name=cephfs}
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown_peer_replayers
2021-08-05T02:39:05.977+0000 7fddf302af80  5 cephfs::mirror::FSMirror shutdown_peer_replayers: shutting down replayer for peer={uuid=3aeddb3f-3d31-4db5-9da0-aeed11538b3c, remote_cluster={client_name=client.mirror_remote, cluster_name=ceph, fs_name=backup_fs}}
2021-08-05T02:39:05.977+0000 7fddf302af80 20 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) shutdown
2021-08-05T02:39:05.977+0000 7fddcbafc700  5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.977+0000 7fddcb2fb700  5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.977+0000 7fddcc2fd700  5 cephfs::mirror::PeerReplayer(3aeddb3f-3d31-4db5-9da0-aeed11538b3c) run: exiting
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::FSMirror shutdown_mirror_watcher
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::MirrorWatcher shutdown
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::MirrorWatcher unregister_watcher
2021-08-05T02:39:05.980+0000 7fddf302af80 20 cephfs::mirror::Watcher unregister_watch
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::MirrorWatcher handle_unregister_watcher: r=0
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::FSMirror handle_shutdown_mirror_watcher: r=0
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::FSMirror shutdown_instance_watcher
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher shutdown
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher unregister_watcher
2021-08-05T02:39:05.982+0000 7fdde9337700 20 cephfs::mirror::Watcher unregister_watch
2021-08-05T02:39:05.983+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher handle_unregister_watcher: r=0
2021-08-05T02:39:05.983+0000 7fdde9337700 20 cephfs::mirror::InstanceWatcher remove_instance
2021-08-05T02:39:05.985+0000 7fdde0325700 20 cephfs::mirror::InstanceWatcher handle_remove_instance: r=0
2021-08-05T02:39:05.985+0000 7fdde9337700 20 cephfs::mirror::FSMirror handle_shutdown_instance_watcher: r=0
2021-08-05T02:39:05.985+0000 7fdde9337700 20 cephfs::mirror::FSMirror cleanup
2021-08-05T02:39:06.435+0000 7fddebb3c700 20 cephfs::mirror::ClusterWatcher handle_fsmap
2021-08-05T02:39:06.435+0000 7fddebb3c700  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=2, fs_name=cephfs}]
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 remove_filesystem: fscid=2
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 schedule_update_status
2021-08-05T02:39:06.435+0000 7fddebb3c700 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=2, fs_name=cephfs}
2021-08-05T02:39:07.435+0000 7fdde4b2e700 20 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 update_status: 0 filesystem(s)
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::Mirror run: shutdown filesystem={fscid=2, fs_name=cephfs}, r=0
2021-08-05T02:39:35.986+0000 7fddf302af80 20 cephfs::mirror::FSMirror ~FSMirror
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::Mirror ~Mirror
2021-08-05T02:39:35.986+0000 7fddebb3c700  5 cephfs::mirror::Mirror mirroring_disabledshutting down
2021-08-05T02:39:35.986+0000 7fddebb3c700  5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
2021-08-05T02:39:35.986+0000 7fddf302af80 10 cephfs::mirror::ServiceDaemon: 0x55e35494c4e0 ~ServiceDaemon
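
In the log above, the main thread (7fddf302af80) runs ~FSMirror, ~Mirror and ~ServiceDaemon while the messenger thread (7fddebb3c700) is still inside ClusterWatcher handle_fsmap, and the segfault follows on that same messenger thread a few milliseconds later. A minimal C++ sketch of one general way to close this kind of race is given here: the watcher refuses to deliver updates once shutdown has started, and shutdown waits for any in-flight callback because both paths take the same lock. This is an illustration only and is not claimed to be the change made in the pull request; MapListener, MapWatcher, dispatch() and shutdown() are hypothetical names, not Ceph APIs.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the object that consumes map updates
// (the role Mirror plays in the log). Not a Ceph class.
class MapListener {
public:
  void handle_update(int epoch) { seen.push_back(epoch); }
  std::vector<int> seen;  // epochs that were actually delivered
};

// Hypothetical stand-in for the object that receives map updates from
// the messenger thread (the role ClusterWatcher plays in the log).
class MapWatcher {
public:
  explicit MapWatcher(MapListener* l) : listener(l) {}

  // Called on the messenger thread when a new map arrives.
  // Returns true if the update was delivered to the listener.
  bool dispatch(int epoch) {
    std::lock_guard<std::mutex> lock(mtx);
    if (stopping) {
      return false;  // shutdown has begun: drop the update
    }
    listener->handle_update(epoch);  // runs with the lock held
    return true;
  }

  // Called on the shutdown path BEFORE the listener is destroyed.
  // Blocks until any in-flight dispatch() has returned, then makes
  // every later dispatch() a no-op.
  void shutdown() {
    std::lock_guard<std::mutex> lock(mtx);
    stopping = true;
    listener = nullptr;
  }

private:
  std::mutex mtx;
  bool stopping = false;
  MapListener* listener;
};
```

With this ordering the owner calls shutdown() on the watcher first and only then destroys the listener, so a late map update is dropped instead of reaching a destroyed object. Holding the lock across the callback is the simplest choice for a sketch; it would need care in real code if the callback itself can block on the shutdown path.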

Related issues 1 (0 open, 1 closed)

Copied to CephFS - Backport #52444: pacific: cephfs-mirror: terminating a mirror daemon can cause a crash at times (Resolved, Venky Shankar)
#1

Updated by Patrick Donnelly over 2 years ago

  • Target version set to v17.0.0
#2

Updated by Venky Shankar over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 42751
#3

Updated by Patrick Donnelly over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Source set to Development
#4

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52444: pacific: cephfs-mirror: terminating a mirror daemon can cause a crash at times added
#5

Updated by Loïc Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
