Bug #62355
closedcephfs-mirror: do not run concurrent C_RestartMirroring context
0%
Description
This was majorly discussed in tracker https://tracker.ceph.com/issues/62072 and PR https://github.com/ceph/ceph/pull/52524, after digging deep the issue was much more than anticipated and kudos to venky for figuring it out and pushing the required patch
Quick summary:
this assertion fails:
ceph_assert(mirror_action.action_in_progress);
It's a bit hard to reproduce and involves performing some manual steps + running a qa test case using vstart_runner:
1) ./bin/cephfs-mirror --id mirror --cluster ceph -f --debug_cephfs-mirror=20
2) ./bin/ceph fs volume create a ; ./bin/ceph fs volume create b ; ./bin/ceph fs volume create c ; ./bin/ceph fs volume create d ; ./bin/ceph fs volume create e
3) ./bin/ceph mgr module enable mirroring
4) ./bin/ceph fs snapshot mirror enable a ; ./bin/ceph fs snapshot mirror enable b; ./bin/ceph fs snapshot mirror enable c; ./bin/ceph fs snapshot mirror enable d; ./bin/ceph fs snapshot mirror enable e
5) ./bin/ceph fs snapshot mirror peer_add a client.mirror_remote@ceph b; ./bin/ceph fs snapshot mirror peer_add c client.mirror_remote@ceph d
6) source ../../venv/bin/activate ; python ../qa/tasks/vstart_runner.py tasks.cephfs.test_mirroring.TestMirroring
and generates a back trace:
In function 'void cephfs::mirror::Mirror::handle_enable_mirroring(const cephfs::mirror::Filesystem&, const Peers&, int)' thread 7f48d02e86c0 time 2023-07-18T16:01:47.895508+0530
/home/dparmar/CephRepo0/ceph/src/tools/cephfs_mirror/Mirror.cc: 322: FAILED ceph_assert(mirror_action.action_in_progress)
ceph version 18.0.0-4983-gfb339521a7f (fb339521a7f6895ed891b5b56de641e8085e0cab) reef (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f48d59acb0f]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f48d59acd16]
3: (cephfs::mirror::Mirror::handle_enable_mirroring(cephfs::mirror::Filesystem const&, std::set<Peer, std::less<Peer>, std::allocator<Peer> > const&, int)+0x168) [0x55a621b6e6fa]
4: (cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring(int)+0x17) [0x55a621b770bd]
5: (cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Mirror::C_RestartMirroring, &cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring>::finish(int)+0xd) [0x55a621b770d5]
6: (Context::complete(int)+0x9) [0x55a621b71317]
7: (ContextWQ::process(Context*)+0x69) [0x55a621b7550f]
8: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0xa) [0x55a621b71496]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5a3) [0x7f48d59a0379]
10: (ThreadPool::WorkThread::entry()+0x11) [0x7f48d59a2af3]
11: (Thread::entry_wrapper()+0x3f) [0x7f48d5990229]
12: (Thread::_entry_func(void*)+0x9) [0x7f48d5990241]
13: /lib64/libc.so.6(+0x8b12d) [0x7f48d46ae12d]
14: /lib64/libc.so.6(+0x10cbc0) [0x7f48d472fbc0]
this happens when there are failures in mirror daemon, in this case the class CephFSTestCase's delete_all_filesystems() is executed while running qa test case in 6th step; it is too rude for the filesystems and mirror daemon fails. This triggers concurrent C_RestartMirroring contexts which modify mirror_action.action_in_progress on the fly; and leads the assertion to fail and thus causes the crash. This is now fixed in the PR venky raised where running concurrent C_RestartMirroring contexts is disallowed.
Updated by Dhairya Parmar 9 months ago
- Status changed from Fix Under Review to Closed
- Assignee deleted (
Venky Shankar)
closing since added the summary to the existing tracker(which wasnt meant to address C_RestartMirroring contexts but since most of the discussions took place there, makes no sense to create a new one, will modify that tracker and assign to venky)