Project

General

Profile

Actions

Bug #62355

closed

cephfs-mirror: do not run concurrent C_RestartMirroring context

Added by Dhairya Parmar 9 months ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs-mirror
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was majorly discussed in tracker https://tracker.ceph.com/issues/62072 and PR https://github.com/ceph/ceph/pull/52524, after digging deep the issue was much more than anticipated and kudos to venky for figuring it out and pushing the required patch

Quick summary:

this assertion fails:

ceph_assert(mirror_action.action_in_progress);

It's a bit hard to reproduce and involves performing some manual steps + running a qa test case using vstart_runner:

1) ./bin/cephfs-mirror --id mirror --cluster ceph -f --debug_cephfs-mirror=20
2) ./bin/ceph fs volume create a ; ./bin/ceph fs volume create b ; ./bin/ceph fs volume create c ; ./bin/ceph fs volume create d ; ./bin/ceph fs volume create e
3) ./bin/ceph mgr module enable mirroring 
4) ./bin/ceph fs snapshot mirror enable a ; ./bin/ceph fs snapshot mirror enable b; ./bin/ceph fs snapshot mirror enable c; ./bin/ceph fs snapshot mirror enable d; ./bin/ceph fs snapshot mirror enable e 
5) ./bin/ceph fs snapshot mirror peer_add a client.mirror_remote@ceph b; ./bin/ceph fs snapshot mirror peer_add c client.mirror_remote@ceph d
6) source ../../venv/bin/activate ; python ../qa/tasks/vstart_runner.py tasks.cephfs.test_mirroring.TestMirroring

and generates a back trace:

In function 'void cephfs::mirror::Mirror::handle_enable_mirroring(const cephfs::mirror::Filesystem&, const Peers&, int)' thread 7f48d02e86c0 time 2023-07-18T16:01:47.895508+0530
/home/dparmar/CephRepo0/ceph/src/tools/cephfs_mirror/Mirror.cc: 322: FAILED ceph_assert(mirror_action.action_in_progress)

 ceph version 18.0.0-4983-gfb339521a7f (fb339521a7f6895ed891b5b56de641e8085e0cab) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f48d59acb0f]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f48d59acd16]
 3: (cephfs::mirror::Mirror::handle_enable_mirroring(cephfs::mirror::Filesystem const&, std::set<Peer, std::less<Peer>, std::allocator<Peer> > const&, int)+0x168) [0x55a621b6e6fa]
 4: (cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring(int)+0x17) [0x55a621b770bd]
 5: (cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Mirror::C_RestartMirroring, &cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring>::finish(int)+0xd) [0x55a621b770d5]
 6: (Context::complete(int)+0x9) [0x55a621b71317]
 7: (ContextWQ::process(Context*)+0x69) [0x55a621b7550f]
 8: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0xa) [0x55a621b71496]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5a3) [0x7f48d59a0379]
 10: (ThreadPool::WorkThread::entry()+0x11) [0x7f48d59a2af3]
 11: (Thread::entry_wrapper()+0x3f) [0x7f48d5990229]
 12: (Thread::_entry_func(void*)+0x9) [0x7f48d5990241]
 13: /lib64/libc.so.6(+0x8b12d) [0x7f48d46ae12d]
 14: /lib64/libc.so.6(+0x10cbc0) [0x7f48d472fbc0]

this happens when there are failures in mirror daemon, in this case the class CephFSTestCase's delete_all_filesystems() is executed while running qa test case in 6th step; it is too rude for the filesystems and mirror daemon fails. This triggers concurrent C_RestartMirroring contexts which modify mirror_action.action_in_progress on the fly; and leads the assertion to fail and thus causes the crash. This is now fixed in the PR venky raised where running concurrent C_RestartMirroring contexts is disallowed.

Actions #1

Updated by Dhairya Parmar 9 months ago

  • Status changed from Fix Under Review to Closed
  • Assignee deleted (Venky Shankar)

closing since added the summary to the existing tracker(which wasnt meant to address C_RestartMirroring contexts but since most of the discussions took place there, makes no sense to create a new one, will modify that tracker and assign to venky)

Actions

Also available in: Atom PDF