Bug #62072
Closed: cephfs-mirror: do not run concurrent C_RestartMirroring context
Description
If the cephfs-mirror daemon faces any issue connecting to the cluster, accessing the local pool, or mounting the file system, then there should be no mirroring action in progress. Asserting the code below directly in Mirror::handle_enable_mirroring() would fail in such scenarios:
ceph_assert(mirror_action.action_in_progress);
A good explanation by Venky is that when a file system is removed while the mirror daemon is still active on it, the watcher object gets deleted and, at times, the mirror daemon is blocked on a libcephfs call. This leads to issues and backtraces like the following:
In function 'void cephfs::mirror::Mirror::handle_enable_mirroring(const cephfs::mirror::Filesystem&, const Peers&, int)' thread 7f48d02e86c0 time 2023-07-18T16:01:47.895508+0530
/home/dparmar/CephRepo0/ceph/src/tools/cephfs_mirror/Mirror.cc: 322: FAILED ceph_assert(mirror_action.action_in_progress)
ceph version 18.0.0-4983-gfb339521a7f (fb339521a7f6895ed891b5b56de641e8085e0cab) reef (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f48d59acb0f]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f48d59acd16]
3: (cephfs::mirror::Mirror::handle_enable_mirroring(cephfs::mirror::Filesystem const&, std::set<Peer, std::less<Peer>, std::allocator<Peer> > const&, int)+0x168) [0x55a621b6e6fa]
4: (cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring(int)+0x17) [0x55a621b770bd]
5: (cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Mirror::C_RestartMirroring, &cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring>::finish(int)+0xd) [0x55a621b770d5]
6: (Context::complete(int)+0x9) [0x55a621b71317]
7: (ContextWQ::process(Context*)+0x69) [0x55a621b7550f]
8: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0xa) [0x55a621b71496]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5a3) [0x7f48d59a0379]
10: (ThreadPool::WorkThread::entry()+0x11) [0x7f48d59a2af3]
11: (Thread::entry_wrapper()+0x3f) [0x7f48d5990229]
12: (Thread::_entry_func(void*)+0x9) [0x7f48d5990241]
13: /lib64/libc.so.6(+0x8b12d) [0x7f48d46ae12d]
14: /lib64/libc.so.6(+0x10cbc0) [0x7f48d472fbc0]
I hit this while using vstart_runner to run tests locally while two FSs were already running with cephfs-mirror daemon.
Steps:
1) Run cephfs-mirror daemon in foreground
2) Use vstart cluster and create an additional fs; enable mirroring on them and set peer, in my case a->b.
3) Run any mirroring test and the above backtrace would be visible in the logs.
Updated by Venky Shankar 10 months ago
Dhairya Parmar wrote:
If the cephfs-mirror daemon faces any issue connecting to the cluster, accessing the local pool, or mounting the file system, then there should be no mirroring action in progress. Asserting the code below directly in Mirror::handle_enable_mirroring() would fail in such scenarios:
[...]
Are you sure that this crash is seen when the mirror daemon fails to connect to a cluster (or mount the file system)? Are you able to reproduce this with just fs_a -> fs_b replication (i.e., not using a fan-out setup)?
A good explanation by Venky is that when a file system is removed while the mirror daemon is still active on it, the watcher object gets deleted and, at times, the mirror daemon is blocked on a libcephfs call. This leads to issues and backtraces like the following:
No. The issue you quote above is a separate problem and is not related to this one. `action_in_progress` is set when there is a pending action (enable/disable mirroring). The cephfs-mirror daemon maintains a queue of action contexts per file system
std::list<Context *> action_ctxs;
and `action_in_progress` is set in Mirror::enable_mirroring() and Mirror::disable_mirroring(), signifying that a mirror action is in progress; this is checked in Mirror::update_fs_mirrors(), which completes contexts from `action_ctxs`.
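The bookkeeping described above can be sketched with a minimal, hypothetical model (simplified stand-ins, not the actual Ceph types or function signatures): a per-filesystem flag plus a queue of contexts, where dequeuing sets the flag and the completion handler asserts it:

```cpp
#include <cassert>
#include <functional>
#include <list>

// Simplified stand-in for Ceph's Context callbacks.
using Context = std::function<void(int)>;

// Hypothetical model of the per-filesystem mirror action state.
struct MirrorAction {
  bool action_in_progress = false;   // set when an enable/disable action starts
  std::list<Context> action_ctxs;    // queued enable/disable action contexts
};

// Mimics Mirror::update_fs_mirrors(): dispatch the next queued action
// only when no action is currently in progress.
void update_fs_mirrors(MirrorAction &ma) {
  if (!ma.action_in_progress && !ma.action_ctxs.empty()) {
    ma.action_in_progress = true;
    Context ctx = ma.action_ctxs.front();
    ma.action_ctxs.pop_front();
    ctx(0);
  }
}

// Mimics Mirror::handle_enable_mirroring(): the completion handler
// expects the flag to be set -- this is the assertion that failed.
void handle_enable_mirroring(MirrorAction &ma, int r) {
  assert(ma.action_in_progress);  // analogue of the FAILED ceph_assert
  ma.action_in_progress = false;
}
```

In this model the assertion only holds as long as nothing else flips `action_in_progress` between dispatch and completion, which is exactly the invariant the crash shows being violated.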
[...]
I hit this while using vstart_runner to run tests locally while two FSs were already running with cephfs-mirror daemon.
Steps:
1) Run cephfs-mirror daemon in foreground
2) Use vstart cluster and create an additional fs; enable mirroring on them and set peer, in my case a->b.
3) Run any mirroring test and the above backtrace would be visible in the logs.
I think the root of this crash is
std::map<Filesystem, MirrorAction> m_mirror_actions;
Since the key is just a `Filesystem`, when you run a fan-out setup the mirror daemon uses the same key (`Filesystem`) for which there is already an existing entry with pending actions.
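The key collision described above can be illustrated with a small, hypothetical sketch (the `Filesystem`/`MirrorAction` structs and `register_action` helper below are simplified stand-ins, not the actual Ceph code): in a fan-out setup (a -> b plus a -> c), both peer connections key the action map with the same `Filesystem`, so the second registration lands on an existing entry that may already carry pending actions:

```cpp
#include <map>
#include <string>

// Simplified stand-in for the map key; the real type carries more fields.
struct Filesystem {
  std::string fs_name;
  bool operator<(const Filesystem &o) const { return fs_name < o.fs_name; }
};

// Simplified stand-in for the mapped value.
struct MirrorAction {
  int pending_actions = 0;  // actions queued for this file system
};

// Hypothetical helper: registers a mirror action for a file system.
// Returns true if a fresh entry was created, false if an existing
// entry (possibly with pending actions) was reused.
bool register_action(std::map<Filesystem, MirrorAction> &actions,
                     const std::string &fs_name) {
  auto [it, inserted] = actions.insert({{fs_name}, MirrorAction{}});
  it->second.pending_actions++;
  return inserted;
}
```

With this model, registering peers a -> b and then a -> c yields a single map entry for "a" with two pending actions, matching the "same key, already existing entry" problem described above.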
Updated by Dhairya Parmar 10 months ago
Venky Shankar wrote:
Dhairya Parmar wrote:
If the cephfs-mirror daemon faces any issue connecting to the cluster, accessing the local pool, or mounting the file system, then there should be no mirroring action in progress. Asserting the code below directly in Mirror::handle_enable_mirroring() would fail in such scenarios:
[...]
Are you sure that this crash is seen when the mirror daemon fails to connect to a cluster (or mount the file system)? Are you able to reproduce this with just fs_a -> fs_b replication (i.e., not using a fan-out setup)?
Yes
A good explanation by Venky is that when a file system is removed while the mirror daemon is still active on it, the watcher object gets deleted and, at times, the mirror daemon is blocked on a libcephfs call. This leads to issues and backtraces like the following:
No. The issue you quote above is a separate problem and is not related to this one. `action_in_progress` is set when there is a pending action (enable/disable mirroring). The cephfs-mirror daemon maintains a queue of action contexts per file system
std::list<Context *> action_ctxs;
and `action_in_progress` is set in Mirror::enable_mirroring() and Mirror::disable_mirroring(), signifying that a mirror action is in progress; this is checked in Mirror::update_fs_mirrors(), which completes contexts from `action_ctxs`.
[...]
I hit this while using vstart_runner to run tests locally while two FSs were already running with cephfs-mirror daemon.
Steps:
1) Run cephfs-mirror daemon in foreground
2) Use vstart cluster and create an additional fs; enable mirroring on them and set peer, in my case a->b.
3) Run any mirroring test and the above backtrace would be visible in the logs.
I think the root of this crash is
std::map<Filesystem, MirrorAction> m_mirror_actions;
Since the key is just a `Filesystem`, when you run a fan-out setup the mirror daemon uses the same key (`Filesystem`) for which there is already an existing entry with pending actions.
I haven't run a fan-out setup locally yet.
Updated by Venky Shankar 10 months ago
- Pull request ID set to 52524
(discussion continued on the PR)
Updated by Dhairya Parmar 9 months ago
- Subject changed from cephfs-mirror: assert action_in_progress in Mirror::handle_enable_mirroring() only if initialisation was successful to cephfs-mirror: do not run concurrent C_RestartMirroring context
- Status changed from New to Fix Under Review
- Assignee changed from Dhairya Parmar to Venky Shankar
- Severity changed from 3 - minor to 2 - major
- Pull request ID changed from 52524 to 52747
- Labels (FS) crash added
Quick summary:
After digging deep, the issue turned out to be much bigger than anticipated; kudos to Venky for figuring it out and pushing the required patch.
This assertion fails:
ceph_assert(mirror_action.action_in_progress);
It's a bit hard to reproduce and involves performing some manual steps + running a qa test case using vstart_runner:
1) ./bin/cephfs-mirror --id mirror --cluster ceph -f --debug_cephfs-mirror=20
2) ./bin/ceph fs volume create a ; ./bin/ceph fs volume create b ; ./bin/ceph fs volume create c ; ./bin/ceph fs volume create d ; ./bin/ceph fs volume create e
3) ./bin/ceph mgr module enable mirroring
4) ./bin/ceph fs snapshot mirror enable a ; ./bin/ceph fs snapshot mirror enable b; ./bin/ceph fs snapshot mirror enable c; ./bin/ceph fs snapshot mirror enable d; ./bin/ceph fs snapshot mirror enable e
5) ./bin/ceph fs snapshot mirror peer_add a client.mirror_remote@ceph b; ./bin/ceph fs snapshot mirror peer_add c client.mirror_remote@ceph d
6) source ../../venv/bin/activate ; python ../qa/tasks/vstart_runner.py tasks.cephfs.test_mirroring.TestMirroring
and generates a backtrace:
In function 'void cephfs::mirror::Mirror::handle_enable_mirroring(const cephfs::mirror::Filesystem&, const Peers&, int)' thread 7f48d02e86c0 time 2023-07-18T16:01:47.895508+0530
/home/dparmar/CephRepo0/ceph/src/tools/cephfs_mirror/Mirror.cc: 322: FAILED ceph_assert(mirror_action.action_in_progress)
ceph version 18.0.0-4983-gfb339521a7f (fb339521a7f6895ed891b5b56de641e8085e0cab) reef (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f48d59acb0f]
2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f48d59acd16]
3: (cephfs::mirror::Mirror::handle_enable_mirroring(cephfs::mirror::Filesystem const&, std::set<Peer, std::less<Peer>, std::allocator<Peer> > const&, int)+0x168) [0x55a621b6e6fa]
4: (cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring(int)+0x17) [0x55a621b770bd]
5: (cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Mirror::C_RestartMirroring, &cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring>::finish(int)+0xd) [0x55a621b770d5]
6: (Context::complete(int)+0x9) [0x55a621b71317]
7: (ContextWQ::process(Context*)+0x69) [0x55a621b7550f]
8: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0xa) [0x55a621b71496]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5a3) [0x7f48d59a0379]
10: (ThreadPool::WorkThread::entry()+0x11) [0x7f48d59a2af3]
11: (Thread::entry_wrapper()+0x3f) [0x7f48d5990229]
12: (Thread::_entry_func(void*)+0x9) [0x7f48d5990241]
13: /lib64/libc.so.6(+0x8b12d) [0x7f48d46ae12d]
14: /lib64/libc.so.6(+0x10cbc0) [0x7f48d472fbc0]
This happens when there are failures in the mirror daemon; in this case, CephFSTestCase's delete_all_filesystems() is executed while running the qa test case in step 6. It tears the file systems down abruptly and the mirror daemon fails. This triggers concurrent C_RestartMirroring contexts, which modify mirror_action.action_in_progress on the fly, causing the assertion to fail and the daemon to crash. This is now fixed in the PR Venky raised, which disallows running concurrent C_RestartMirroring contexts.
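The shape of such a fix can be sketched with a minimal, hypothetical guard (illustrative only; the actual PR's implementation and names may differ): before dispatching a C_RestartMirroring context, check whether one is already running and refuse to start another one concurrently:

```cpp
#include <cassert>

// Hypothetical sketch: a flag that serializes restart contexts so two
// C_RestartMirroring instances can never run at the same time.
struct Mirror {
  bool restart_in_progress = false;  // true while a restart context is active
  int restarts_run = 0;              // how many restarts were dispatched

  // Returns true if the restart context was dispatched; false if one
  // is already in flight (the caller would queue or retry instead).
  bool try_restart_mirroring() {
    if (restart_in_progress) {
      return false;  // disallow a concurrent C_RestartMirroring context
    }
    restart_in_progress = true;
    ++restarts_run;
    return true;
  }

  // Called when the restart context completes, allowing the next one.
  void handle_restart_done() {
    restart_in_progress = false;
  }
};
```

With this guard, a second restart request arriving while one is in flight is rejected rather than racing on `action_in_progress`, so the completion-side assertion can no longer observe a flag flipped by a concurrent context.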
Updated by Venky Shankar 7 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to reef,quincy,pacific
Updated by Backport Bot 7 months ago
- Copied to Backport #62948: quincy: cephfs-mirror: do not run concurrent C_RestartMirroring context added
Updated by Backport Bot 7 months ago
- Copied to Backport #62949: pacific: cephfs-mirror: do not run concurrent C_RestartMirroring context added
Updated by Backport Bot 7 months ago
- Copied to Backport #62950: reef: cephfs-mirror: do not run concurrent C_RestartMirroring context added
Updated by Konstantin Shalygin 6 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100