Bug #62072


cephfs-mirror: do not run concurrent C_RestartMirroring context

Added by Dhairya Parmar 10 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

100%

Source:
Development
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs-mirror
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If the cephfs-mirror daemon fails to connect to the cluster, or hits an error accessing the local pool or mounting the file system, then no mirroring action should be in progress. Asserting the code below directly in Mirror::handle_enable_mirroring() would therefore fail in such scenarios:

ceph_assert(mirror_action.action_in_progress);

A good explanation from Venky is that when a file system is removed while the mirror daemon is still active on it, the watcher object gets deleted and, at times, the mirror daemon is blocked on a libcephfs call, leading to issues.

This results in backtraces like the following:

In function 'void cephfs::mirror::Mirror::handle_enable_mirroring(const cephfs::mirror::Filesystem&, const Peers&, int)' thread 7f48d02e86c0 time 2023-07-18T16:01:47.895508+0530
/home/dparmar/CephRepo0/ceph/src/tools/cephfs_mirror/Mirror.cc: 322: FAILED ceph_assert(mirror_action.action_in_progress)

 ceph version 18.0.0-4983-gfb339521a7f (fb339521a7f6895ed891b5b56de641e8085e0cab) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f48d59acb0f]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f48d59acd16]
 3: (cephfs::mirror::Mirror::handle_enable_mirroring(cephfs::mirror::Filesystem const&, std::set<Peer, std::less<Peer>, std::allocator<Peer> > const&, int)+0x168) [0x55a621b6e6fa]
 4: (cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring(int)+0x17) [0x55a621b770bd]
 5: (cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Mirror::C_RestartMirroring, &cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring>::finish(int)+0xd) [0x55a621b770d5]
 6: (Context::complete(int)+0x9) [0x55a621b71317]
 7: (ContextWQ::process(Context*)+0x69) [0x55a621b7550f]
 8: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0xa) [0x55a621b71496]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5a3) [0x7f48d59a0379]
 10: (ThreadPool::WorkThread::entry()+0x11) [0x7f48d59a2af3]
 11: (Thread::entry_wrapper()+0x3f) [0x7f48d5990229]
 12: (Thread::_entry_func(void*)+0x9) [0x7f48d5990241]
 13: /lib64/libc.so.6(+0x8b12d) [0x7f48d46ae12d]
 14: /lib64/libc.so.6(+0x10cbc0) [0x7f48d472fbc0]
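
For reference, the failing check sits at the top of the enable-mirroring completion handler, roughly along these lines (a simplified sketch reconstructed from the assert and the backtrace above, not verbatim Mirror.cc; the locking detail is an assumption):

// simplified sketch, not verbatim source
void Mirror::handle_enable_mirroring(const Filesystem &filesystem,
                                      const Peers &peers, int r) {
  std::scoped_lock locker(m_lock);
  auto &mirror_action = m_mirror_actions.at(filesystem);
  // fails if enable_mirroring() never marked an action as in progress,
  // e.g. after a failed cluster connect, pool access or fs mount
  ceph_assert(mirror_action.action_in_progress);
  // ... start mirroring to the given peers and clear action_in_progress ...
}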

I hit this while using vstart_runner to run tests locally, with two file systems already running under the cephfs-mirror daemon.

Steps:
1) Run the cephfs-mirror daemon in the foreground.
2) Use a vstart cluster and create an additional fs; enable mirroring on them and set a peer, in my case a->b.
3) Run any mirroring test and the above backtrace will show up in the logs.


Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #62948: quincy: cephfs-mirror: do not run concurrent C_RestartMirroring context (Resolved, Venky Shankar)
Copied to CephFS - Backport #62949: pacific: cephfs-mirror: do not run concurrent C_RestartMirroring context (Resolved, Venky Shankar)
Copied to CephFS - Backport #62950: reef: cephfs-mirror: do not run concurrent C_RestartMirroring context (Resolved, Venky Shankar)
Actions #1

Updated by Venky Shankar 10 months ago

Dhairya Parmar wrote:

If the cephfs-mirror daemon fails to connect to the cluster, or hits an error accessing the local pool or mounting the file system, then no mirroring action should be in progress. Asserting the code below directly in Mirror::handle_enable_mirroring() would therefore fail in such scenarios:
[...]

Are you sure that this crash is seen when the mirror daemon fails to connect to a cluster (or mount the file system)? Are you able to reproduce this with just fs_a -> fs_b replication (i.e., not using a fan-out setup)?

A good explanation from Venky is that when a file system is removed while the mirror daemon is still active on it, the watcher object gets deleted and, at times, the mirror daemon is blocked on a libcephfs call, leading to issues.

This results in backtraces like the following:

No. The issue you quote above is a separate problem and is not related to this. `action_in_progress` is set when there is a pending action (enable/disable mirroring). The cephfs-mirror daemon maintains a queue of action contexts per file system:

std::list<Context *> action_ctxs;

and `action_in_progress` is set in Mirror::enable_mirroring() and Mirror::disable_mirroring(), signifying that a mirror action is in progress; this is checked in Mirror::update_fs_mirrors(), which completes contexts from `action_ctxs`.
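
Roughly, the bookkeeping looks like this (an illustrative sketch of the mechanism described above, not the actual source; the struct layout and loop are assumptions):

// illustrative sketch of the per-filesystem action bookkeeping (locking omitted)
struct MirrorAction {
  bool action_in_progress = false;    // set by enable_mirroring()/disable_mirroring()
  std::list<Context *> action_ctxs;   // queued enable/disable action contexts
};

std::map<Filesystem, MirrorAction> m_mirror_actions;

void Mirror::update_fs_mirrors() {
  for (auto &[filesystem, mirror_action] : m_mirror_actions) {
    // dispatch the next queued action only when none is currently in flight
    if (!mirror_action.action_in_progress && !mirror_action.action_ctxs.empty()) {
      Context *ctx = mirror_action.action_ctxs.front();
      mirror_action.action_ctxs.pop_front();
      ctx->complete(0);
    }
  }
}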

[...]

I hit this while using vstart_runner to run tests locally, with two file systems already running under the cephfs-mirror daemon.

Steps:
1) Run the cephfs-mirror daemon in the foreground.
2) Use a vstart cluster and create an additional fs; enable mirroring on them and set a peer, in my case a->b.
3) Run any mirroring test and the above backtrace will show up in the logs.

I think the root of this crash is

std::map<Filesystem, MirrorAction> m_mirror_actions;

Since the key is just a `Filesystem`, when you run a fan-out setup the mirror daemon uses the same key (`Filesystem`) for which there is an already existing entry and pending actions.
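
To illustrate the suspected collision (a sketch only, not actual code; `filesystem_a` is a placeholder): with a fan-out layout such as a -> b and a -> c, both pending actions resolve to the same map entry, so they share one `action_in_progress` flag and one `action_ctxs` queue:

// sketch: two peers for the same source fs share a single MirrorAction entry
auto &shared_action = m_mirror_actions[filesystem_a];  // used for a -> b
// a second peer_add for "a" (a -> c) lands on the very same entry,
// so its pending action can observe/clobber the other's in-progress state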

Actions #2

Updated by Dhairya Parmar 10 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

If the cephfs-mirror daemon fails to connect to the cluster, or hits an error accessing the local pool or mounting the file system, then no mirroring action should be in progress. Asserting the code below directly in Mirror::handle_enable_mirroring() would therefore fail in such scenarios:
[...]

Are you sure that this crash is seen when the mirror daemon fails to connect to a cluster (or mount the file system)? Are you able to reproduce this with just fs_a -> fs_b replication (i.e., not using a fan-out setup)?

Yes

A good explanation from Venky is that when a file system is removed while the mirror daemon is still active on it, the watcher object gets deleted and, at times, the mirror daemon is blocked on a libcephfs call, leading to issues.

This results in backtraces like the following:

No. The issue you quote above is a separate problem and is not related to this. `action_in_progress` is set when there is a pending action (enable/disable mirroring). The cephfs-mirror daemon maintains a queue of action contexts per file system:

std::list<Context *> action_ctxs;

and `action_in_progress` is set in Mirror::enable_mirroring() and Mirror::disable_mirroring(), signifying that a mirror action is in progress; this is checked in Mirror::update_fs_mirrors(), which completes contexts from `action_ctxs`.

[...]

I hit this while using vstart_runner to run tests locally, with two file systems already running under the cephfs-mirror daemon.

Steps:
1) Run the cephfs-mirror daemon in the foreground.
2) Use a vstart cluster and create an additional fs; enable mirroring on them and set a peer, in my case a->b.
3) Run any mirroring test and the above backtrace will show up in the logs.

I think the root of this crash is

std::map<Filesystem, MirrorAction> m_mirror_actions;

Since the key is just a `Filesystem`, when you run a fan-out setup the mirror daemon uses the same key (`Filesystem`) for which there is an already existing entry and pending actions.

I haven't run a fan-out setup locally yet.

Actions #3

Updated by Venky Shankar 10 months ago

  • Pull request ID set to 52524

(discussion continued on the PR)

Actions #4

Updated by Dhairya Parmar 9 months ago

  • Subject changed from cephfs-mirror: assert action_in_progress in Mirror::handle_enable_mirroring() only if initialisation was successful to cephfs-mirror: do not run concurrent C_RestartMirroring context
  • Status changed from New to Fix Under Review
  • Assignee changed from Dhairya Parmar to Venky Shankar
  • Severity changed from 3 - minor to 2 - major
  • Pull request ID changed from 52524 to 52747
  • Labels (FS) crash added

Quick summary:

After digging deeper, the issue turned out to be much bigger than anticipated; kudos to Venky for figuring it out and pushing the required patch.

This assertion fails:

ceph_assert(mirror_action.action_in_progress);

It's a bit hard to reproduce and involves performing some manual steps plus running a QA test case using vstart_runner:

1) ./bin/cephfs-mirror --id mirror --cluster ceph -f --debug_cephfs-mirror=20
2) ./bin/ceph fs volume create a ; ./bin/ceph fs volume create b ; ./bin/ceph fs volume create c ; ./bin/ceph fs volume create d ; ./bin/ceph fs volume create e
3) ./bin/ceph mgr module enable mirroring 
4) ./bin/ceph fs snapshot mirror enable a ; ./bin/ceph fs snapshot mirror enable b; ./bin/ceph fs snapshot mirror enable c; ./bin/ceph fs snapshot mirror enable d; ./bin/ceph fs snapshot mirror enable e 
5) ./bin/ceph fs snapshot mirror peer_add a client.mirror_remote@ceph b; ./bin/ceph fs snapshot mirror peer_add c client.mirror_remote@ceph d
6) source ../../venv/bin/activate ; python ../qa/tasks/vstart_runner.py tasks.cephfs.test_mirroring.TestMirroring

and generates a backtrace:

In function 'void cephfs::mirror::Mirror::handle_enable_mirroring(const cephfs::mirror::Filesystem&, const Peers&, int)' thread 7f48d02e86c0 time 2023-07-18T16:01:47.895508+0530
/home/dparmar/CephRepo0/ceph/src/tools/cephfs_mirror/Mirror.cc: 322: FAILED ceph_assert(mirror_action.action_in_progress)

 ceph version 18.0.0-4983-gfb339521a7f (fb339521a7f6895ed891b5b56de641e8085e0cab) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x125) [0x7f48d59acb0f]
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7f48d59acd16]
 3: (cephfs::mirror::Mirror::handle_enable_mirroring(cephfs::mirror::Filesystem const&, std::set<Peer, std::less<Peer>, std::allocator<Peer> > const&, int)+0x168) [0x55a621b6e6fa]
 4: (cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring(int)+0x17) [0x55a621b770bd]
 5: (cephfs::mirror::C_CallbackAdapter<cephfs::mirror::Mirror::C_RestartMirroring, &cephfs::mirror::Mirror::C_RestartMirroring::handle_enable_mirroring>::finish(int)+0xd) [0x55a621b770d5]
 6: (Context::complete(int)+0x9) [0x55a621b71317]
 7: (ContextWQ::process(Context*)+0x69) [0x55a621b7550f]
 8: (ThreadPool::PointerWQ<Context>::_void_process(void*, ThreadPool::TPHandle&)+0xa) [0x55a621b71496]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5a3) [0x7f48d59a0379]
 10: (ThreadPool::WorkThread::entry()+0x11) [0x7f48d59a2af3]
 11: (Thread::entry_wrapper()+0x3f) [0x7f48d5990229]
 12: (Thread::_entry_func(void*)+0x9) [0x7f48d5990241]
 13: /lib64/libc.so.6(+0x8b12d) [0x7f48d46ae12d]
 14: /lib64/libc.so.6(+0x10cbc0) [0x7f48d472fbc0]

This happens when the mirror daemon hits failures; in this case, CephFSTestCase's delete_all_filesystems() runs as part of the QA test case in step 6, which tears the file systems down abruptly and causes the mirror daemon to fail. That triggers concurrent C_RestartMirroring contexts which modify mirror_action.action_in_progress on the fly, making the assertion fail and crashing the daemon. This is now fixed in the PR Venky raised, which disallows running concurrent C_RestartMirroring contexts.
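
Conceptually, the fix guards against scheduling a second restart while one is already in flight, along these lines (a rough sketch of the idea only, not the code from the PR; queue_restart(), restart_in_progress and schedule_restart_context() are hypothetical names):

// rough sketch of the idea behind the fix; names are hypothetical
void Mirror::queue_restart(const Filesystem &filesystem) {
  std::scoped_lock locker(m_lock);
  auto &mirror_action = m_mirror_actions.at(filesystem);
  if (mirror_action.restart_in_progress) {
    // a C_RestartMirroring context is already queued or running -- do not start another
    return;
  }
  mirror_action.restart_in_progress = true;
  // the single restart context clears the flag once it has finished
  schedule_restart_context(filesystem);
}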

Actions #5

Updated by Venky Shankar 7 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef,quincy,pacific
Actions #6

Updated by Backport Bot 7 months ago

  • Copied to Backport #62948: quincy: cephfs-mirror: do not run concurrent C_RestartMirroring context added
Actions #7

Updated by Backport Bot 7 months ago

  • Copied to Backport #62949: pacific: cephfs-mirror: do not run concurrent C_RestartMirroring context added
Actions #8

Updated by Backport Bot 7 months ago

  • Copied to Backport #62950: reef: cephfs-mirror: do not run concurrent C_RestartMirroring context added
Actions #9

Updated by Backport Bot 7 months ago

  • Tags set to backport_processed
Actions #10

Updated by Konstantin Shalygin 6 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100