Bug #61182

qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes times out.

Added by Kotresh Hiremath Ravishankar 9 months ago. Updated 7 months ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: fs/mirror-ha/{begin/{0-install 1-ceph 2-logrotate} cephfs-mirror/three-per-cluster clients/{mirror} cluster/{1-node} objectstore/bluestore-bitmap overrides/{whitelist_health} supported-random-distro$/{rhel_8} workloads/cephfs-mirror-ha-workunit}

Job: https://pulpito.ceph.com/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/

Log: http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/teuthology.log

The test finishes, but stopping the mirror daemon sometimes times out.
Failure Reason: reached maximum tries (50) after waiting for 300 seconds

2023-05-09T20:42:21.156 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_6c0cdcda80131534ed32c70076ff630972dadd4f/qa/tasks/ceph.py", line 1935, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/run_tasks.py", line 192, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/task/__init__.py", line 132, in __exit__
    self.end()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_6c0cdcda80131534ed32c70076ff630972dadd4f/qa/tasks/cephfs_mirror.py", line 70, in end
    mirror_daemon.stop()
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/contextutil.py", line 135, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds
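
The `MaxWhileTries` failure above comes from teuthology's bounded-wait pattern: check a condition repeatedly, sleep between attempts, and give up after a fixed number of tries (here 50 tries over 300 seconds). A minimal Python sketch of that pattern — the names and signature are illustrative, not teuthology's actual implementation:

```python
import time

class MaxWhileTries(Exception):
    """Raised when the bounded wait runs out of tries (mirrors teuthology's exception)."""

def wait_for(condition, tries=50, sleep=6.0):
    # Check the condition, sleeping between attempts; 50 tries * 6s
    # matches the "300 seconds" in the failure reason above.
    for attempt in range(1, tries + 1):
        if condition():
            return attempt
        time.sleep(sleep)
    raise MaxWhileTries(
        "reached maximum tries ({}) after waiting for {} seconds".format(
            tries, int(tries * sleep)))
```

When the condition is "the mirror daemon's process has exited" and the daemon never exits, this kind of loop is what surfaces the exception in the traceback.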

ceph-client.mirror1.42532.log - mirror daemon log in job 7331946 (59.9 KB) Jos Collin, 07/11/2023 10:09 AM


Related issues

Copied to CephFS - Backport #62176: quincy: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. In Progress
Copied to CephFS - Backport #62177: pacific: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. Resolved
Copied to CephFS - Backport #62178: reef: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. Resolved

History

#1 Updated by Kotresh Hiremath Ravishankar 9 months ago

  • Description updated (diff)

#2 Updated by Kotresh Hiremath Ravishankar 9 months ago

  • Description updated (diff)

#4 Updated by Laura Flores 9 months ago

  • Tags set to test-failure

#5 Updated by Milind Changire 9 months ago

  • Assignee set to Jos Collin

#10 Updated by Jos Collin 9 months ago

  • Status changed from New to In Progress

#12 Updated by Jos Collin 8 months ago

  • Status changed from In Progress to Can't reproduce

This doesn't happen consistently on the main and reef branches. I ran the fs:mirror-ha test suite multiple times on both branches; it sometimes fails with the exception below, but passes on the next run.

raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds

Passes here on main branch:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-06_14:22:42-fs:mirror-ha-wip-cephfs-mirror-fix_main-distro-default-smithi/

Then fails in the very next run:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_03:44:39-fs:mirror-ha-wip-cephfs-mirror-fix_main-distro-default-smithi/

Fails on the reef branch:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_01:21:59-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/

Then passed in the next run(s):
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_08:55:00-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_09:36:43-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_10:15:52-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/

So it's not consistently reproducible.

#14 Updated by Venky Shankar 8 months ago

Jos Collin wrote:

In quincy branch, this is consistently reproducible:

http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_04:22:36-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_05:19:42-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_05:21:33-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_06:09:34-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/

These 4 consecutive runs failed.
Moving this ticket to 'in progress'.

As discussed, I think the pools are removed while the mirror daemon still has a watch on objects in the pool. This could cause the watch() call to return an error that is probably not handled by the mirror daemon.

#15 Updated by Venky Shankar 8 months ago

  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific

#16 Updated by Jos Collin 8 months ago

@Venky,

As discussed attaching job [1] and the mirror daemon log, which I've been referring to.

[1] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_04:22:36-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/7331946/

#17 Updated by Venky Shankar 8 months ago

Jos Collin wrote:

@Venky,

As discussed attaching job [1] and the mirror daemon log, which I've been referring to.

[1] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_04:22:36-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/7331946/

ACK - will have a look.

#18 Updated by Rishabh Dave 8 months ago

rishabh-2023-07-06_19:46:43-fs-wip-rishabh-CephManager-in-CephFSTestCase-testing-default-smithi/7328210/

#19 Updated by Jos Collin 8 months ago

Rishabh Dave wrote:

rishabh-2023-07-06_19:46:43-fs-wip-rishabh-CephManager-in-CephFSTestCase-testing-default-smithi/7328210/

Only the quincy runs seem to fail consistently.

#20 Updated by Venky Shankar 8 months ago

Jos Collin wrote:

@Venky,

As discussed attaching job [1] and the mirror daemon log, which I've been referring to.

[1] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_04:22:36-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/7331946/

The mirror daemon (log) which did not shut down is

./remote/smithi061/log/ceph-client.mirror2.42402.log.gz

#21 Updated by Venky Shankar 8 months ago

Out of the 3 replayer threads, only two exited when the mirror daemon was shutting down:

zgrep "\: exiting" ./remote/smithi061/log/ceph-client.mirror2.42402.log.gz
2023-07-10T04:56:12.686+0000 7f677b6a9700  5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: exiting
2023-07-10T04:56:12.686+0000 7f677c6ab700  5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: exiting

The pending thread never exited, causing the shutdown to never finish.
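
The hang can be reproduced in miniature: a plain join() on a thread blocked in a call that never returns will itself never return, which is exactly what keeps the daemon's shutdown from finishing. A small Python illustration (the Event stands in for the stuck libcephfs call; this is a sketch, not the daemon's actual code):

```python
import threading

# A stand-in for the libcephfs call that never returns once the
# file system is gone: an event that is never set.
blocked = threading.Event()

replayer = threading.Thread(target=blocked.wait, daemon=True)
replayer.start()

# Shutdown code that calls replayer.join() with no timeout hangs forever.
# A bounded join returns control and reports the thread as still alive:
replayer.join(timeout=0.2)
print(replayer.is_alive())  # → True: the replayer thread never exited
```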

#22 Updated by Venky Shankar 8 months ago

The thread (7f677beaa700) was blocked in a file system call building the snapshot mapping (local vs. remote):

2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: trying to pick from 1 directories
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) pick_directory
2023-07-10T04:56:11.703+0000 7f677beaa700  5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: picked dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) register_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) try_lock_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 10 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) try_lock_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1 locked
2023-07-10T04:56:11.703+0000 7f677beaa700  5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) register_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1 registered with replayer=0x555b6b9a7340
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) sync_snaps: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) do_sync_snaps: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) build_snap_map: dir_root=/mirror/ceph_repo/ceph-qa-suite_1, snap_dir=/mirror/ceph_repo/ceph-qa-suite_1/.snap, is_remote=0
2023-07-10T04:56:12.451+0000 7f67906d3700 20 cephfs::mirror::Watcher handle_notify: notify_id=227633267391, handle=93851095683456, notifier_id=4383

PeerReplayer::build_snap_map() calls into libcephfs. What happened is that the file system was removed while the mirror daemon was making a libcephfs API call. This is fine, since the verification of the synchronized snapshots had already finished:

2023-07-10T04:55:43.696 INFO:tasks.workunit.client.2.smithi061.stderr:+ echo 'verify done!'
2023-07-10T04:55:43.696 INFO:tasks.workunit:Stopping ['fs/cephfs_mirror_ha_verify.sh'] on client.2...
2023-07-10T04:56:10.248 DEBUG:teuthology.run_tasks:Unwinding manager full_sequential_finally
2023-07-10T04:56:10.268 INFO:teuthology.task.full_sequential_finally:In full_sequential_finally, running task exec...
2023-07-10T04:56:10.269 INFO:teuthology.task.exec:Executing custom commands...
2023-07-10T04:56:10.269 INFO:teuthology.task.exec:Running commands on role mon.a host ubuntu@smithi061.front.sepia.ceph.com
2023-07-10T04:56:10.269 DEBUG:teuthology.orchestra.run.smithi061:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph config set mon mon_allow_pool_delete true'
2023-07-10T04:56:10.598 DEBUG:teuthology.orchestra.run.smithi061:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph fs volume rm dc --yes-i-really-mean-it'
2023-07-10T04:56:11.573 INFO:tasks.ceph.mon.a.smithi061.stderr:2023-07-10T04:56:11.567+0000 7f65e40b2700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-07-10T04:56:11.622 INFO:tasks.ceph.mds.b.smithi061.stdout:starting mds.b at
2023-07-10T04:56:14.595 INFO:teuthology.orchestra.run.smithi061.stdout:metadata pool: cephfs.dc.meta data pool: ['cephfs.dc.data'] removed
2023-07-10T04:56:14.624 DEBUG:teuthology.orchestra.run.smithi061:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph fs volume rm dc-backup --yes-i-really-mean-it'

We have seen this behaviour before - if a file system is removed while a client is calling a file system related API that involves talking to the MDSs, the API call never returns. We may want to handle this more gracefully in the client; I'll create a separate tracker for that.

As far as this issue is concerned, there are a couple of options:

- stop the cephfs-mirror thrasher task in cephfs-mirror-ha-workunit.yaml under full_sequential_finally section
- kill the blocked thread in cephfs-mirror daemon if the thread does not exit after it has been requested to stop

#23 Updated by Greg Farnum 8 months ago

I talked about this with Jos today and see that when cephfs_mirror_thrash.py joins the background thread, the do_thrash() function (which stops and then restarts mirror daemons) doesn't actually stop the mirror daemons before exiting once the stopping flag is set.

It seems like that's the root issue?

#24 Updated by Greg Farnum 8 months ago

Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?

But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?

Aha, qa/tasks/cephfs_mirror.py does not try to get the return value of the daemon. Does that mean it doesn't wait for the daemon to finish shutting down?

#25 Updated by Venky Shankar 8 months ago

Greg Farnum wrote:

Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?

Yes. The thrasher just does: pick some, kill, restart, until it's told to stop. It doesn't care about leaving the mirror daemons in any particular state.

But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?

This isn't the case as far as I can see from the logs. The mirror daemon which failed to shut down was waiting for a (replayer) thread to exit, but that thread was stuck in a file system call since the file system was removed underneath.

Aha, qa/tasks/cephfs_mirror.py does not try and get the return value of the daemon. Does that mean it doesn't wait for the daemon to finish shutting down?

This task spawns a single mirror daemon and is not used to test the HA setup. The snapshot mirroring tests do all kinds of validations (matching checksums, etc.). The return values aren't particularly interesting since the mirror daemon basically does:

src/tools/cephfs_mirror/main.cc: return forker.signal_exit(r);

"r" would be 0 if the mirror daemon started all fine.

#26 Updated by Jos Collin 7 months ago

Venky Shankar wrote:

Out of the 3 replayer threads, only two exited when the mirror daemon was shutting down:

[...]

The pending thread never exited, causing the shutdown to never finish.

[1] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-14_08:13:59-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/7337362
[2] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-14_08:14:24-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/7337368
I got an exception here^. All 3 replayer threads seem to have exited, but the job failed with the same error.

#27 Updated by Venky Shankar 7 months ago

Jos Collin wrote:

Venky Shankar wrote:

Out of the 3 replayer threads, only two exited when the mirror daemon was shutting down:

[...]

The pending thread never exited, causing the shutdown to never finish.

[1] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-14_08:13:59-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/7337362

$ zegrep '7fddbe350700|7fddbdb4f700|7fddbd34e700' ./remote/smithi081/log/ceph-client.mirror2.130313.log.gz

None of the 3 replayer threads exited.

#28 Updated by Venky Shankar 7 months ago

From GChat:

you could introduce a config variable to kill the threads during shutdown if they are blocked for more than, say, 5 minutes
then, enable this config when running in teuthology
so that the mirror daemon would kill off those blocked threads

This needs to be enabled for teuthology tests only, as a workaround. Normally, users would disable file system mirroring when deleting file systems, and/or we can add a guard to switch it off before removing file systems.
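
The suggested workaround amounts to a timed join during shutdown: wait a bounded interval for each replayer thread, then give up on any that remain blocked. A hedged Python sketch (the config name and helper are invented for illustration; the actual daemon is C++):

```python
import threading

# Hypothetical config knob for the suggested workaround; the name is
# invented here, and a teuthology run would set it much lower.
SHUTDOWN_TIMEOUT = 300.0  # seconds

def stop_replayers(threads, timeout=SHUTDOWN_TIMEOUT):
    """Join each replayer thread, abandoning any that stay blocked past the timeout."""
    stuck = []
    for t in threads:
        t.join(timeout)
        if t.is_alive():
            # In the daemon the blocked thread would be forcibly torn down;
            # here we simply record it and let shutdown proceed.
            stuck.append(t.name)
    return stuck
```

With this shape, shutdown always completes within roughly `timeout` per thread instead of blocking indefinitely on a thread stuck in a libcephfs call.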

#29 Updated by Greg Farnum 7 months ago

Venky Shankar wrote:

Greg Farnum wrote:

Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?

Yes. The thrasher just does: pick-some, kill, restart until its told to stop. Doesn't care about leaving the mirror daemons in any particular state.

But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?

This isn't the case as far as I can see from the logs. The mirror daemon which failed to shut down was waiting for a (replayer) thread to exit, but that thread was stuck in a file system call since the file system was removed underneath.

Isn't that because the mirror daemon is only getting shut down when the ceph task unwinds, which is after the exec call that removes the filesystem?
There's definitely nothing in the thrasher task that turns things off. I think that's what ends up generating our mirror daemons, but I'm honestly not totally sure what step is turning on the mirror daemons in this yaml construction.

#30 Updated by Venky Shankar 7 months ago

Greg Farnum wrote:

Venky Shankar wrote:

Greg Farnum wrote:

Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?

Yes. The thrasher just does: pick-some, kill, restart until its told to stop. Doesn't care about leaving the mirror daemons in any particular state.

But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?

This isn't the case as far as I can see from the logs. The mirror daemon which failed to shut down was waiting for a (replayer) thread to exit, but that thread was stuck in a file system call since the file system was removed underneath.

Isn't that because the mirror daemon is only getting shut down when the ceph task unwinds, which is after the exec call that removes the filesystem?

The file systems are removed in full_sequential_finally:

  - full_sequential_finally:
    - exec:
        mon.a:
          - ceph config set mon mon_allow_pool_delete true
          - ceph fs volume rm dc --yes-i-really-mean-it
          - ceph fs volume rm dc-backup --yes-i-really-mean-it

The mirror daemon receives the updated FSMap and then, since the file system is gone, tries to disable mirroring - see ClusterWatcher::handle_fsmap().

There's definitely nothing in the thrasher task that turns things off. I think that's what ends up generating our mirror daemons, but I'm honestly not totally sure what step is turning on the mirror daemons in this yaml construction.

Mirror daemons are run from this yaml

qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml

When the `cephfs-mirror` task unwinds, that stops the mirror daemon, and that's fine. But in one of the logs that Jos shared, `full_sequential_finally` was executed first - perhaps because the HA workunit finished and `full_sequential_finally` is in that workunit. So the order in which the tasks unwind matters, but I'm not sure if the task unwinding is done in sequence or in parallel.

#31 Updated by Venky Shankar 7 months ago

So, here is the order of tasks unwinding:

HA workunit finishes:

2023-05-09T20:24:41.465 INFO:tasks.workunit:Stopping ['fs/cephfs_mirror_ha_verify.sh'] on client.2...

Thrasher task unwinds:

2023-05-09T20:24:41.734 DEBUG:teuthology.run_tasks:Unwinding manager cephfs_mirror_thrash
2023-05-09T20:24:41.750 INFO:tasks.cephfs_mirror_thrash:joining cephfs_mirror_thrash
2023-05-09T20:24:41.750 INFO:tasks.cephfs_mirror_thrash:stat['kill'] = 26
2023-05-09T20:24:41.750 INFO:tasks.cephfs_mirror_thrash:done joining

ceph-fuse task unwinds:

2023-05-09T20:24:41.750 DEBUG:teuthology.run_tasks:Unwinding manager ceph-fuse
2023-05-09T20:24:41.766 INFO:tasks.ceph_fuse:Unmounting ceph-fuse clients...

full_sequential_finally:

2023-05-09T20:25:08.527 INFO:teuthology.task.full_sequential_finally:In full_sequential_finally, running task exec...
2023-05-09T20:25:08.527 INFO:teuthology.task.exec:Executing custom commands...
2023-05-09T20:25:08.528 INFO:teuthology.task.exec:Running commands on role mon.a host ubuntu@smithi099.front.sepia.ceph.com
2023-05-09T20:25:08.528 DEBUG:teuthology.orchestra.run.smithi099:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph config set mon mon_allow_pool_delete true'
2023-05-09T20:25:08.873 DEBUG:teuthology.orchestra.run.smithi099:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph fs volume rm dc --yes-i-really-mean-it'

And the cephfs-mirror task unwinds after the timeout

2023-05-09T20:25:16.970 DEBUG:teuthology.run_tasks:Unwinding manager cephfs-mirror

#32 Updated by Venky Shankar 7 months ago

Jos, as per https://tracker.ceph.com/issues/61182#note-31, please check if the volume deletions (and probably creation) can be moved to another existing yaml so that it does not cause the issues we are running into.

#33 Updated by Jos Collin 7 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 52600

#34 Updated by Venky Shankar 7 months ago

  • Status changed from Fix Under Review to Pending Backport

#35 Updated by Backport Bot 7 months ago

  • Copied to Backport #62176: quincy: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. added

#36 Updated by Backport Bot 7 months ago

  • Copied to Backport #62177: pacific: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. added

#37 Updated by Backport Bot 7 months ago

  • Copied to Backport #62178: reef: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. added

#38 Updated by Backport Bot 7 months ago

  • Tags set to backport_processed
