Bug #61182
Status: Closed
qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes times out.
Description
fs/mirror-ha/{begin/{0-install 1-ceph 2-logrotate} cephfs-mirror/three-per-cluster clients/{mirror} cluster/{1-node} objectstore/bluestore-bitmap overrides/{whitelist_health} supported-random-distro$/{rhel_8} workloads/cephfs-mirror-ha-workunit}
The test finishes but stopping mirror daemon times out sometimes.
Failure Reason: reached maximum tries (50) after waiting for 300 seconds
2023-05-09T20:42:21.156 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_6c0cdcda80131534ed32c70076ff630972dadd4f/qa/tasks/ceph.py", line 1935, in task
    yield
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/run_tasks.py", line 192, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/task/__init__.py", line 132, in __exit__
    self.end()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_6c0cdcda80131534ed32c70076ff630972dadd4f/qa/tasks/cephfs_mirror.py", line 70, in end
    mirror_daemon.stop()
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_teuthology_69caab8ec6f44c61c991ae693e8e7b8cd859a896/teuthology/contextutil.py", line 135, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds
Updated by Kotresh Hiremath Ravishankar 12 months ago
- Description updated (diff)
Updated by Kotresh Hiremath Ravishankar 12 months ago
- Description updated (diff)
Updated by Laura Flores 12 months ago
- Tags set to test-failure
Updated by Jos Collin 10 months ago
- Status changed from In Progress to Can't reproduce
This doesn't happen consistently on the main and reef branches. I ran the fs:mirror-ha test suite multiple times on both. It sometimes fails with the exception below, but passes in the next run.
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds
Passes here on main branch:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-06_14:22:42-fs:mirror-ha-wip-cephfs-mirror-fix_main-distro-default-smithi/
Then fails in the very next run:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_03:44:39-fs:mirror-ha-wip-cephfs-mirror-fix_main-distro-default-smithi/
Fails on the reef branch:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_01:21:59-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/
Then passed in the next run(s):
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_08:55:00-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_09:36:43-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-07_10:15:52-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/
So it's not consistently reproducible.
Updated by Jos Collin 10 months ago
- Status changed from Can't reproduce to In Progress
In quincy branch, this is consistently reproducible:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_04:22:36-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_05:19:42-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_05:21:33-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_06:09:34-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
These 4 consecutive runs failed.
Moving this ticket to 'in progress'.
Updated by Venky Shankar 10 months ago
Jos Collin wrote:
In quincy branch, this is consistently reproducible:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_04:22:36-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_05:19:42-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_05:21:33-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-10_06:09:34-fs:mirror-ha-wip-cephfs-mirror-fix_quincy-distro-default-smithi/
These 4 consecutive runs failed.
Moving this ticket to 'in progress'.
As discussed, I think the pools are removed and the mirror daemon has a watch on objects in the pool. This could be causing the watch() call to return with an error that is probably not being handled by the mirror daemon.
Updated by Venky Shankar 10 months ago
- Target version set to v19.0.0
- Backport set to reef,quincy,pacific
Updated by Jos Collin 10 months ago
@Venky,
As discussed attaching job [1] and the mirror daemon log, which I've been referring to.
Updated by Venky Shankar 10 months ago
Jos Collin wrote:
@Venky,
As discussed attaching job [1] and the mirror daemon log, which I've been referring to.
ACK - will have a look.
Updated by Rishabh Dave 10 months ago
rishabh-2023-07-06_19:46:43-fs-wip-rishabh-CephManager-in-CephFSTestCase-testing-default-smithi/7328210/
Updated by Jos Collin 10 months ago
Rishabh Dave wrote:
rishabh-2023-07-06_19:46:43-fs-wip-rishabh-CephManager-in-CephFSTestCase-testing-default-smithi/7328210/
Only the quincy runs seem to be failing consistently.
Updated by Venky Shankar 10 months ago
Jos Collin wrote:
@Venky,
As discussed attaching job [1] and the mirror daemon log, which I've been referring to.
The mirror daemon (log) which did not shutdown is
./remote/smithi061/log/ceph-client.mirror2.42402.log.gz
Updated by Venky Shankar 10 months ago
Out of the 3 replayer threads, only two exited when the mirror daemon was shutting down:
zgrep "\: exiting" ./remote/smithi061/log/ceph-client.mirror2.42402.log.gz
2023-07-10T04:56:12.686+0000 7f677b6a9700 5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: exiting
2023-07-10T04:56:12.686+0000 7f677c6ab700 5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: exiting
The pending thread never exited, causing the shutdown to never finish.
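The shutdown pattern described above can be sketched in a few lines. This is an illustrative model, not the actual cephfs-mirror code (which is C++): several replayer-style worker threads are asked to stop, two exit promptly, but one is blocked in a call that never returns, so joining it stalls the whole shutdown. All names here are hypothetical.

```python
import threading

stop_event = threading.Event()       # signalled when shutdown is requested
blocked_forever = threading.Event()  # stands in for the hung file system call; never set

def replayer(blocked: bool) -> None:
    if blocked:
        blocked_forever.wait()       # simulates the thread stuck in libcephfs
    else:
        stop_event.wait()            # exits as soon as shutdown is requested

threads = [
    threading.Thread(target=replayer, args=(False,), daemon=True),
    threading.Thread(target=replayer, args=(False,), daemon=True),
    threading.Thread(target=replayer, args=(True,), daemon=True),  # the stuck one
]
for t in threads:
    t.start()

stop_event.set()                     # request shutdown
still_running = []
for t in threads:
    t.join(timeout=1.0)              # bounded join here; the real daemon waits forever
    if t.is_alive():
        still_running.append(t)

print(f"{len(still_running)} replayer thread(s) failed to exit")
```

With an unbounded `join()` (as in the real daemon) the third thread would block shutdown indefinitely, which is what the teuthology 300-second timeout then catches.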
Updated by Venky Shankar 10 months ago
The thread (7f677beaa700) was blocked on a file system call to build snap mapping (local vs remote)
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: trying to pick from 1 directories
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) pick_directory
2023-07-10T04:56:11.703+0000 7f677beaa700 5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) run: picked dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) register_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) try_lock_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 10 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) try_lock_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1 locked
2023-07-10T04:56:11.703+0000 7f677beaa700 5 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) register_directory: dir_root=/mirror/ceph_repo/ceph-qa-suite_1 registered with replayer=0x555b6b9a7340
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) sync_snaps: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) do_sync_snaps: dir_root=/mirror/ceph_repo/ceph-qa-suite_1
2023-07-10T04:56:11.703+0000 7f677beaa700 20 cephfs::mirror::PeerReplayer(e04e5833-38c2-4c17-bcee-9ec7c5429e9f) build_snap_map: dir_root=/mirror/ceph_repo/ceph-qa-suite_1, snap_dir=/mirror/ceph_repo/ceph-qa-suite_1/.snap, is_remote=0
2023-07-10T04:56:12.451+0000 7f67906d3700 20 cephfs::mirror::Watcher handle_notify: notify_id=227633267391, handle=93851095683456, notifier_id=4383
PeerReplayer::build_snap_map() calls into libcephfs. What happened is that the file system was removed while the mirror daemon was making a libcephfs call. This is fine, since the verification of the synchronized snapshots had already finished:
2023-07-10T04:55:43.696 INFO:tasks.workunit.client.2.smithi061.stderr:+ echo 'verify done!'
2023-07-10T04:55:43.696 INFO:tasks.workunit:Stopping ['fs/cephfs_mirror_ha_verify.sh'] on client.2...
2023-07-10T04:56:10.248 DEBUG:teuthology.run_tasks:Unwinding manager full_sequential_finally
2023-07-10T04:56:10.268 INFO:teuthology.task.full_sequential_finally:In full_sequential_finally, running task exec...
2023-07-10T04:56:10.269 INFO:teuthology.task.exec:Executing custom commands...
2023-07-10T04:56:10.269 INFO:teuthology.task.exec:Running commands on role mon.a host ubuntu@smithi061.front.sepia.ceph.com
2023-07-10T04:56:10.269 DEBUG:teuthology.orchestra.run.smithi061:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph config set mon mon_allow_pool_delete true'
2023-07-10T04:56:10.598 DEBUG:teuthology.orchestra.run.smithi061:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph fs volume rm dc --yes-i-really-mean-it'
2023-07-10T04:56:11.573 INFO:tasks.ceph.mon.a.smithi061.stderr:2023-07-10T04:56:11.567+0000 7f65e40b2700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-07-10T04:56:11.622 INFO:tasks.ceph.mds.b.smithi061.stdout:starting mds.b at
2023-07-10T04:56:14.595 INFO:teuthology.orchestra.run.smithi061.stdout:metadata pool: cephfs.dc.meta data pool: ['cephfs.dc.data'] removed
2023-07-10T04:56:14.624 DEBUG:teuthology.orchestra.run.smithi061:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph fs volume rm dc-backup --yes-i-really-mean-it'
We have seen this behaviour before - if a file system is removed while a client is calling a file system related API that involves talking to MDSs, then the API call never returns. Maybe we want to handle this more gracefully in the client. I'll create a separate tracker for that.
As far as this issue is concerned, there are a couple of options:
- stop the cephfs-mirror thrasher task in cephfs-mirror-ha-workunit.yaml under full_sequential_finally section
- kill the blocked thread in cephfs-mirror daemon if the thread does not exit after it has been requested to stop
Updated by Greg Farnum 10 months ago
I talked about this with Jos today and see that when cephfs_mirror_thrash.py joins the background thread, the do_thrash() function (which turns off and then restarts mirror daemons) doesn't actually stop the mirror daemons before exiting once the stopping flag is set.
It seems like that's the root issue?
Updated by Greg Farnum 10 months ago
Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?
But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?
Aha, qa/tasks/cephfs_mirror.py does not try and get the return value of the daemon. Does that mean it doesn't wait for the daemon to finish shutting down?
Updated by Venky Shankar 10 months ago
Greg Farnum wrote:
Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?
Yes. The thrasher just does: pick-some, kill, restart until it's told to stop. It doesn't care about leaving the mirror daemons in any particular state.
But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?
This isn't the case as far as I can see from the logs. The mirror daemon which failed to shut down was waiting for a (replayer) thread to exit, but that thread was stuck in a file system call since the file system was removed underneath.
Aha, qa/tasks/cephfs_mirror.py does not try and get the return value of the daemon. Does that mean it doesn't wait for the daemon to finish shutting down?
This task spawns a single mirror daemon and is not used to test the HA setup. The snapshot mirroring tests do all kinds of validations (matching checksums, etc.). The return values aren't particularly interesting since the mirror daemon basically does:
src/tools/cephfs_mirror/main.cc: return forker.signal_exit(r);
"r" would be 0 if the mirror daemon started all fine.
Updated by Jos Collin 10 months ago
Venky Shankar wrote:
Out of the 3 replayer threads, only two exited when the mirror daemon was shutting down:
[...]
The pending thread never exited, causing the shutdown to never finish.
[1] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-14_08:13:59-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/7337362
[2] http://pulpito.front.sepia.ceph.com/jcollin-2023-07-14_08:14:24-fs:mirror-ha-wip-cephfs-mirror-fix_reef-distro-default-smithi/7337368
I got an exception here^. All 3 replayer threads seem to have exited, but the job failed with the same error.
Updated by Venky Shankar 10 months ago
Jos Collin wrote:
Venky Shankar wrote:
Out of the 3 replayer threads, only two exited when the mirror daemon was shutting down:
[...]
The pending thread never exited, causing the shutdown to never finish.
$ zegrep '7fddbe350700|7fddbdb4f700|7fddbd34e700' ./remote/smithi081/log/ceph-client.mirror2.130313.log.gz
None of the 3 replayer threads exited.
Updated by Venky Shankar 10 months ago
From GChat:
You could introduce a config variable to kill the threads during shutdown if they are blocked for more than, say, 5 minutes. Then enable this config when running in teuthology, so that the mirror daemon would kill off those blocked threads.
This needs to be enabled for teuthology tests only, as a workaround. Normally, users would disable file system mirroring when deleting file systems, and/or we can add a guard to switch it off before removing file systems.
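The suggested workaround amounts to a config-gated, bounded shutdown: join each replayer thread with a deadline and give up on (or forcibly terminate) any thread that stays blocked, instead of waiting forever. Below is a minimal Python sketch of that pattern; the daemon itself is C++, and the function and config names here are hypothetical, not actual cephfs-mirror options.

```python
import threading
import time

# Hypothetical knob, analogous to the teuthology-only config variable
# suggested above.
SHUTDOWN_TIMEOUT_SEC = 5.0

def shutdown_replayers(threads, timeout=SHUTDOWN_TIMEOUT_SEC):
    """Join each thread with a bounded overall wait; return the names of
    any threads that stayed blocked.  A real daemon would then abandon or
    forcibly terminate those threads so the process can exit."""
    deadline = time.monotonic() + timeout
    blocked = []
    for t in threads:
        remaining = max(0.0, deadline - time.monotonic())
        t.join(timeout=remaining)
        if t.is_alive():
            blocked.append(t.name)
    return blocked

# One thread exits immediately; one simulates being stuck in a file
# system call (waits on an event that is never set).
ok = threading.Thread(target=lambda: None, name="replayer-0", daemon=True)
stuck = threading.Thread(target=threading.Event().wait, name="replayer-2", daemon=True)
ok.start()
stuck.start()

blocked = shutdown_replayers([ok, stuck], timeout=0.5)
print("blocked threads:", blocked)
```

Using a single shared deadline (rather than a per-thread timeout) keeps the total shutdown wait bounded regardless of how many threads are blocked.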
Updated by Greg Farnum 10 months ago
Venky Shankar wrote:
Greg Farnum wrote:
Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?
Yes. The thrasher just does: pick-some, kill, restart until it's told to stop. It doesn't care about leaving the mirror daemons in any particular state.
But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?
This isn't the case as far as I can see from the logs. The mirror daemon which failed to shut down was waiting for a (replayer) thread to exit, but that thread was stuck in a file system call since the file system was removed underneath.
Isn't that because the mirror daemon is only getting shut down when the ceph task unwinds, which is after the exec call that removes the filesystem?
There's definitely nothing in the thrasher task that turns things off. I think that's what ends up generating our mirror daemons, but I'm honestly not totally sure what step is turning on the mirror daemons in this yaml construction.
Updated by Venky Shankar 10 months ago
Greg Farnum wrote:
Venky Shankar wrote:
Greg Farnum wrote:
Oh, I guess the daemons are created via the qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml fragment. Looking at eg http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-09_19:37:41-fs-wip-yuri10-testing-2023-05-08-0849-reef-distro-default-smithi/7268721/config.yaml the ordering of the overall yaml seems okay?
Yes. The thrasher just does: pick-some, kill, restart until it's told to stop. It doesn't care about leaving the mirror daemons in any particular state.
But I worry that maybe the mirror daemons are getting disassociated somehow and not signaled to shut down at all?
This isn't the case as far as I can see from the logs. The mirror daemon which failed to shut down was waiting for a (replayer) thread to exit, but that thread was stuck in a file system call since the file system was removed underneath.
Isn't that because the mirror daemon is only getting shut down when the ceph task unwinds, which is after the exec call that removes the filesystem?
The file systems are removed in full_sequential_finally:
- full_sequential_finally:
  - exec:
      mon.a:
        - ceph config set mon mon_allow_pool_delete true
        - ceph fs volume rm dc --yes-i-really-mean-it
        - ceph fs volume rm dc-backup --yes-i-really-mean-it
The mirror daemon receives the updated FSMap and then, since the filesystem is gone, it tries to disable mirroring - ClusterWatcher::handle_fsmap().
There's definitely nothing in the thrasher task that turns things off. I think that's what ends up generating our mirror daemons, but I'm honestly not totally sure what step is turning on the mirror daemons in this yaml construction.
Mirror daemons are run from this yaml
qa/suites/fs/mirror-ha/cephfs-mirror/three-per-cluster.yaml
When the `cephfs-mirror` task unwinds, that would stop the mirror daemon and that's fine. But, in one of the logs that Jos shared, `full_sequential_finally` was executed first - maybe since the HA workunit finished and `full_sequential_finally` is in that workunit. So the order in which the tasks unwind matters, but I'm not sure if the task unwinding is done in sequence or in parallel.
Updated by Venky Shankar 10 months ago
So, here is the order of tasks unwinding:
HA workunit finishes:
2023-05-09T20:24:41.465 INFO:tasks.workunit:Stopping ['fs/cephfs_mirror_ha_verify.sh'] on client.2...
Thrasher task unwinds:
2023-05-09T20:24:41.734 DEBUG:teuthology.run_tasks:Unwinding manager cephfs_mirror_thrash
2023-05-09T20:24:41.750 INFO:tasks.cephfs_mirror_thrash:joining cephfs_mirror_thrash
2023-05-09T20:24:41.750 INFO:tasks.cephfs_mirror_thrash:stat['kill'] = 26
2023-05-09T20:24:41.750 INFO:tasks.cephfs_mirror_thrash:done joining
ceph-fuse task unwinds:
2023-05-09T20:24:41.750 DEBUG:teuthology.run_tasks:Unwinding manager ceph-fuse
2023-05-09T20:24:41.766 INFO:tasks.ceph_fuse:Unmounting ceph-fuse clients...
full_sequential_finally:
2023-05-09T20:25:08.527 INFO:teuthology.task.full_sequential_finally:In full_sequential_finally, running task exec...
2023-05-09T20:25:08.527 INFO:teuthology.task.exec:Executing custom commands...
2023-05-09T20:25:08.528 INFO:teuthology.task.exec:Running commands on role mon.a host ubuntu@smithi099.front.sepia.ceph.com
2023-05-09T20:25:08.528 DEBUG:teuthology.orchestra.run.smithi099:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph config set mon mon_allow_pool_delete true'
2023-05-09T20:25:08.873 DEBUG:teuthology.orchestra.run.smithi099:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph fs volume rm dc --yes-i-really-mean-it'
And the cephfs-mirror task unwinds after the timeout
2023-05-09T20:25:16.970 DEBUG:teuthology.run_tasks:Unwinding manager cephfs-mirror
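The unwind order in the log above follows from teuthology running tasks as nested context managers: tasks set up in list order and tear down in reverse (LIFO), so a task listed later in the yaml unwinds before one listed earlier. The sketch below demonstrates the mechanism with `contextlib.ExitStack`; the task names are illustrative and the exact yaml ordering is an assumption inferred from the log, not taken from the suite.

```python
from contextlib import contextmanager, ExitStack

unwind_order = []

@contextmanager
def task(name):
    # Setup would happen before the yield, in list order; teardown
    # (after the yield) runs in reverse order when the stack unwinds.
    yield
    unwind_order.append(name)

# Assumed setup order, chosen so the teardown matches the log above.
with ExitStack() as stack:
    for name in ["ceph", "cephfs-mirror", "full_sequential_finally",
                 "ceph-fuse", "cephfs_mirror_thrash", "workunit"]:
        stack.enter_context(task(name))

print(unwind_order)
```

This is why `full_sequential_finally` (which removes the file systems) can run before the `cephfs-mirror` task's teardown tries to stop the daemons.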
Updated by Venky Shankar 10 months ago
Jos, as per https://tracker.ceph.com/issues/61182#note-31, please check if the volume deletions (and probably creation) can be moved to another existing yaml so that it does not cause the issues we are running into.
Updated by Jos Collin 10 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 52600
Updated by Venky Shankar 9 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 9 months ago
- Copied to Backport #62176: quincy: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. added
Updated by Backport Bot 9 months ago
- Copied to Backport #62177: pacific: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. added
Updated by Backport Bot 9 months ago
- Copied to Backport #62178: reef: qa: workloads/cephfs-mirror-ha-workunit - stopping mirror daemon after the test finishes timesout. added
Updated by Jos Collin 2 months ago
- Status changed from Pending Backport to Resolved