Bug #51964
openqa: test_cephfs_mirror_restart_sync_on_blocklist failure
0%
Description
2021-07-29T07:47:34.804 INFO:tasks.cephfs_test_runner:======================================================================
2021-07-29T07:47:34.804 INFO:tasks.cephfs_test_runner:ERROR: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring)
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 627, in test_cephfs_mirror_restart_sync_on_blocklist
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner:    "client.mirror_remote@ceph", '/d0', 'snap0', expected_snap_count=1)
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 142, in check_peer_status
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner:    f'{fs_name}@{fs_id}', peer_uuid)
2021-07-29T07:47:34.807 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 231, in mirror_daemon_command
2021-07-29T07:47:34.807 INFO:tasks.cephfs_test_runner:    check_status=True, label=cmd_label)
2021-07-29T07:47:34.808 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/remote.py", line 509, in run
2021-07-29T07:47:34.808 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2021-07-29T07:47:34.809 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 455, in run
2021-07-29T07:47:34.809 INFO:tasks.cephfs_test_runner:    r.wait()
2021-07-29T07:47:34.810 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 161, in wait
2021-07-29T07:47:34.811 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2021-07-29T07:47:34.811 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 183, in _raise_for_status
2021-07-29T07:47:34.812 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2021-07-29T07:47:34.812 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed (peer status for fs: cephfs) on smithi042 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror peer status cephfs@30 96c863b2-6936-4726-a484-146c9382a70b'
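For reference, the command that fails with status 22 (EINVAL) is the admin-socket query at the end of the traceback. Below is a minimal sketch, not the qa code itself, of how such a query could be issued from a node hosting the cephfs-mirror daemon and a transient EINVAL treated as "peer status not available yet" (the `peer_status` helper and `runner` parameter are hypothetical):

```python
import errno
import json
import subprocess

ASOK = "/var/run/ceph/cephfs-mirror.asok"  # admin socket path from the traceback

def peer_status(fs_name, fs_id, peer_uuid, runner=subprocess.run):
    """Query the mirror daemon for peer status over the admin socket.

    Returns the parsed JSON status, or None when the daemon answers with
    EINVAL (exit status 22), which is what the failed teuthology command
    reported -- typically meaning the peer is not (yet) registered.
    """
    cmd = ["ceph", "--admin-daemon", ASOK,
           "fs", "mirror", "peer", "status",
           f"{fs_name}@{fs_id}", peer_uuid]
    proc = runner(cmd, capture_output=True, text=True)
    if proc.returncode == errno.EINVAL:  # 22
        return None
    proc.check_returncode()  # raise on any other failure
    return json.loads(proc.stdout)
```

The `runner` injection point is just to make the helper testable without a live cluster; on a real node the default `subprocess.run` would execute the same command the test ran.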
From: /ceph/teuthology-archive/pdonnell-2021-07-29_05:54:43-fs-wip-pdonnell-testing-20210729.025313-distro-basic-smithi/6300469/teuthology.log
Updated by Patrick Donnelly over 2 years ago
/ceph/teuthology-archive/pdonnell-2021-11-04_15:43:53-fs-wip-pdonnell-testing-20211103.023355-distro-basic-smithi/6485100/teuthology.log
Updated by Kotresh Hiremath Ravishankar over 2 years ago
Also seen in this run
/ceph/teuthology-archive/yuriw-2021-11-08_15:19:37-fs-wip-yuri2-testing-2021-11-06-1322-pacific-distro-basic-smithi/6491186/teuthology.log
Updated by Venky Shankar over 2 years ago
The mirror daemon is stuck mounting the file system:
2021-12-22T16:35:57.556+0000 10697700 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2021-12-22T16:35:57.894+0000 10697700 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.164
2021-12-22T16:35:58.029+0000 10697700 -1 asok(0xa0c6050) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2021-12-22T16:35:58.202+0000 10697700 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2021-12-22T16:35:58.224+0000 10697700 20 cephfs::mirror::Utils mount: filesystem={fscid=30, fs_name=cephfs}
2021-12-22T16:35:58.558+0000 14ea0700 20 cephfs::mirror::ServiceDaemon: 0xfc08d10 update_status: 1 filesystem(s)
2021-12-22T16:35:58.610+0000 10e98700 20 cephfs::mirror::Mirror update_fs_mirrors
... ceph_mount() never returned - will rerun with "debug client = 20".
Updated by Venky Shankar over 2 years ago
Unable to hit this consistently: https://pulpito.ceph.com/vshankar-2022-01-05_05:43:45-fs-wip-vshankar-testing-20220103-142738-testing-default-smithi/
We should probably include "debug client: 20" for cephfs-mirror tests so that we have debug logs to look at when we hit this again.
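A teuthology override fragment along these lines would bump client debugging for the mirror daemon's client instances (a sketch only; the exact placement depends on the suite yaml):

```yaml
overrides:
  ceph:
    conf:
      client:
        # raise client-side debug logging as suggested above
        debug client: 20
```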
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Updated by Rishabh Dave over 1 year ago
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Updated by Venky Shankar over 1 year ago
- Category set to Testing
- Priority changed from High to Low
- Target version set to v18.0.0
- Backport changed from pacific to pacific,quincy
- Labels (FS) qa added
Lowering priority -- this is an issue with the test case rather than a bug in cephfs-mirror.
Updated by Laura Flores about 1 year ago
- Has duplicate Bug #58656: qa: Test failure: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring) added
Updated by Laura Flores about 1 year ago
- Tags set to test-failure
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7221904
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7206976
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227398
Updated by Laura Flores 11 months ago
- Backport changed from pacific,quincy to pacific,quincy,reef
Updated by Venky Shankar 9 months ago
- Assignee changed from Venky Shankar to Jos Collin
Jos, please take this one.
Updated by Jos Collin 9 months ago
@Venky:
This bug couldn't be reproduced on main with consecutive runs of test_cephfs_mirror_restart_sync_on_blocklist.
10 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-31_05:17:33-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
100 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-07_09:13:25-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
Couldn't reproduce it on a full tasks.cephfs.test_mirroring.TestMirroring run (25 runs consecutively):
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-08_02:10:15-fs:mirror-main-distro-default-smithi/
Updated by Venky Shankar 9 months ago
Jos Collin wrote:
@Venky:
This bug couldn't be reproduced on main with consecutive runs of test_cephfs_mirror_restart_sync_on_blocklist.
10 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-31_05:17:33-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
100 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-07_09:13:25-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/
Couldn't reproduce it on a full tasks.cephfs.test_mirroring.TestMirroring run (25 runs consecutively):
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-08_02:10:15-fs:mirror-main-distro-default-smithi/
The failures are intermittent. The latest failed test run is https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/
We should be able to get some hints from this run.
Updated by Jos Collin 9 months ago
In the run https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/, as discussed over gchat, please check the ceph-client.mirror.95060.log file. After the client was blocklisted, the filesystem (fscid=12, fs_name=cephfs) got removed, showing "update_status: 0 filesystem(s)".
2023-07-26T05:53:49.469+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=12, fs_name=cephfs}]
2023-07-26T05:53:49.469+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 remove_filesystem: fscid=12
2023-07-26T05:53:49.473+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
2023-07-26T05:53:49.475+0000 8cd4640 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=12, fs_name=cephfs}
2023-07-26T05:53:49.475+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
2023-07-26T05:53:49.696+0000 189b1640 20 cephfs::mirror::Watcher handle_notify: notify_id=652835029106, handle=326732576, notifier_id=6609
2023-07-26T05:53:49.696+0000 189b1640 20 cephfs::mirror::MirrorWatcher handle_notify
2023-07-26T05:53:50.473+0000 110e4640 20 cephfs::mirror::ServiceDaemon: 0x8fe9d50 update_status: 0 filesystem(s)
This is the weird thing noticed.
Then after "scheduling fs mirror update", the filesystem `cephfs` comes back and mirroring is enabled with a new fscid:
2023-07-26T05:57:02.570+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[{fscid=14, fs_name=cephfs}], mirroring_disabled=[]
2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_filesystem: fscid=14, fs_name=cephfs
2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::Mirror mirroring_enabled: filesystem={fscid=14, fs_name=cephfs}, pool_id=29
2023-07-26T05:57:02.570+0000 8cd4640 5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}
2023-07-26T05:57:02.925+0000 d0dc640 20 cephfs::mirror::Mirror update_fs_mirrors
2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=14, fs_name=cephfs}
2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_or_update_fs_attribute: fscid=14
2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
2023-07-26T05:57:02.925+0000 d0dc640 20 cephfs::mirror::FSMirror init
The test doesn't seem to pass `check_peer_status()`.
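If the failure is a timing window (the daemon tearing down and reinitializing the FSMirror around the blocklist), one test-side mitigation is to poll the peer status until a deadline rather than query it once. A hypothetical sketch of such a poll loop (the `wait_for` helper and its parameters are illustrative, not the existing qa helpers; the clock/sleep injection exists only to make the loop testable):

```python
import time

def wait_for(predicate, timeout=120, interval=5,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns a truthy value or `timeout`
    seconds elapse. The qa test could wrap its check_peer_status()
    call in something like this so a transient EINVAL during the
    mirror daemon's post-blocklist reinit does not fail the run.
    """
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)
```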
Updated by Venky Shankar 9 months ago
Hi Jos,
Jos Collin wrote:
In the run https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/, as discussed over gchat, please check the ceph-client.mirror.95060.log file. After the client was blocklisted, the filesystem (fscid=12, fs_name=cephfs) got removed, showing "update_status: 0 filesystem(s)".
I don't think this is the issue although I suspected it.
[...]
This is the weird thing noticed.
Then after "scheduling fs mirror update", the filesystem `cephfs` comes back and mirroring enabled with a new fscid:
[...]
The test doesn't seem to pass `check_peer_status()`.
The actual issue is this:
2023-07-26T06:46:58.510+0000 d0dc640 20 cephfs::mirror::FSMirror shutdown_mirror_watcher
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::MirrorWatcher shutdown
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::MirrorWatcher unregister_watcher
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::Watcher unregister_watch
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x72287f0) after 2 seconds
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::MirrorWatcher handle_unregister_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::FSMirror handle_shutdown_mirror_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::FSMirror shutdown_instance_watcher
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher shutdown
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher unregister_watcher
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::Watcher unregister_watch
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher handle_unregister_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher remove_instance
2023-07-26T06:46:58.516+0000 189b1640 20 cephfs::mirror::InstanceWatcher handle_remove_instance: r=-108
2023-07-26T06:46:58.516+0000 c8db640 20 cephfs::mirror::FSMirror handle_shutdown_instance_watcher: r=-108
2023-07-26T06:46:58.516+0000 c8db640 20 cephfs::mirror::FSMirror cleanup
2023-07-26T06:46:59.595+0000 c8db640 20 cephfs::mirror::FSMirror ~FSMirror
2023-07-26T06:46:59.595+0000 c8db640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=30, fs_name=cephfs}
2023-07-26T06:46:59.596+0000 c8db640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_or_update_fs_attribute: fscid=30
2023-07-26T06:46:59.596+0000 c8db640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
2023-07-26T06:46:59.596+0000 c8db640 20 cephfs::mirror::FSMirror init
2023-07-26T06:46:59.596+0000 c8db640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2023-07-26T06:46:59.955+0000 c8db640 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.162
2023-07-26T06:47:00.080+0000 c8db640 -1 asok(0x6b581f0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2023-07-26T06:47:00.259+0000 c8db640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2023-07-26T06:47:00.317+0000 c8db640 20 cephfs::mirror::Utils mount: filesystem={fscid=30, fs_name=cephfs}
2023-07-26T06:47:00.553+0000 d0dc640 20 cephfs::mirror::Mirror update_fs_mirrors
2023-07-26T06:47:00.604+0000 110e4640 20 cephfs::mirror::ServiceDaemon: 0x8fe9d50 update_status: 1 filesystem(s)
Post blocklisting, the mirror daemon tried to reinitialize and got stuck mounting the (local) file system. Could you check why that happened?