Project

General

Profile

Bug #51964

qa: test_cephfs_mirror_restart_sync_on_blocklist failure

Added by Patrick Donnelly over 2 years ago. Updated about 1 month ago.

Status:
In Progress
Priority:
Low
Assignee:
Category:
Testing
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,quincy.reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs-mirror, qa-suite
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-07-29T07:47:34.804 INFO:tasks.cephfs_test_runner:======================================================================
2021-07-29T07:47:34.804 INFO:tasks.cephfs_test_runner:ERROR: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring)
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-07-29T07:47:34.805 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 627, in test_cephfs_mirror_restart_sync_on_blocklist
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner:    "client.mirror_remote@ceph", '/d0', 'snap0', expected_snap_count=1)
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 142, in check_peer_status
2021-07-29T07:47:34.806 INFO:tasks.cephfs_test_runner:    f'{fs_name}@{fs_id}', peer_uuid)
2021-07-29T07:47:34.807 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_c098ff00fff39aa861a531e01a5abb127a724622/qa/tasks/cephfs/test_mirroring.py", line 231, in mirror_daemon_command
2021-07-29T07:47:34.807 INFO:tasks.cephfs_test_runner:    check_status=True, label=cmd_label)
2021-07-29T07:47:34.808 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/remote.py", line 509, in run
2021-07-29T07:47:34.808 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2021-07-29T07:47:34.809 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 455, in run
2021-07-29T07:47:34.809 INFO:tasks.cephfs_test_runner:    r.wait()
2021-07-29T07:47:34.810 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 161, in wait
2021-07-29T07:47:34.811 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2021-07-29T07:47:34.811 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_73aa7e3b960c7ffac669297b6aa86606265edd1b/teuthology/orchestra/run.py", line 183, in _raise_for_status
2021-07-29T07:47:34.812 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2021-07-29T07:47:34.812 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed (peer status for fs: cephfs) on smithi042 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror peer status cephfs@30 96c863b2-6936-4726-a484-146c9382a70b'

From: /ceph/teuthology-archive/pdonnell-2021-07-29_05:54:43-fs-wip-pdonnell-testing-20210729.025313-distro-basic-smithi/6300469/teuthology.log


Related issues

Duplicated by Ceph - Bug #58656: qa: Test failure: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring) Duplicate

History

#1 Updated by Patrick Donnelly over 2 years ago

/ceph/teuthology-archive/pdonnell-2021-11-04_15:43:53-fs-wip-pdonnell-testing-20211103.023355-distro-basic-smithi/6485100/teuthology.log

#2 Updated by Kotresh Hiremath Ravishankar over 2 years ago

Also seen in this run

http://pulpito.front.sepia.ceph.com/yuriw-2021-11-08_15:19:37-fs-wip-yuri2-testing-2021-11-06-1322-pacific-distro-basic-smithi/6491186/

/ceph/teuthology-archive/yuriw-2021-11-08_15:19:37-fs-wip-yuri2-testing-2021-11-06-1322-pacific-distro-basic-smithi/6491186/teuthology.log

#3 Updated by Venky Shankar over 2 years ago

I'll take a look at the failure soon.

#4 Updated by Venky Shankar about 2 years ago

The mirror daemon is stuck at mounting the file system::

021-12-22T16:35:57.556+0000 10697700 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2021-12-22T16:35:57.894+0000 10697700 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.164
2021-12-22T16:35:58.029+0000 10697700 -1 asok(0xa0c6050) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2021-12-22T16:35:58.202+0000 10697700 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2021-12-22T16:35:58.224+0000 10697700 20 cephfs::mirror::Utils mount: filesystem={fscid=30, fs_name=cephfs}
2021-12-22T16:35:58.558+0000 14ea0700 20 cephfs::mirror::ServiceDaemon: 0xfc08d10 update_status: 1 filesystem(s)
2021-12-22T16:35:58.610+0000 10e98700 20 cephfs::mirror::Mirror update_fs_mirrors

... ceph_mount() never returned - will rerun with "debug client = 20".

#5 Updated by Venky Shankar about 2 years ago

Unable to hit this consistently: https://pulpito.ceph.com/vshankar-2022-01-05_05:43:45-fs-wip-vshankar-testing-20220103-142738-testing-default-smithi/

We should probably include "debug client: 20" for cephfs-mirror tests and so that we have debug logs to look at when we hit this again.

#6 Updated by Patrick Donnelly over 1 year ago

  • Target version deleted (v17.0.0)

#10 Updated by Venky Shankar over 1 year ago

  • Category set to Testing
  • Priority changed from High to Low
  • Target version set to v18.0.0
  • Backport changed from pacific to pacific,quincy
  • Labels (FS) qa added

Lowering priority -- this is an issue with the test case rather than a bug in cephfs-mirror.

#11 Updated by Laura Flores 12 months ago

  • Duplicated by Bug #58656: qa: Test failure: test_cephfs_mirror_restart_sync_on_blocklist (tasks.cephfs.test_mirroring.TestMirroring) added

#12 Updated by Laura Flores 12 months ago

  • Tags set to test-failure

#13 Updated by Laura Flores 11 months ago

/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7221904

#14 Updated by Laura Flores 11 months ago

/a/yuriw-2023-03-14_20:10:47-rados-wip-yuri-testing-2023-03-14-0714-reef-distro-default-smithi/7206976

#16 Updated by Laura Flores 11 months ago

/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227398

#20 Updated by Laura Flores 9 months ago

  • Backport changed from pacific,quincy to pacific,quincy.reef

#22 Updated by Venky Shankar 7 months ago

  • Assignee changed from Venky Shankar to Jos Collin

Jos, please take this one.

#23 Updated by Jos Collin 7 months ago

  • Status changed from New to In Progress

#24 Updated by Jos Collin 7 months ago

@Venky:

This bug couldn't be reproduced on main with consecutive runs of test_cephfs_mirror_restart_sync_on_blocklist.

10 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-31_05:17:33-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/

100 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-07_09:13:25-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/

Couldn't reproduce it on a full tasks.cephfs.test_mirroring.TestMirroring run (25 runs consecutively):
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-08_02:10:15-fs:mirror-main-distro-default-smithi/

#25 Updated by Venky Shankar 7 months ago

Jos Collin wrote:

@Venky:

This bug couldn't be reproduced on main with consecutive runs of test_cephfs_mirror_restart_sync_on_blocklist.

10 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-07-31_05:17:33-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/

100 consecutive runs passed:
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-07_09:13:25-fs:mirror-wip-cephfs-mirror-fix_main-distro-default-smithi/

Couldn't reproduce it on a full tasks.cephfs.test_mirroring.TestMirroring run (25 runs consecutively):
http://pulpito.front.sepia.ceph.com/jcollin-2023-08-08_02:10:15-fs:mirror-main-distro-default-smithi/

The failures are intermittent. The latest failed test run is - https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/

We should be able to get some hints from this run.

#26 Updated by Jos Collin 7 months ago

In the run https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/, as talked over gchat, please check ceph-client.mirror.95060.log file. After the client blocklisted, the filesystem (fscid=12, fs_name=cephfs) got removed showing "update_status: 0 filesystem(s)".

 1538 2023-07-26T05:53:49.469+0000 8cd4640  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=12, fs_name=cephfs}]           
 1539 2023-07-26T05:53:49.469+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 remove_filesystem: fscid=12                                                         
 1540 2023-07-26T05:53:49.473+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status                                                              
 1541 2023-07-26T05:53:49.475+0000 8cd4640 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=12, fs_name=cephfs}                                             
 1542 2023-07-26T05:53:49.475+0000 8cd4640  5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}                                                
 1543 2023-07-26T05:53:49.696+0000 189b1640 20 cephfs::mirror::Watcher handle_notify: notify_id=652835029106, handle=326732576, notifier_id=6609                           
 1544 2023-07-26T05:53:49.696+0000 189b1640 20 cephfs::mirror::MirrorWatcher handle_notify                                                                                 
 1545 2023-07-26T05:53:50.473+0000 110e4640 20 cephfs::mirror::ServiceDaemon: 0x8fe9d50 update_status: 0 filesystem(s) 

This is the weird thing noticed.

Then after "scheduling fs mirror update", the filesystem `cephfs` comes back and mirroring enabled with a new fscid:

 1817 2023-07-26T05:57:02.570+0000 8cd4640  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[{fscid=14, fs_name=cephfs}], mirroring_disabled=[]           
 1818 2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_filesystem: fscid=14, fs_name=cephfs                                            
 1819 2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status                                                              
 1820 2023-07-26T05:57:02.570+0000 8cd4640 10 cephfs::mirror::Mirror mirroring_enabled: filesystem={fscid=14, fs_name=cephfs}, pool_id=29                                  
 1821 2023-07-26T05:57:02.570+0000 8cd4640  5 cephfs::mirror::ClusterWatcher handle_fsmap: peers added={}, peers removed={}                                                
 1822 2023-07-26T05:57:02.925+0000 d0dc640 20 cephfs::mirror::Mirror update_fs_mirrors                                                                                     
 1823 2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=14, fs_name=cephfs}                            
 1824 2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_or_update_fs_attribute: fscid=14                                                
 1825 2023-07-26T05:57:02.925+0000 d0dc640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status                                                              
 1826 2023-07-26T05:57:02.925+0000 d0dc640 20 cephfs::mirror::FSMirror init

The test doesn't seem passing `check_peer_status()`.

#27 Updated by Venky Shankar 7 months ago

Hi Jos,

Jos Collin wrote:

In the run https://pulpito.ceph.com/vshankar-2023-07-26_04:54:56-fs-wip-vshankar-testing-20230725.053049-testing-default-smithi/7352537/, as talked over gchat, please check ceph-client.mirror.95060.log file. After the client blocklisted, the filesystem (fscid=12, fs_name=cephfs) got removed showing "update_status: 0 filesystem(s)".

I don't think this is the issue although I suspected it.

[...]

This is the weird thing noticed.

Then after "scheduling fs mirror update", the filesystem `cephfs` comes back and mirroring enabled with a new fscid:
[...]

The test doesn't seem passing `check_peer_status()`.

The actual issue is this

2023-07-26T06:46:58.510+0000 d0dc640 20 cephfs::mirror::FSMirror shutdown_mirror_watcher
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::MirrorWatcher shutdown
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::MirrorWatcher unregister_watcher
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::Watcher unregister_watch
2023-07-26T06:46:58.513+0000 d0dc640 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x72287f0) after 2 seconds
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::MirrorWatcher handle_unregister_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::FSMirror handle_shutdown_mirror_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::FSMirror shutdown_instance_watcher
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher shutdown
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher unregister_watcher
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::Watcher unregister_watch
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher handle_unregister_watcher: r=0
2023-07-26T06:46:58.514+0000 c8db640 20 cephfs::mirror::InstanceWatcher remove_instance
2023-07-26T06:46:58.516+0000 189b1640 20 cephfs::mirror::InstanceWatcher handle_remove_instance: r=-108
2023-07-26T06:46:58.516+0000 c8db640 20 cephfs::mirror::FSMirror handle_shutdown_instance_watcher: r=-108
2023-07-26T06:46:58.516+0000 c8db640 20 cephfs::mirror::FSMirror cleanup
2023-07-26T06:46:59.595+0000 c8db640 20 cephfs::mirror::FSMirror ~FSMirror
2023-07-26T06:46:59.595+0000 c8db640 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=30, fs_name=cephfs}
2023-07-26T06:46:59.596+0000 c8db640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 add_or_update_fs_attribute: fscid=30
2023-07-26T06:46:59.596+0000 c8db640 10 cephfs::mirror::ServiceDaemon: 0x8fe9d50 schedule_update_status
2023-07-26T06:46:59.596+0000 c8db640 20 cephfs::mirror::FSMirror init
2023-07-26T06:46:59.596+0000 c8db640 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2023-07-26T06:46:59.955+0000 c8db640 10 cephfs::mirror::Utils connect: using mon addr=172.21.15.162
2023-07-26T06:47:00.080+0000 c8db640 -1 asok(0x6b581f0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2023-07-26T06:47:00.259+0000 c8db640 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2023-07-26T06:47:00.317+0000 c8db640 20 cephfs::mirror::Utils mount: filesystem={fscid=30, fs_name=cephfs}
2023-07-26T06:47:00.553+0000 d0dc640 20 cephfs::mirror::Mirror update_fs_mirrors
2023-07-26T06:47:00.604+0000 110e4640 20 cephfs::mirror::ServiceDaemon: 0x8fe9d50 update_status: 1 filesystem(s)

Post blocklisting, the mirror daemon tried to reinitialize and got stuck in mounting the (local) file system. Could you check why that happened?

#29 Updated by Patrick Donnelly 5 months ago

  • Target version deleted (v18.0.0)

Also available in: Atom PDF