Bug #64711
Test failure: test_cephfs_mirror_cancel_mirroring_and_readd (tasks.cephfs.test_mirroring.TestMirroring)
Status: Open
Description
/a/vshankar-2024-03-04_08:26:39-fs-wip-vshankar-testing-20240304.042522-testing-default-smithi/7580933
Probably a racy check
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:======================================================================
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:FAIL: test_cephfs_mirror_cancel_mirroring_and_readd (tasks.cephfs.test_mirroring.TestMirroring)
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:Test adding a directory path for synchronization post removal of already added directory paths
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_717ce3cce8d0166e4e577faa8004e6f9cb4128c0/qa/tasks/cephfs/test_mirroring.py", line 1276, in test_cephfs_mirror_cancel_mirroring_and_readd
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:    self.check_peer_snap_in_progress(self.primary_fs_name, self.primary_fs_id,
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_717ce3cce8d0166e4e577faa8004e6f9cb4128c0/qa/tasks/cephfs/test_mirroring.py", line 231, in check_peer_snap_in_progress
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:    self.assertTrue('syncing' == res[dir_name]['state'])
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:AssertionError: False is not true
Updated by Jos Collin about 2 months ago
check_peer_snap_in_progress() doesn't wait for 'syncing' to appear; it just checks the state at that moment and returns. It probably ran before the replayer set the state to 'syncing'.
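For context, a rough reconstruction of the check — only the assert from line 231 of the traceback above is confirmed; get_peer_status() is a hypothetical stand-in for however the test fetches the peer-status JSON:

    # Rough reconstruction, not the exact test code: the state is read once
    # and asserted immediately, so the check fails whenever the replayer has
    # not yet flipped this directory's state to 'syncing'.
    def check_peer_snap_in_progress(self, fs_name, fs_id, peer_uuid, dir_name):
        res = self.get_peer_status(fs_name, fs_id, peer_uuid)  # momentary read
        self.assertTrue('syncing' == res[dir_name]['state'])   # line 231 above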
Updated by Jos Collin about 2 months ago
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:{
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:    "/d0": {
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:        "state": "idle",
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_synced": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_deleted": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_renamed": 0
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    },
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    "/d1": {
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "state": "idle",
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_synced": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_deleted": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_renamed": 0
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    },
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    "/d2": {
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "state": "syncing",
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "current_sycning_snap": {
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:            "id": 4,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:            "name": "snap0"
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        },
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_synced": 0,
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_deleted": 0,
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_renamed": 0
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:    }
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:}
2024-03-04T10:55:59.124 DEBUG:tasks.cephfs.test_mirroring:command returned={ "/d0": { "state": "idle", "snaps_synced": 0, "snaps_deleted": 0, "snaps_renamed": 0 }, "/d1": { "state": "idle", "snaps_synced": 0, "snaps_deleted": 0, "snaps_renamed": 0 }, "/d2": { "state": "syncing", "current_sycning_snap": { "id": 4, "name": "snap0" }, "snaps_synced": 0, "snaps_deleted": 0, "snaps_renamed": 0 } }
The snapshots are taken in the following order:
    # take snapshots
    log.debug('taking snapshots')
    self.mount_a.run_shell(["mkdir", "d0/.snap/snap0"])
    self.mount_a.run_shell(["mkdir", "d1/.snap/snap0"])
    self.mount_a.run_shell(["mkdir", "d2/.snap/snap0"])
Thus d2 started "syncing" first, but d0 and d1 had not started syncing yet when the test checked the status of d0 first, so check_peer_snap_in_progress asserts.
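A wait-based variant would tolerate this ordering — a minimal sketch, assuming the same hypothetical get_peer_status() helper as above:

    import time

    # Hypothetical polling variant: wait (with a timeout) for the directory
    # to reach 'syncing' rather than asserting on a single momentary read.
    def wait_for_peer_snap_in_progress(self, fs_name, fs_id, peer_uuid,
                                       dir_name, timeout=60):
        deadline = time.time() + timeout
        while time.time() < deadline:
            res = self.get_peer_status(fs_name, fs_id, peer_uuid)
            if res.get(dir_name, {}).get('state') == 'syncing':
                return
            time.sleep(2)
        self.fail(f"{dir_name} never reached state='syncing' within {timeout}s")

Note that a fast sync could pass through 'syncing' between polls, so a robust check might also accept that snaps_synced has incremented.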
Updated by Jos Collin about 2 months ago
- File unnamed.png added
2024-03-04T10:55:50.942+0000 2c66b640 -1 cephfs::mirror::PeerReplayer(e9c703c7-88c9-40ed-b3f8-ff678976a731) should_sync_entry: failed to stat prev entry= ./file.2: (9) Bad file descriptor
2024-03-04T10:55:52.370+0000 2ce6c640 -1 cephfs::mirror::PeerReplayer(e9c703c7-88c9-40ed-b3f8-ff678976a731) sync_snaps: failed to sync snapshots for dir_root=/d0
Notice (in the screenshot) the difference in should_sync_entry. The first call had r=-2, which is good, for syncing d2. The next call to should_sync_entry, for syncing d1, says 'Bad file descriptor'. The next one, for syncing d0, shows r=-2 again. Looks like there's a race. Should we take a lock at the start of should_sync_entry?
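The real should_sync_entry() is C++ in cephfs-mirror's PeerReplayer; this Python sketch is purely illustrative of the suggested shape, with stat_prev_entry a hypothetical callable standing in for the stat through the previous-snapshot fd that returned EBADF above:

    import errno
    import threading

    # Illustrative only: take a lock at the start so concurrent per-directory
    # replayer threads cannot invalidate each other's file descriptors
    # mid-check (the EBADF above suggests exactly that kind of race).
    sync_entry_lock = threading.Lock()

    def should_sync_entry(stat_prev_entry, path):
        with sync_entry_lock:  # serialize the racy section
            try:
                stat_prev_entry(path)
            except OSError as e:
                if e.errno == errno.ENOENT:  # r=-2: no prev entry, sync it
                    return True
                raise                        # e.g. EBADF would surface here
            # prev entry exists; the real code compares further metadata
            # before deciding, which is elided in this sketch
            return False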
Updated by Jos Collin about 2 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 56118
Updated by Venky Shankar about 1 month ago
Also seen in a Quincy run: /a/vshankar-2024-03-14_16:52:41-fs-wip-vshankar-testing1-quincy-2024-03-14-0655-quincy-testing-default-smithi/7600853