Project

General

Profile

Actions

Bug #64711

open

Test failure: test_cephfs_mirror_cancel_mirroring_and_readd (tasks.cephfs.test_mirroring.TestMirroring)

Added by Venky Shankar about 2 months ago. Updated about 1 month ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs-mirror
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2024-03-04_08:26:39-fs-wip-vshankar-testing-20240304.042522-testing-default-smithi/7580933

Probably a racy check

2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:======================================================================
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:FAIL: test_cephfs_mirror_cancel_mirroring_and_readd (tasks.cephfs.test_mirroring.TestMirroring)
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:Test adding a directory path for synchronization post removal of already added directory paths
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2024-03-04T10:56:28.494 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_717ce3cce8d0166e4e577faa8004e6f9cb4128c0/qa/tasks/cephfs/test_mirroring.py", line 1276, in test_cephfs_mirror_cancel_mirroring_and_readd
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:    self.check_peer_snap_in_progress(self.primary_fs_name, self.primary_fs_id,
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_717ce3cce8d0166e4e577faa8004e6f9cb4128c0/qa/tasks/cephfs/test_mirroring.py", line 231, in check_peer_snap_in_progress
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:    self.assertTrue('syncing' == res[dir_name]['state'])
2024-03-04T10:56:28.495 INFO:tasks.cephfs_test_runner:AssertionError: False is not true

Files

unnamed.png (335 KB) unnamed.png Jos Collin, 03/06/2024 10:23 AM
Actions #1

Updated by Jos Collin about 2 months ago

The check_peer_snap_in_progress() doesn't wait for 'syncing' to appear. It just checks the state at the moment and returns. Probably, it returned before the replayer sets the 'syncing'.

Actions #2

Updated by Jos Collin about 2 months ago

  • Status changed from New to In Progress
Actions #3

Updated by Jos Collin about 2 months ago

2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:{
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:    "/d0": {
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:        "state": "idle",
2024-03-04T10:55:59.104 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_synced": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_deleted": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_renamed": 0
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    },
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    "/d1": {
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "state": "idle",
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_synced": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_deleted": 0,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_renamed": 0
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    },
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:    "/d2": {
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "state": "syncing",
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        "current_sycning_snap": {
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:            "id": 4,
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:            "name": "snap0" 
2024-03-04T10:55:59.105 INFO:teuthology.orchestra.run.smithi105.stdout:        },
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_synced": 0,
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_deleted": 0,
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:        "snaps_renamed": 0
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:    }
2024-03-04T10:55:59.106 INFO:teuthology.orchestra.run.smithi105.stdout:}
2024-03-04T10:55:59.124 DEBUG:tasks.cephfs.test_mirroring:command returned={
    "/d0": {
        "state": "idle",
        "snaps_synced": 0,
        "snaps_deleted": 0,
        "snaps_renamed": 0
    },
    "/d1": {
        "state": "idle",
        "snaps_synced": 0,
        "snaps_deleted": 0,
        "snaps_renamed": 0
    },
    "/d2": {
        "state": "syncing",
        "current_sycning_snap": {
            "id": 4,
            "name": "snap0" 
        },
        "snaps_synced": 0,
        "snaps_deleted": 0,
        "snaps_renamed": 0
    }
}

The snapshots are taken in the below order:

# take snapshots                                                                                                        
log.debug('taking snapshots')
self.mount_a.run_shell(["mkdir", "d0/.snap/snap0"])
self.mount_a.run_shell(["mkdir", "d1/.snap/snap0"])
self.mount_a.run_shell(["mkdir", "d2/.snap/snap0"])

Thus the d2 started "syncing" first, but the others (d0 and d1) didn't start syncing when we check the status of d0 first and check_peer_snap_in_progress asserts.

Actions #4

Updated by Jos Collin about 2 months ago

2024-03-04T10:55:50.942+0000 2c66b640 -1 cephfs::mirror::PeerReplayer(e9c703c7-88c9-40ed-b3f8-ff678976a731) should_sync_entry: \
      failed to stat prev entry= ./file.2: (9) Bad file descriptor
2024-03-04T10:55:52.370+0000 2ce6c640 -1 cephfs::mirror::PeerReplayer(e9c703c7-88c9-40ed-b3f8-ff678976a731) sync_snaps: failed \
      to sync snapshots for dir_root=/d0

Notice (the screenshot) the difference in should_sync_entry. First, it had r=-2, which is good for syncing d2. The next call to should_sync_entry for syncing d1, says 'Bad file descriptor'. The next one for syncing d0 shows r=-2. Looks like there's a race. Should we take a lock at the start of should_sync_entry?

Actions #5

Updated by Jos Collin about 2 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 56118
Actions #6

Updated by Venky Shankar about 1 month ago

Also seen in a Quincy run: /a/vshankar-2024-03-14_16:52:41-fs-wip-vshankar-testing1-quincy-2024-03-14-0655-quincy-testing-default-smithi/7600853

Actions

Also available in: Atom PDF