Bug #57087
openqa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure
Description
Looks like the data scan has not recovered the file:
2022-08-05T01:56:30.914 INFO:teuthology.orchestra.run:Running command with timeout 900
2022-08-05T01:56:30.915 DEBUG:teuthology.orchestra.run.smithi087:> (cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c 'cat subdir/21')
2022-08-05T01:56:30.941 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-08-05T01:56:30.943 INFO:teuthology.orchestra.run.smithi087.stderr:cat: subdir/21: No such file or directory
2022-08-05T01:56:30.944 INFO:teuthology.nuke.actions:Clearing teuthology firewall rules...
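For context, this test family exercises the cephfs-data-scan disaster-recovery workflow: metadata is damaged, rebuilt with the offline tools, and then every file is checked to be readable again (the failing cat above is that final check). A minimal sketch of the offline recovery sequence, going by the cephfs-data-scan documentation; the filesystem and data pool names ("cephfs", "cephfs_data") here are illustrative, not taken from this job:

import subprocess

def run(cmd):
    # Run a command and raise on non-zero exit, loosely mirroring the
    # run_shell helpers the qa suite uses.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Offline metadata recovery sequence (the fs must be down and the journal
# handled first); "cephfs"/"cephfs_data" are placeholder names.
run(["cephfs-data-scan", "init", "--filesystem", "cephfs"])
run(["cephfs-data-scan", "scan_extents", "--filesystem", "cephfs", "cephfs_data"])
run(["cephfs-data-scan", "scan_inodes", "--filesystem", "cephfs", "cephfs_data"])
run(["cephfs-data-scan", "scan_links", "--filesystem", "cephfs"])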
Traceback
2022-08-05T01:56:40.301 INFO:tasks.cephfs_test_runner:test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
2022-08-05T01:56:40.302 INFO:tasks.cephfs_test_runner:
2022-08-05T01:56:40.302 INFO:tasks.cephfs_test_runner:======================================================================
2022-08-05T01:56:40.303 INFO:tasks.cephfs_test_runner:ERROR: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan)
2022-08-05T01:56:40.303 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2022-08-05T01:56:40.304 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-08-05T01:56:40.304 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/test_data_scan.py", line 495, in test_fragmented_injection
2022-08-05T01:56:40.306 INFO:tasks.cephfs_test_runner:    out = self.mount_a.run_shell_payload(f"cat subdir/{victim_dentry}", sudo=True).stdout.getvalue().strip()
2022-08-05T01:56:40.306 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/mount.py", line 675, in run_shell_payload
2022-08-05T01:56:40.307 INFO:tasks.cephfs_test_runner:    return self.run_shell(["bash", "-c", Raw(f"'{payload}'")], **kwargs)
2022-08-05T01:56:40.308 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/mount.py", line 672, in run_shell
2022-08-05T01:56:40.309 INFO:tasks.cephfs_test_runner:    return self.client_remote.run(args=args, cwd=cwd, timeout=timeout, stdout=stdout, stderr=stderr, **kwargs)
2022-08-05T01:56:40.309 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/remote.py", line 510, in run
2022-08-05T01:56:40.310 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2022-08-05T01:56:40.310 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 455, in run
2022-08-05T01:56:40.311 INFO:tasks.cephfs_test_runner:    r.wait()
2022-08-05T01:56:40.311 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 161, in wait
2022-08-05T01:56:40.312 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2022-08-05T01:56:40.312 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 183, in _raise_for_status
2022-08-05T01:56:40.313 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2022-08-05T01:56:40.313 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi087 with status 1: "(cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c 'cat subdir/21')"
Updated by Kotresh Hiremath Ravishankar over 1 year ago
- Description updated (diff)
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Note that the test passed on the re-run:
https://pulpito.ceph.com/yuriw-2022-08-10_20:34:29-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6966101
The PRs included in this build for testing are as follows:
1. https://github.com/ceph/ceph/pull/46901 - pacific: qa/cephfs: fallback to older way of get_op_read_count
2. https://github.com/ceph/ceph/pull/47282 - pacific: mds: standby-replay daemon always removed in MDSMonitor::prepare_beacon
3. https://github.com/ceph/ceph/pull/47307 - pacific: mgr/telemetry: reset health warning after re-opting-in
4. https://github.com/ceph/ceph/pull/47369 - pacific: mgr/volumes: Fix subvolume creation in FIPS enabled system.
Updated by Venky Shankar over 1 year ago
- Category set to Administration/Usability
- Status changed from New to Triaged
- Assignee set to Milind Changire
- Target version set to v18.0.0
Updated by Venky Shankar over 1 year ago
- Severity changed from 3 - minor to 2 - major
- Component(FS) MDS added
- Labels (FS) deleted (qa, qa-failure)
Updated by Venky Shankar over 1 year ago
- Related to Bug #58221: pacific: Test failure: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) added
Updated by Venky Shankar over 1 year ago
- Related to Bug #55537: mds: crash during fs:upgrade test added
Updated by Venky Shankar about 1 year ago
Seen in main branch integration test: https://pulpito.ceph.com/vshankar-2023-03-08_15:12:36-fs-wip-vshankar-testing-20230308.112059-testing-default-smithi/7197058
Milind, PTAL.
Updated by Venky Shankar 5 months ago
- Assignee changed from Milind Changire to Venky Shankar
- Target version set to v19.0.0
- Backport set to quincy,reef
Milind, I'm taking this one.
Updated by Venky Shankar 5 months ago
The interesting bit from the MDS log ./remote/smithi081/log/ceph-mds.c.log.gz:
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server frag * offset '' offset_hash 0 flags 1
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server handle_client_readdir on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200]
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server incomplete dir contents for readdir on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200], fetching
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.cache.dir(0x10000000000) fetch on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200]
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x55e6119eb200 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=1+0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55e6119eb200] count now 1
According to the dir debug output, the directory is not considered fragmented (note the missing frag suffix: a dirfrag would print as something like 0x10000000000.0*). The directory, however, was fragmented.
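For reference, the 0*/1* notation is a bit-prefix: a dirfrag covers every dentry whose 32-bit hash begins with the frag's leading bits, and the unfragmented root frag * covers everything. A toy Python sketch of that containment rule (illustrative only, not the MDS's actual frag_t implementation):

def frag_contains(frag_value, frag_bits, hash32):
    # A dirfrag (value, bits) covers a 32-bit dentry hash iff the top
    # `bits` bits of the hash match those of `value`. bits == 0 is the
    # unfragmented root frag '*', which covers every hash.
    if frag_bits == 0:
        return True
    shift = 32 - frag_bits
    return (hash32 >> shift) == (frag_value >> shift)

# After one split, '0*' is (0x00000000, 1) and '1*' is (0x80000000, 1):
assert frag_contains(0x00000000, 1, 0x12345678)  # top bit 0 -> lives in 0*
assert frag_contains(0x80000000, 1, 0x9abcdef0)  # top bit 1 -> lives in 1*
assert frag_contains(0x00000000, 0, 0xffffffff)  # root '*' covers everything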
Updated by Venky Shankar 5 months ago
So what happens is: when the MDSs are stopped before the disaster recovery steps (scan_extents, etc.) can run, the MDS merges the dirfrags (possibly during shutdown), although it should not, since mds_bal_merge_size is set to 0. When the disaster recovery step runs, the omap entry is injected into the 2nd frag (1*), possibly creating the rados object too. However, the MDS has merged the fragments and the directory is no longer considered fragmented, so the readdir fetches entries from a single directory (rados) object.
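One qa-side guard against this (a sketch of a possible approach, not necessarily what the eventual fix does): assert that the directory is still fragmented immediately before the injection, so the test fails fast at the injection step rather than much later at readback. The helper below assumes the qa Filesystem object from qa/tasks/cephfs/filesystem.py and the MDS admin-socket command "dirfrag ls"; the helper name and wiring are illustrative.

def assert_still_fragmented(fs, path="/subdir"):
    # Hypothetical helper: dump the dirfrags for the victim directory via
    # the MDS admin socket and fail fast if the MDS merged them back into
    # the single root frag '*' while shutting down.
    frags = fs.mds_asok(["dirfrag", "ls", path])
    assert len(frags) > 1, f"expected {path} to remain fragmented, got: {frags}"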
Updated by Venky Shankar 5 months ago
This isn't as bad as I made it sound in #note-10 - just a qa thing. Fix coming up...
Updated by Venky Shankar 5 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 54590
Updated by Rishabh Dave 3 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 3 months ago
- Copied to Backport #64046: quincy: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added
Updated by Backport Bot 3 months ago
- Copied to Backport #64047: reef: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added