Bug #57087

qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure

Added by Kotresh Hiremath Ravishankar over 1 year ago. Updated 3 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Category: Administration/Usability
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy,reef
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, tools
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen in https://pulpito.ceph.com/yuriw-2022-08-04_20:54:08-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6959187

Looks like the data scan has not recovered the file:

2022-08-05T01:56:30.914 INFO:teuthology.orchestra.run:Running command with timeout 900
2022-08-05T01:56:30.915 DEBUG:teuthology.orchestra.run.smithi087:> (cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c 'cat subdir/21')
2022-08-05T01:56:30.941 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-08-05T01:56:30.943 INFO:teuthology.orchestra.run.smithi087.stderr:cat: subdir/21: No such file or directory
2022-08-05T01:56:30.944 INFO:teuthology.nuke.actions:Clearing teuthology firewall rules...

Traceback

2022-08-05T01:56:40.301 INFO:tasks.cephfs_test_runner:test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
2022-08-05T01:56:40.302 INFO:tasks.cephfs_test_runner:
2022-08-05T01:56:40.302 INFO:tasks.cephfs_test_runner:======================================================================
2022-08-05T01:56:40.303 INFO:tasks.cephfs_test_runner:ERROR: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan)
2022-08-05T01:56:40.303 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2022-08-05T01:56:40.304 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-08-05T01:56:40.304 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/test_data_scan.py", line 495, in test_fragmented_injection
2022-08-05T01:56:40.306 INFO:tasks.cephfs_test_runner:    out = self.mount_a.run_shell_payload(f"cat subdir/{victim_dentry}", sudo=True).stdout.getvalue().strip()
2022-08-05T01:56:40.306 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/mount.py", line 675, in run_shell_payload
2022-08-05T01:56:40.307 INFO:tasks.cephfs_test_runner:    return self.run_shell(["bash", "-c", Raw(f"'{payload}'")], **kwargs)
2022-08-05T01:56:40.308 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/mount.py", line 672, in run_shell
2022-08-05T01:56:40.309 INFO:tasks.cephfs_test_runner:    return self.client_remote.run(args=args, cwd=cwd, timeout=timeout, stdout=stdout, stderr=stderr, **kwargs)
2022-08-05T01:56:40.309 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/remote.py", line 510, in run
2022-08-05T01:56:40.310 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2022-08-05T01:56:40.310 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 455, in run
2022-08-05T01:56:40.311 INFO:tasks.cephfs_test_runner:    r.wait()
2022-08-05T01:56:40.311 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 161, in wait
2022-08-05T01:56:40.312 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2022-08-05T01:56:40.312 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 183, in _raise_for_status
2022-08-05T01:56:40.313 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2022-08-05T01:56:40.313 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi087 with status 1: "(cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c 'cat subdir/21')" 

Related issues (4: 2 open, 2 closed)

Related to CephFS - Bug #58221: pacific: Test failure: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) (Duplicate)
Related to CephFS - Bug #55537: mds: crash during fs:upgrade test (Triaged, Venky Shankar)
Copied to CephFS - Backport #64046: quincy: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure (In Progress, Venky Shankar)
Copied to CephFS - Backport #64047: reef: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure (Resolved, Venky Shankar)
Actions #1

Updated by Kotresh Hiremath Ravishankar over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Kotresh Hiremath Ravishankar over 1 year ago

Note that the test passed successfully on the re-run:
https://pulpito.ceph.com/yuriw-2022-08-10_20:34:29-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6966101

The PRs included in this build for testing are as follows:

1. https://github.com/ceph/ceph/pull/46901 - pacific: qa/cephfs: fallback to older way of get_op_read_count
2. https://github.com/ceph/ceph/pull/47282 - pacific: mds: standby-replay daemon always removed in MDSMonitor::prepare_beacon
3. https://github.com/ceph/ceph/pull/47307 - pacific: mgr/telemetry: reset health warning after re-opting-in
4. https://github.com/ceph/ceph/pull/47369 - pacific: mgr/volumes: Fix subvolume creation in FIPS enabled system.

Actions #3

Updated by Venky Shankar over 1 year ago

  • Category set to Administration/Usability
  • Status changed from New to Triaged
  • Assignee set to Milind Changire
  • Target version set to v18.0.0
Actions #4

Updated by Venky Shankar over 1 year ago

  • Severity changed from 3 - minor to 2 - major
  • Component(FS) MDS added
  • Labels (FS) deleted (qa, qa-failure)
Actions #5

Updated by Venky Shankar over 1 year ago

  • Related to Bug #58221: pacific: Test failure: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) added
Actions #6

Updated by Venky Shankar over 1 year ago

  • Related to Bug #55537: mds: crash during fs:upgrade test added
Actions #8

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)
Actions #9

Updated by Venky Shankar 5 months ago

  • Assignee changed from Milind Changire to Venky Shankar
  • Target version set to v19.0.0
  • Backport set to quincy,reef
Actions #10

Updated by Venky Shankar 5 months ago

The interesting bit from the mds log ./remote/smithi081/log/ceph-mds.c.log.gz:

2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server  frag * offset '' offset_hash 0 flags 1
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server handle_client_readdir on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200]
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server  incomplete dir contents for readdir on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200], fetching
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.cache.dir(0x10000000000) fetch on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200]
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x55e6119eb200 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=1+0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55e6119eb200] count now 1

According to the dir debug output, the directory is not considered fragmented (note the lack of a frag suffix; a fragmented dir would show something like 0x10000000000.0*). The directory, however, was in fact fragmented.
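Since each dirfrag is stored as its own object in the metadata pool (named by the directory's inode number plus a frag suffix), the merge can also be confirmed directly from rados. A minimal inspection sketch; the pool name cephfs_metadata and the exact object names are assumptions:

```shell
# List all objects belonging to dir inode 0x10000000000; a fragmented
# directory would show more than one object for this inode.
rados -p cephfs_metadata ls | grep '^10000000000\.'

# Dump the dentry omap keys of the first frag object to see which
# entries readdir would actually be served from.
rados -p cephfs_metadata listomapkeys 10000000000.00000000
```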

Actions #11

Updated by Venky Shankar 5 months ago

So what happens is: when the MDSs are stopped before the disaster recovery steps are run (scan_extents, etc.), the MDS merges the dirfrags (possibly during shutdown), although it should not, since mds_bal_merge_size is set to 0. When the disaster recovery step runs, the omap is injected into the second frag (1*), possibly creating the rados object too. But the MDS has already merged the fragments, so the directory is not considered fragmented, and readdir fetches entries from a single directory (rados) object.
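For context, the offline recovery sequence the test drives looks roughly like the following. This is a sketch of the standard cephfs-data-scan workflow, assuming a filesystem named cephfs with pools cephfs_data and cephfs_metadata; the test additionally injects a crafted entry into the second frag object before this runs:

```shell
# Offline disaster-recovery sketch; fs and pool names are assumptions.
ceph fs fail cephfs                         # take the MDSs offline first
cephfs-data-scan scan_extents cephfs_data   # pass 1: recover file sizes/mtimes
cephfs-data-scan scan_inodes cephfs_data    # pass 2: inject recovered inodes into the metadata pool
cephfs-data-scan scan_links                 # pass 3: fix link counts and dentries
ceph fs set cephfs joinable true            # bring the MDSs back
```

If the MDS has merged the dirfrags before this runs, the injection lands in a frag object that readdir never consults, which matches the failure above.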

Actions #12

Updated by Venky Shankar 5 months ago

This isn't as bad as I made it sound in #note-10 - just a qa thing. Fix coming up...

Actions #13

Updated by Venky Shankar 5 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 54590
Actions #14

Updated by Rishabh Dave 3 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #15

Updated by Backport Bot 3 months ago

  • Copied to Backport #64046: quincy: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added
Actions #16

Updated by Backport Bot 3 months ago

  • Copied to Backport #64047: reef: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added
Actions #17

Updated by Backport Bot 3 months ago

  • Tags set to backport_processed