Bug #57087
openqa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure
Description
Looks like the data scan has not recovered the file:
2022-08-05T01:56:30.914 INFO:teuthology.orchestra.run:Running command with timeout 900
2022-08-05T01:56:30.915 DEBUG:teuthology.orchestra.run.smithi087:> (cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c 'cat subdir/21')
2022-08-05T01:56:30.941 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-08-05T01:56:30.943 INFO:teuthology.orchestra.run.smithi087.stderr:cat: subdir/21: No such file or directory
2022-08-05T01:56:30.944 INFO:teuthology.nuke.actions:Clearing teuthology firewall rules...
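For context, this test family exercises the cephfs-data-scan disaster-recovery workflow: metadata is damaged, rebuilt with the offline tools, and then every file is checked to be readable again (the failing cat above is that final check). A minimal sketch of the offline recovery sequence, going by the cephfs-data-scan documentation; the filesystem and data pool names ("cephfs", "cephfs_data") here are illustrative, not taken from this job:

import subprocess

def run(cmd):
    # Run a command and raise on non-zero exit, loosely mirroring the
    # run_shell helpers the qa suite uses.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Offline metadata recovery sequence (the fs must be down and the journal
# handled first); "cephfs"/"cephfs_data" are placeholder names.
run(["cephfs-data-scan", "init", "--filesystem", "cephfs"])
run(["cephfs-data-scan", "scan_extents", "--filesystem", "cephfs", "cephfs_data"])
run(["cephfs-data-scan", "scan_inodes", "--filesystem", "cephfs", "cephfs_data"])
run(["cephfs-data-scan", "scan_links", "--filesystem", "cephfs"])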
Traceback
2022-08-05T01:56:40.301 INFO:tasks.cephfs_test_runner:test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) ... ERROR
2022-08-05T01:56:40.302 INFO:tasks.cephfs_test_runner:
2022-08-05T01:56:40.302 INFO:tasks.cephfs_test_runner:======================================================================
2022-08-05T01:56:40.303 INFO:tasks.cephfs_test_runner:ERROR: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan)
2022-08-05T01:56:40.303 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2022-08-05T01:56:40.304 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-08-05T01:56:40.304 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/test_data_scan.py", line 495, in test_fragmented_injection
2022-08-05T01:56:40.306 INFO:tasks.cephfs_test_runner:    out = self.mount_a.run_shell_payload(f"cat subdir/{victim_dentry}", sudo=True).stdout.getvalue().strip()
2022-08-05T01:56:40.306 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/mount.py", line 675, in run_shell_payload
2022-08-05T01:56:40.307 INFO:tasks.cephfs_test_runner:    return self.run_shell(["bash", "-c", Raw(f"'{payload}'")], **kwargs)
2022-08-05T01:56:40.308 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_cee46c3e5b9015d27983e08f8ebddfb22d21d78e/qa/tasks/cephfs/mount.py", line 672, in run_shell
2022-08-05T01:56:40.309 INFO:tasks.cephfs_test_runner:    return self.client_remote.run(args=args, cwd=cwd, timeout=timeout, stdout=stdout, stderr=stderr, **kwargs)
2022-08-05T01:56:40.309 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/remote.py", line 510, in run
2022-08-05T01:56:40.310 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2022-08-05T01:56:40.310 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 455, in run
2022-08-05T01:56:40.311 INFO:tasks.cephfs_test_runner:    r.wait()
2022-08-05T01:56:40.311 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 161, in wait
2022-08-05T01:56:40.312 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2022-08-05T01:56:40.312 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_9e4bf63f00ab3f82a26635c2779f4c3f1b73fb53/teuthology/orchestra/run.py", line 183, in _raise_for_status
2022-08-05T01:56:40.313 INFO:tasks.cephfs_test_runner:    node=self.hostname, label=self.label
2022-08-05T01:56:40.313 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi087 with status 1: "(cd /home/ubuntu/cephtest/mnt.0 && exec sudo bash -c 'cat subdir/21')"
Updated by Kotresh Hiremath Ravishankar over 1 year ago
- Description updated (diff)
Updated by Kotresh Hiremath Ravishankar over 1 year ago
Note that the test passed on the re-run:
https://pulpito.ceph.com/yuriw-2022-08-10_20:34:29-fs-wip-yuri6-testing-2022-08-04-0617-pacific-distro-default-smithi/6966101
The PRs included in this build for testing are as follows:
1. https://github.com/ceph/ceph/pull/46901 - pacific: qa/cephfs: fallback to older way of get_op_read_count
2. https://github.com/ceph/ceph/pull/47282 - pacific: mds: standby-replay daemon always removed in MDSMonitor::prepare_beacon
3. https://github.com/ceph/ceph/pull/47307 - pacific: mgr/telemetry: reset health warning after re-opting-in
4. https://github.com/ceph/ceph/pull/47369 - pacific: mgr/volumes: Fix subvolume creation in FIPS enabled system.
Updated by Venky Shankar over 1 year ago
- Category set to Administration/Usability
- Status changed from New to Triaged
- Assignee set to Milind Changire
- Target version set to v18.0.0
Updated by Venky Shankar over 1 year ago
- Severity changed from 3 - minor to 2 - major
- Component(FS) MDS added
- Labels (FS) deleted (qa, qa-failure)
Updated by Venky Shankar over 1 year ago
- Related to Bug #58221: pacific: Test failure: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) added
Updated by Venky Shankar over 1 year ago
- Related to Bug #55537: mds: crash during fs:upgrade test added
Updated by Venky Shankar about 1 year ago
Seen in main branch integration test: https://pulpito.ceph.com/vshankar-2023-03-08_15:12:36-fs-wip-vshankar-testing-20230308.112059-testing-default-smithi/7197058
Milind, PTAL.
Updated by Venky Shankar 5 months ago
- Assignee changed from Milind Changire to Venky Shankar
- Target version set to v19.0.0
- Backport set to quincy,reef
Milind, I'm taking this one.
Updated by Venky Shankar 5 months ago
The interesting bit from the MDS log ./remote/smithi081/log/ceph-mds.c.log.gz:
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server frag * offset '' offset_hash 0 flags 1
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server handle_client_readdir on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200]
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.server incomplete dir contents for readdir on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200], fetching
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.cache.dir(0x10000000000) fetch on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 0x55e6119eb200]
2023-11-08T16:42:00.284+0000 7f7efc048640 10 mds.0.cache.dir(0x10000000000) auth_pin by 0x55e6119eb200 on [dir 0x10000000000 /subdir/ [2,head] auth v=304 cv=0/0 ap=1+0 state=1073741824 f(v0 99=99+0)/f(v0 m2023-11-08T16:41:16.405318+0000 99=99+0) n(v0 rc2023-11-08T16:41:16.409448+0000 b188 99=99+0) hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55e6119eb200] count now 1
According to the dir debug output, the directory is not considered fragmented (note the missing frag suffix: a dirfrag would print as something like 0x10000000000.0*). The directory, however, was fragmented.
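For reference, the 0*/1* notation is a bit-prefix: a dirfrag covers every dentry whose 32-bit hash begins with the frag's leading bits, and the unfragmented root frag * covers everything. A toy Python sketch of that containment rule (illustrative only, not the MDS's actual frag_t implementation):

def frag_contains(frag_value, frag_bits, hash32):
    # A dirfrag (value, bits) covers a 32-bit dentry hash iff the top
    # `bits` bits of the hash match those of `value`. bits == 0 is the
    # unfragmented root frag '*', which covers every hash.
    if frag_bits == 0:
        return True
    shift = 32 - frag_bits
    return (hash32 >> shift) == (frag_value >> shift)

# After one split, '0*' is (0x00000000, 1) and '1*' is (0x80000000, 1):
assert frag_contains(0x00000000, 1, 0x12345678)  # top bit 0 -> lives in 0*
assert frag_contains(0x80000000, 1, 0x9abcdef0)  # top bit 1 -> lives in 1*
assert frag_contains(0x00000000, 0, 0xffffffff)  # root '*' covers everything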
Updated by Venky Shankar 5 months ago
So what happens is: when the MDSs are stopped before the disaster recovery steps (scan_extents, etc.) can run, the MDS merges the dirfrags (possibly during shutdown), although it should not, since mds_bal_merge_size is set to 0. When the disaster recovery step runs, the omap entry is injected into the 2nd frag (1*), possibly creating the rados object too. However, the MDS has merged the fragments and the directory is no longer considered fragmented, so the readdir fetches entries from a single directory (rados) object.
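One qa-side guard against this (a sketch of a possible approach, not necessarily what the eventual fix does): assert that the directory is still fragmented immediately before the injection, so the test fails fast at the injection step rather than much later at readback. The helper below assumes the qa Filesystem object from qa/tasks/cephfs/filesystem.py and the MDS admin-socket command "dirfrag ls"; the helper name and wiring are illustrative.

def assert_still_fragmented(fs, path="/subdir"):
    # Hypothetical helper: dump the dirfrags for the victim directory via
    # the MDS admin socket and fail fast if the MDS merged them back into
    # the single root frag '*' while shutting down.
    frags = fs.mds_asok(["dirfrag", "ls", path])
    assert len(frags) > 1, f"expected {path} to remain fragmented, got: {frags}"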
Updated by Venky Shankar 5 months ago
This isn't as bad as I made it sound in #note-10 - just a qa thing. Fix coming up...
Updated by Venky Shankar 5 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 54590
Updated by Rishabh Dave 3 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 3 months ago
- Copied to Backport #64046: quincy: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added
Updated by Backport Bot 3 months ago
- Copied to Backport #64047: reef: qa: test_fragmented_injection (tasks.cephfs.test_data_scan.TestDataScan) failure added