Bug #48562
open
qa: scrub - object missing on disk; some files may be lost
Added by Milind Changire over 3 years ago.
Updated 29 days ago.
Category:
fsck/damage handling
Labels (FS):
crash, qa-failure, scrub
- Priority changed from Normal to Urgent
- Target version set to v16.0.0
- Source set to Q/A
- Component(FS) MDS added
- Labels (FS) qa-failure added
- Status changed from New to Triaged
- Assignee set to Milind Changire
- Target version changed from v16.0.0 to v17.0.0
- Backport set to pacific,octopus,nautilus
- Target version deleted (v17.0.0)
- Status changed from Triaged to Closed
- Priority changed from Urgent to Low
Closing this tracker for now and lowering the priority to Low.
Please reopen if this is seen again.
- Category set to fsck/damage handling
- Status changed from Closed to New
- Priority changed from Low to High
- Target version set to v20.0.0
- Backport changed from pacific,octopus,nautilus to squid,reef
/a/yuriw-2024-03-16_15:03:17-fs-wip-yuri10-testing-2024-03-15-1653-reef-distro-default-smithi/7606353
Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that the element's state is not stable and that any checks could potentially fail?
Milind Changire wrote:
Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that the element's state is not stable and that any checks could potentially fail?
Is this the underlying reason for the test failure? The projected state is an interim state (say, for an inode) that lasts until the update is journaled, after which the projection is popped. At that point (especially for an inode), the parent gets marked dirty, which scrub then checks so as not to consider the item damaged.
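To make the proposed skip rule concrete, here is an illustrative Python sketch; the real scrub lives in the C++ MDS, and the projected/dirty flags below merely stand in for the corresponding MDS state (an assumption for this sketch, not the actual implementation):

class MetaItem:
    def __init__(self, projected=False, dirty=False, parent=None):
        self.projected = projected   # has unjournaled (interim) updates
        self.dirty = dirty           # journaled but not yet flushed
        self.parent = parent

def scrub_should_check(item):
    """Return False when the on-disk state is expected to be stale."""
    if item.projected:
        # Interim state: updates not yet journaled, so any on-disk
        # comparison could spuriously report damage. Defer the check.
        return False
    if item.parent is not None and item.parent.dirty:
        # Parent marked dirty after the projection was popped; scrub
        # uses this to avoid flagging the item as damaged.
        return False
    return True

# Example: an inode with an active projection is skipped by scrub.
assert scrub_should_check(MetaItem(projected=True)) is False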
According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log. This is odd; can anybody explain this teuthology behavior?
Apart from the odd behavior mentioned above, test_health_status_after_dirfrag_repair intentionally deletes the RADOS object, which leads to the ERR log. So this might not be a cephfs failure at all.
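For context, a minimal sketch of the destructive step the test performs (the pool name and inode value here are assumptions for illustration; the real code uses the qa helpers in qa/tasks/cephfs/test_forward_scrub.py):

import subprocess

# Dirfrag objects in the cephfs metadata pool are named
# "<inode-in-hex>.<frag>", e.g. "10000000000.00000000" for the first
# fragment of a directory. Deleting one is what makes scrub report
# "object missing on disk; some files may be lost".
dir_ino = 0x10000000000                 # hypothetical inode of "testdir"
obj = "{0:x}.00000000".format(dir_ino)  # first dirfrag object name
pool = "cephfs.a.meta"                  # assumed metadata pool name

subprocess.run(["rados", "-p", pool, "rm", obj], check=True)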
BTW, where is the teuthology.log created when running the tests? Is it on the mounted cephfs volume?
Milind Changire wrote:
According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log. This is odd; can anybody explain this teuthology behavior?
/a/yuriw-2024-03-12_14:59:27-fs-wip-yuri11-testing-2024-03-11-0838-reef-distro-default-smithi/7593867 does have test_health_status_after_dirfrag_repair:
2024-03-12T19:04:01.704 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:01.719+0000 7f640abb9640 1 -- 172.21.15.92:0/3726236821 --> [v2:172.21.15.92:3300/0,v1:172.21.15.92:6789/0] -- mon_command({"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]} v 0) v1 -- 0x7f64040b3500 con 0x7f64040b1680
2024-03-12T19:04:02.031 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:02.045+0000 7f64017fa640 1 -- 172.21.15.92:0/3726236821 <== mon.0 v2:172.21.15.92:3300/0 7 ==== mon_command_ack([{"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]}]=0 v377) v1 ==== 167+0+0 (secure 0 0 0) 0x7f63fc018020 con 0x7f64040b1680
Apart from the odd behavior mentioned above, test_health_status_after_dirfrag_repair intentionally deletes the RADOS object, which leads to the ERR log. So this might not be a cephfs failure at all.
In that case, this warning needs to be added to the log ignorelist.
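For reference, teuthology ignorelisting is done via a yaml override in the suite; a minimal sketch of what such an entry could look like (the exact pattern string is an assumption here, not the actual change):

overrides:
  ceph:
    log-ignorelist:
      - object missing on disk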
BTW, where is the teuthology.log created when running the tests? Is it on the mounted cephfs volume?
I think yes.
- Status changed from New to Fix Under Review
- Assignee changed from Milind Changire to Venky Shankar
- Pull request ID set to 56699
- Labels (FS) crash added
Milind, I'm taking this one.